WordStats
WordStats provides a set of methods useful for counting character and word frequencies.
Installation
Add this line to your application's Gemfile:
gem 'word_stats'
And then execute:
$ bundle
Or install it yourself as:
$ gem install word_stats
Usage
Require the WordStats gem as follows:
require 'word_stats' # On Ruby 1.8, require 'rubygems' first
text = "The quick brown fox jumps over the lazy dog."
# Note: WordStats downcases every string it processes.
WordStats provides shortcuts for single-letter frequencies, bigrams, and trigrams. The WordStats::Characters.ngrams(n,text)
method finds n-grams of any length. Each method returns a hash that maps an n-gram (as a symbol) to its count.
letter_frequencies = WordStats::Characters.letters(text)
letter_frequencies[:'u'] #=> 2
bigrams = WordStats::Characters.bigrams(text)
bigrams[:'th'] #=> 2
trigrams = WordStats::Characters.trigrams(text)
trigrams['qui'.to_sym] #=> 1
octocats = WordStats::Characters.ngrams(8,text)
octocats[:'The quic'] #=> 0 (the text is downcased, so the capitalized form is never counted)
octocats[:'the quic'] #=> 1
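To make the counting concrete, here is a minimal sketch of how character n-grams could be tallied in plain Ruby. It is not the gem's actual implementation: `char_ngrams` is a hypothetical helper, and the normalization regex (keep only lowercase letters, digits, and whitespace) is an assumption based on the downcasing and punctuation stripping described in this README.

```ruby
# Hypothetical helper, not part of WordStats: counts character n-grams.
def char_ngrams(n, text)
  counts = Hash.new(0) # unknown keys read as 0 instead of nil
  # Assumed normalization: downcase, then drop anything that is not
  # a letter, digit, or whitespace.
  normalized = text.downcase.gsub(/[^a-z0-9\s]/, '')
  # Slide a window of n characters across the string and tally each one.
  normalized.chars.each_cons(n) { |gram| counts[gram.join.to_sym] += 1 }
  counts
end

sample = "The quick brown fox jumps over the lazy dog."
char_ngrams(2, sample)[:th]          #=> 2
char_ngrams(8, sample)[:'the quic']  #=> 1
char_ngrams(8, sample)[:'The quic']  #=> 0
```

Note that this sketch treats spaces as ordinary characters, which is why an 8-gram such as 'the quic' can span two words.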
Similarly, WordStats::Words.nwords(n,text) counts single words (n = 1) and sequences of n consecutive words:
word_count = WordStats::Words.nwords(1,text)
word_count[:'the'] #=> 2
word_pairs = WordStats::Words.nwords(2,text)
word_pairs[:'quick brown'] #=> 1
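The same idea applies at the word level. The sketch below is an illustration only, not the gem's source: `word_ngrams` is a hypothetical name, and splitting on whitespace after removing punctuation is an assumed tokenization.

```ruby
# Hypothetical helper, not part of WordStats: counts sequences of n words.
def word_ngrams(n, text)
  counts = Hash.new(0) # unknown keys read as 0 instead of nil
  # Assumed tokenization: downcase, strip punctuation, split on whitespace.
  words = text.downcase.gsub(/[^a-z0-9\s]/, '').split
  # Join each run of n consecutive words with a single space and tally it.
  words.each_cons(n) { |seq| counts[seq.join(' ').to_sym] += 1 }
  counts
end

sample = "The quick brown fox jumps over the lazy dog."
word_ngrams(1, sample)[:the]            #=> 2
word_ngrams(2, sample)[:'quick brown']  #=> 1
```

Because the keys are built by joining words with one space, a lookup such as :'quick brown' must use exactly one space between words.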
Important Notes
WordStats downcases any string you pass in and strips punctuation before processing, so the keys in every result are lowercase and contain no punctuation.
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Added some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request