Groupie
Groupie is a simple way to group texts and classify new texts as being a likely member of one of the defined groups. Think of Bayesian spam filters.
The eventual goal is to have Groupie work as a sort of Bayesian spam filter, where you feed it spam and ham (non-spam) and ask it to classify new texts as spam or ham. Applications for this are e-mail spam filtering and blog spam filtering. Other sorts of categorizing might be interesting as well, such as finding suitable tags for a blog post or bookmark.
Groupie started in 2009 as a short-lived experiment and was then forgotten. In 2010 it got new features when I started using it in an RSS reader project that classified news items into "Interesting" and "Not interesting" categories.
Current functionality
Current functionality includes:
- Tokenize an input text to prepare it for grouping.
- Strip XML and HTML tags.
- Keep certain infix characters, such as period and comma.
- Add texts (as an Array of Strings) to any number of groups.
- Classify a single word to check the likelihood it belongs to each group.
- Do classification for complete (tokenized) texts.
- Pick a classification strategy to weigh repeated words differently (weigh by sum, square root or log10 of words in group).
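The features above can be sketched in plain Ruby. This is a minimal illustration and not Groupie's actual implementation: the names `sketch_tokenize`, `sketch_weight` and `sketch_classify_word` are hypothetical, and the exact tokenizing rules and scoring formula are assumptions made for the example.

```ruby
# Hypothetical sketch; none of these names are part of Groupie's API.

# Lowercase the text, replace XML/HTML tags with spaces, and split into tokens.
# A period or comma is kept only when it sits between word characters.
def sketch_tokenize(text)
  text.downcase
      .gsub(/<[^>]+>/, ' ')
      .scan(/[a-z0-9]+(?:[.,][a-z0-9]+)*/)
end

# Dampen a raw word count according to the chosen strategy.
# Note that under :log a word seen exactly once weighs 0, because log10(1) == 0.
def sketch_weight(count, strategy)
  case strategy
  when :sum  then count.to_f
  when :sqrt then Math.sqrt(count)
  when :log  then count.positive? ? Math.log10(count) : 0.0
  else raise ArgumentError, "unknown strategy: #{strategy}"
  end
end

# Score one word: each group's share of the total weight for that word.
# `groups` maps a group name to a Hash of word => count.
def sketch_classify_word(groups, word, strategy = :sum)
  weights = groups.transform_values do |counts|
    sketch_weight(counts.fetch(word, 0), strategy)
  end
  total = weights.values.sum
  return {} if total.zero?

  weights.transform_values { |weight| weight / total }
end

# Made-up counts for illustration only.
groups = {
  spam: { 'buy' => 100 },
  ham:  { 'buy' => 10 }
}

sum_scores  = sketch_classify_word(groups, 'buy', :sum)
sqrt_scores = sketch_classify_word(groups, 'buy', :sqrt)
log_scores  = sketch_classify_word(groups, 'buy', :log)
```

With these made-up counts the spam share of "buy" drops from about 0.91 (sum) to about 0.76 (square root) to about 0.67 (log10), which shows how the dampening strategies reduce the influence of heavily repeated words.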
Installation
Add this line to your application's Gemfile:
gem 'groupie'
You can also have Bundler add it for you:
bundle add groupie
And then execute:
bundle install
Or install it system-wide via:
gem install groupie
Usage
Here is an annotated console session that shows off the features available in Groupie.
# Create a classifier and add tokenized texts to named groups
groupie = Groupie.new
groupie[:spam].add(%w[this is obvious spam please buy our product])
groupie[:spam].add(%w[hello friend this is rich prince i have awesome bitcoin for you])
groupie[:ham].add(%w[you are invited to my awesome party just click the link to rsvp])
# Or tokenize raw text with Groupie.tokenize before adding it
tokens = Groupie.tokenize('Please give me your password so I can haxx0r you!')
groupie[:spam].add(tokens)
# Classify a tokenized text; the optional second argument picks the strategy
test_tokens = %w[please click the link to reset your password for our awesome product]
groupie.classify_text(test_tokens)
groupie.classify_text(test_tokens, :log)
groupie.classify_text(test_tokens, :sqrt)
groupie.classify_text(test_tokens, :unique)
# List the test tokens that are not in groupie.unique_words
test_tokens - (test_tokens & groupie.unique_words)
# Enable smart weighting, inspect the default weight, and classify again
groupie.smart_weight = true
groupie.default_weight
groupie.classify_text(test_tokens)
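To turn a classification into a decision, you usually want the best-scoring group, and perhaps no answer at all when the result is too close to call. The helper below is a hedged sketch of that step. It assumes the classification result is a Hash of group name to score, which is an assumption about the return value of `classify_text`; `sketch_best_group` and its `min_margin` parameter are hypothetical and not part of Groupie.

```ruby
# Hypothetical helper, not part of Groupie.
# `scores` is assumed to be a Hash of group name => numeric score.
# Returns the best group, or nil when there are no scores or when the lead
# over the runner-up is smaller than min_margin.
def sketch_best_group(scores, min_margin: 0.1)
  ranked = scores.sort_by { |_group, score| -score }
  return nil if ranked.empty?

  best, runner_up = ranked
  return best.first if runner_up.nil?

  (best.last - runner_up.last) >= min_margin ? best.first : nil
end

# Made-up scores for illustration only.
clear_winner = sketch_best_group({ spam: 0.7, ham: 0.3 })
too_close    = sketch_best_group({ spam: 0.52, ham: 0.48 })
```

Returning nil for a narrow margin lets the caller fall back to a default, such as leaving an item unclassified instead of guessing.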
Persistence can be naively done by using YAML:
groupie = Groupie.new
groupie[:spam].add(%w[assume you have a lot of data you care about])
require 'yaml'
yaml = YAML.dump(groupie)
loaded = YAML.safe_load(yaml, permitted_classes: [Groupie, Groupie::Group, Symbol])
I'm still experimenting with Groupie in Infinity Feed, so persistence is a Future Problem for me there. In development, I'm building (low data count) classifiers in memory and discarding them after use.
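The reason the YAML example passes `permitted_classes` is that `YAML.safe_load` refuses to instantiate custom classes unless they are explicitly allowed. The sketch below shows this with a stand-in class so it runs without the gem; `SketchCounts` is hypothetical and only mimics the idea of an object that holds word counts.

```ruby
require 'yaml'

# Hypothetical stand-in for a trained classifier; not Groupie's real class.
class SketchCounts
  attr_reader :counts

  def initialize
    @counts = {}
  end

  # Count each word in the given Array of Strings.
  def add(words)
    words.each { |word| @counts[word] = @counts.fetch(word, 0) + 1 }
  end
end

original = SketchCounts.new
original.add(%w[assume you have a lot of data you care about])

yaml = YAML.dump(original)

# Without permitted_classes, safe_load rejects the custom class.
rejected = nil
begin
  YAML.safe_load(yaml)
rescue Psych::DisallowedClass => e
  rejected = e.message
end

# Explicitly permitting the class restores the object and its counts.
restored = YAML.safe_load(yaml, permitted_classes: [SketchCounts])
```

The same pattern applies when the YAML string is written to and read back from a file: every class that appears in the dump has to be listed when loading it safely.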
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment. Rubocop is available via bin/rubocop with some friendly default settings.
To install this gem onto your local machine, run bundle exec rake install.
To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org. For obvious reasons, only the project maintainer can do this.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/Narnach/groupie.
License
The gem is available as open source under the terms of the MIT License.