Stopwords Filter
This project is a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence.
Quick guide
Just type
gem install
or
# Don't forget the 'require:'
gem '', require: 'stopwords'
in your Gemfile.
stopwords = ['by', 'written', 'from']
filter = Stopwords::Filter.new stopwords
filter.filter 'guide by douglas adams'.split
# ['guide', 'douglas', 'adams']
filter.stopword? 'by'
# true
- Snowball version
filter = Stopwords::Snowball::Filter.new "en"
filter.filter 'guide by douglas adams'.split
# ['guide', 'douglas', 'adams']
filter.stopword? 'by'
# true
2.1 Snowball version with Sieve class (thanks to @s2gatev)
sieve = Stopwords::Snowball::WordSieve.new
filtered = sieve.filter lang: :en, words: 'guide by douglas adams'.split
sieve.stopword? lang: :en, word: 'by'
What is a Stopword?
According to Wikipedia
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text).
And that's it. Words that are removed before you perform some task on the rest of them.
Why would I want to remove anything?
Imagine you have a database of products and you want your customers to be able to search them. You can't use a proper search engine (such as Solr, Sphinx or even Google), nor the full-text search features of popular database systems such as PostgreSQL. You are left alone with LIKEs and %.
You have your fake search engine working. Someone searches 'Guide Douglas Adams' and you find 'Douglas Adams - Hitchhiker's guide to the galaxy'. Everything is perfect.
But then someone searches 'guide by douglas adams' and you don't find anything. You don't have any 'by' in the description or title of the book! Most importantly, you don't need that 'by'!
You wish you could get rid of all those 'by' or 'written' or 'from', huh? That's why we are here!
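To make the scenario concrete, here is a minimal sketch of turning a search query into LIKE patterns once the stopwords are gone. The helper name like_patterns and the STOPWORDS constant are hypothetical and not part of this gem; the sketch uses plain array difference so that it runs on its own.

```ruby
# Hypothetical helper for illustration; not part of this gem.
STOPWORDS = %w[by written from].freeze

def like_patterns(query, stopwords = STOPWORDS)
  # Lowercase the query, split on whitespace, then drop the stopwords.
  terms = query.downcase.split - stopwords
  # One pattern per remaining term, meant to be passed as a bound parameter.
  terms.map { |term| "%#{term}%" }
end

like_patterns('guide by douglas adams')
# => ["%guide%", "%douglas%", "%adams%"]
```

A real query would also need to escape any % or _ characters typed by the user; that is left out here to keep the sketch short.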
How does this thing work?
The main class of this 'library' is Stopwords::Filter. You just create a new object with an array of stopwords
stopwords = ['by', 'written', 'from']
filter = Stopwords::Filter.new stopwords
And then you have it; you can just filter
filter.filter 'guide by douglas adams'.split #-> ['guide', 'douglas', 'adams']
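For intuition, the behaviour shown above can be approximated in a few lines of plain Ruby. This is only a sketch: NaiveFilter is a hypothetical stand-in, and the gem's actual implementation may differ.

```ruby
require 'set'

# Hypothetical stand-in for illustration; not the gem's source code.
class NaiveFilter
  def initialize(stopwords)
    # A Set keeps the membership check cheap for long stopword lists.
    @stopwords = stopwords.to_set
  end

  # Returns the words that are not stopwords, preserving their order.
  def filter(words)
    words.reject { |word| stopword?(word) }
  end

  def stopword?(word)
    @stopwords.include?(word)
  end
end

naive = NaiveFilter.new(%w[by written from])
naive.filter('guide by douglas adams'.split)
# => ["guide", "douglas", "adams"]
naive.stopword?('by')
# => true
```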
That's all?
I know what you're thinking: it takes one line of Ruby code to filter one array from another. That's why we have added some extra functionality: Snowball stopword lists, already built for you and ready to use.
At least, in the beginning we were using Snowball stopwords, but several collaborators have improved this humble gem by including new languages or adding new stopwords. So now the Snowball version is more of a "Snowball and friends" version.
How do I use that snowball thing?
You just create the filter with the locale you want to use
filter = Stopwords::Snowball::Filter.new "en"
And then you filter without worrying about the exact stopwords used
filter.filter 'guide by douglas adams'.split #-> ['guide', 'douglas', 'adams']
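The idea behind a locale-based filter can be sketched as a lookup from locale to word list. Everything below is an assumption made for illustration: the helper filter_for is hypothetical, the two lists are tiny samples rather than the gem's real lists, and raising on an unknown locale is a choice of this sketch, not documented behaviour of the gem.

```ruby
# Tiny illustrative lists; the gem's real lists are much longer.
SAMPLE_LISTS = {
  'en' => %w[by from the a of],
  'es' => %w[de la el por un]
}.freeze

# Hypothetical helper: pick the list for a locale, then filter with it.
def filter_for(locale, words, lists = SAMPLE_LISTS)
  stopwords = lists.fetch(locale.to_s) do
    raise ArgumentError, "no stopword list for locale #{locale.inspect}"
  end
  words - stopwords
end

filter_for('en', 'guide by douglas adams'.split)
# => ["guide", "douglas", "adams"]
filter_for(:es, 'el libro de douglas adams'.split)
# => ["libro", "douglas", "adams"]
```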
Which languages are supported with snowball?
Currently we have support for:
- Afrikaans (af)
- Arabic (ar)
- Bengali (bn)
- Breton (br)
- Catalan (ca)
- Chinese (zh)
- Czech (cs)
- Danish (da)
- German (de)
- Greek (el)
- English (en)
- Spanish (es)
- Finnish (fi): due to an error, it can also be used by referring to the fn locale
- French (fr)
- Hebrew (he)
- Hungarian (hu)
- Indonesian (id)
- Italian (it)
- Korean (ko)
- Dutch (nl)
- Polish (pl)
- Portuguese (pt)
- Romanian (ro)
- Russian (ru)
- Swedish (sv)
- Thai (th)
- Turkish (tr)
- Vietnamese (vi)
In the changelog you can see the collaborators for each language.
Anything else?
In a future version I would like to include a chaining filter where you include a series of operations and they are executed in linear order, just like the Pipes and Filters design pattern.
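As a rough idea of what such a chaining filter might look like, here is a minimal Pipes and Filters sketch in plain Ruby. The Pipeline class and its pipe and call methods are hypothetical; nothing like this exists in the gem yet, and a real design could look quite different.

```ruby
# Hypothetical sketch of a chaining filter; no such class exists in the gem.
class Pipeline
  def initialize
    @steps = []
  end

  # Registers a step (a block or anything responding to #call) and
  # returns self so that calls can be chained.
  def pipe(step = nil, &block)
    @steps << (step || block)
    self
  end

  # Feeds the input through every step, in the order they were added.
  def call(input)
    @steps.reduce(input) { |value, step| step.call(value) }
  end
end

stopwords = %w[by written from]
pipeline = Pipeline.new
  .pipe { |text| text.downcase }
  .pipe { |text| text.split }
  .pipe { |words| words - stopwords }

pipeline.call('Guide BY Douglas Adams')
# => ["guide", "douglas", "adams"]
```

Each step only sees the output of the previous one, so steps can be reordered or swapped without touching the others.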
Acknowledgments
Thanks to @s2gatev, who added the stopword? method and the sieve class to this gem.
Thanks to @bettysteger, @fauno, @vrypan, @woto, @grzegorzblaszczyk, @nerde, @sbeckeriv and @zackxu1 for language support and other features.