stopwords-filters

  • 1.0.0
  • Rubygems

Stopwords Filter

This project is a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence.

Quick guide

  • Install

Just type

gem install 

or

# Don't forget the 'require:'
gem '', require: 'stopwords'

in your Gemfile.

  • Use it

    1. Simple version
stopwords = ['by', 'written', 'from']
filter = Stopwords::Filter.new stopwords

filter.filter 'guide by douglas adams'.split
# ['guide', 'douglas', 'adams']

filter.stopword? 'by'
# true

    2. Snowball version
filter = Stopwords::Snowball::Filter.new "en"
filter.filter 'guide by douglas adams'.split
# ['guide', 'douglas', 'adams']

filter.stopword? 'by'
# true

2.1 Snowball version with Sieve class (thanks to @s2gatev)

sieve = Stopwords::Snowball::WordSieve.new

filtered = sieve.filter lang: :en, words: 'guide by douglas adams'.split
# filtered = ['guide', 'douglas', 'adams']

sieve.stopword? lang: :en, word: 'by'
# true

What is a Stopword?

According to Wikipedia:

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text).

And that's it: stopwords are words that you remove before performing some task on the rest of them.

Why would I want to remove anything?

Imagine you have a database of products and you want your customers to be able to search it. You can't use a proper search engine (such as Solr, Sphinx or even Google), nor the full-text search features of popular database systems such as PostgreSQL. You are left alone with LIKE and % wildcards.

You have your fake search engine working. Someone searches 'Guide Douglas Adams' and you find 'Douglas Adams - Hitchhiker's guide to the galaxy'. Everything is perfect.

But then someone searches 'guide by douglas adams' and you don't find anything. You don't have any 'by' in the description or title of the book! Most importantly, you don't need that 'by'!

You wish you could get rid of all those 'by' or 'written' or 'from', huh? That's why we are here!
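To make the scenario concrete, here is a minimal sketch in plain Ruby of how a query could be turned into LIKE conditions once the stopwords are dropped. The helper `like_conditions`, the constant `SEARCH_STOPWORDS` and the `title` column are hypothetical names made up for this illustration; they are not part of the gem.

```ruby
# Hypothetical helper, not part of the gem: builds a SQL WHERE fragment and
# its bind values from a search query, skipping stopwords.
SEARCH_STOPWORDS = ['by', 'written', 'from'].freeze

def like_conditions(query, column: 'title')
  # Lowercase the query, split on whitespace and drop the stopwords.
  terms = query.downcase.split.reject { |word| SEARCH_STOPWORDS.include?(word) }
  # One "column LIKE ?" condition per remaining term.
  sql = terms.map { "#{column} LIKE ?" }.join(' AND ')
  binds = terms.map { |term| "%#{term}%" }
  [sql, binds]
end

sql, binds = like_conditions('guide by douglas adams')
puts sql
# prints: title LIKE ? AND title LIKE ? AND title LIKE ?
p binds
# prints: ["%guide%", "%douglas%", "%adams%"]
```

The sketch does not escape % or _ characters typed by the user, so a real query builder would need to handle those.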

How does this thing work?

The main class of this 'library' is Stopwords::Filter. You just create a new object with an array of stopwords:

stopwords = ['by', 'written', 'from']
filter = Stopwords::Filter.new stopwords

And then you have it; you can just filter:

filter.filter 'guide by douglas adams'.split  #-> ['guide', 'douglas', 'adams']
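For readers curious about what such a filter boils down to, the following is a minimal sketch in plain Ruby. `SimpleStopwordsFilter` is a hypothetical stand-in written for this illustration; it mirrors the `filter` and `stopword?` calls shown in this README but is not the gem's actual implementation.

```ruby
require 'set'

# Hypothetical stand-in for Stopwords::Filter, for illustration only.
class SimpleStopwordsFilter
  def initialize(stopwords)
    # A Set gives fast membership checks.
    @stopwords = Set.new(stopwords)
  end

  # Returns true when the word is in the stopword list.
  def stopword?(word)
    @stopwords.include?(word)
  end

  # Returns the words that are not stopwords, in their original order.
  def filter(words)
    words.reject { |word| stopword?(word) }
  end
end

filter = SimpleStopwordsFilter.new(['by', 'written', 'from'])
p filter.filter('guide by douglas adams'.split)
# prints ["guide", "douglas", "adams"]
```

This sketch compares exact strings, so 'By' and 'by' are treated as different words. Whether the gem normalizes case is not stated here.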

That's all?

I know what you're thinking: it takes one line of Ruby code to filter one array from another. That's why we have added some extra functionality: Snowball stopword lists, already built for you and ready to use.

In the beginning we used only the Snowball stopwords, but several collaborators have since improved this humble gem by including new languages or adding new stopwords. So now the Snowball version is more of a "Snowball and friends" version.

How do I use that snowball thing?

You just create the filter with the locale you want to use:

filter = Stopwords::Snowball::Filter.new "en"

And then you filter without worrying about the exact stopwords used:

filter.filter 'guide by douglas adams'.split  #-> ['guide', 'douglas', 'adams']
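The idea of a locale-keyed filter can be sketched in plain Ruby as below. Everything here is an assumption made for illustration: `LocaleStopwordsFilter` is a hypothetical class, and `SAMPLE_LISTS` holds tiny made-up samples, not the word lists shipped with the gem.

```ruby
require 'set'

# Hypothetical sketch of a locale-aware filter, for illustration only.
class LocaleStopwordsFilter
  # Tiny made-up samples; the real lists are much longer.
  SAMPLE_LISTS = {
    'en' => %w[by from the of],
    'es' => %w[de la el por]
  }.freeze

  def initialize(locale)
    # Accept strings or symbols and fail loudly on unknown locales.
    words = SAMPLE_LISTS.fetch(locale.to_s) do
      raise ArgumentError, "unsupported locale: #{locale}"
    end
    @stopwords = Set.new(words)
  end

  def stopword?(word)
    @stopwords.include?(word)
  end

  def filter(words)
    words.reject { |word| stopword?(word) }
  end
end

filter = LocaleStopwordsFilter.new('en')
p filter.filter('guide by douglas adams'.split)
# prints ["guide", "douglas", "adams"]
```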

Which languages are supported with snowball?

Currently we have support for:

  • Afrikaans (af)
  • Arabic (ar)
  • Bengali (bn)
  • Breton (br)
  • Catalan (ca)
  • Chinese (zh)
  • Czech (cs)
  • Danish (da)
  • German (de)
  • Greek (el)
  • English (en)
  • Spanish (es)
  • Finnish (fi): due to an error, it can also be referred to with the fn locale
  • French (fr)
  • Hebrew (he)
  • Hungarian (hu)
  • Indonesian (id)
  • Italian (it)
  • Korean (ko)
  • Dutch (nl)
  • Polish (pl)
  • Portuguese (pt)
  • Romanian (ro)
  • Russian (ru)
  • Swedish (sv)
  • Thai (th)
  • Turkish (tr)
  • Vietnamese (vi)

In the changelog you can see the collaborators for each language.

Anything else?

In a future version I would like to include a chaining filter, where you define a series of operations and they are executed in linear order, just like the Pipes and Filters design pattern.
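As a rough idea of what a Pipes and Filters chain could look like, here is a minimal sketch in plain Ruby. `WordPipeline` and the two lambdas are hypothetical names invented for this example; nothing like this is confirmed to exist in the gem.

```ruby
# Hypothetical sketch of a Pipes and Filters chain; none of these names
# exist in the gem. Each step is a callable that takes an array of words
# and returns a new array of words.
class WordPipeline
  def initialize(*steps)
    @steps = steps
  end

  # Feeds the output of each step into the next one, in order.
  def call(words)
    @steps.reduce(words) { |current, step| step.call(current) }
  end
end

downcase = ->(words) { words.map(&:downcase) }
drop_stopwords = ->(words) { words - ['by', 'written', 'from'] }

pipeline = WordPipeline.new(downcase, drop_stopwords)
p pipeline.call('Guide BY Douglas Adams'.split)
# prints ["guide", "douglas", "adams"]
```

Order matters in this sketch: running the stopword step before the downcase step would leave 'BY' in the result.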

Acknowledgments

Thanks to @s2gatev, who added the stopword? method and the sieve class to this gem.

Thanks to @bettysteger, @fauno, @vrypan, @woto, @grzegorzblaszczyk, @nerde, @sbeckeriv and @zackxu1 for language support and other features.

FAQs

Package last updated on 06 Sep 2024
