Tokkens

Tokkens makes it easy to apply a vector space model to text documents, with machine learning applications in mind. It provides a mapping between numbers and tokens (strings).

Read more about installation and usage below, or skip to an example.

Installation

Add this line to your application's Gemfile:

gem 'tokkens'

And then execute:

$ bundle

Or install it yourself as:

$ gem install tokkens

Note that you'll need Ruby 2+.

Usage

Tokens

get and find

Tokens is a store for mapping strings (tokens) to numbers. Each string gets its own unique number. First, create a new instance.

require 'tokkens'
@tokens = Tokkens::Tokens.new

Then get a number for some tokens. You'll notice that each distinct token gets its own number.

puts @tokens.get('foo')
# => 1
puts @tokens.get('bar')
# => 2
puts @tokens.get('foo')
# => 1

The reverse operation is find (the code is optimized for get).

puts @tokens.find(2)
# => "bar"

The prefix option can be used to add a prefix to the token.

puts @tokens.get('blup', prefix: 'DESC:')
# => 3
puts @tokens.find(3)
# => "DESC:blup"
puts @tokens.find(3, prefix: 'DESC:')
# => "blup"

load and save

To persist tokens across runs, one can save and load the list of tokens. At the moment, this is a plain text file with one line per token, containing its number, occurrence count and the token itself.

@tokens.save('foo.tokens')
# ---- some time later
@tokens = Tokkens::Tokens.new
@tokens.load('foo.tokens')
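
For illustration only, a saved file could look roughly like the lines below (one token per line: number, occurrence count, token); the exact field order and separator depend on the gem's implementation.

1 2 foo
2 1 bar
3 1 DESC:blup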

limit!

One common operation is reducing the number of tokens, to retain only those that are most relevant. This is called feature selection or dimensionality reduction. You can limit by maximum size with max_size (the most frequently occurring tokens are kept).

@tokens = Tokkens::Tokens.new
@tokens.get('foo')
@tokens.get('bar')
@tokens.get('baz')
@tokens.indexes
# => [1, 2, 3]
@tokens.limit!(max_size: 2)
@tokens.indexes
# => [1, 2]

Or you can reduce by minimum occurrence count with min_occurence.

@tokens.get('zab')
# => 4
@tokens.get('bar')
# => 2
@tokens.indexes
# => [1, 2, 4]
@tokens.limit!(min_occurence: 2)
@tokens.indexes
# => [2]

Note that this limits only the token store; if you reference the removed tokens elsewhere, you may still need to remove those references yourself.
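
As a sketch of what that can look like (the documents hash below is hypothetical, not part of the gem), previously tokenized data can be pruned against the numbers that survived limit!:

kept = @tokens.indexes                                 # numbers still present after limit!
documents = { 'doc1' => [1, 2, 4], 'doc2' => [2, 3] }  # hypothetical earlier tokenizations
pruned = documents.map { |name, numbers| [name, numbers & kept] }.to_h
# => {"doc1"=>[2], "doc2"=>[2]}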

freeze! and thaw!

Tokens may be used to train a model on a training dataset, and later to make predictions with that model. In this case, new tokens need to be added during the training stage, but it doesn't make sense to generate new tokens during prediction.

By default, Tokens creates a new token when an unrecognized string is passed to get. But once the store has been frozen with freeze! (you can check with frozen?), get still returns the number of known tokens but returns nil for unknown ones. If for some reason you'd like to add new tokens again, use thaw!.

@tokens.get('hithere')   # make sure 'hithere' is known before freezing
@tokens.freeze!
@tokens.get('hithere')
# => 4
@tokens.get('blahblah')
# => nil
@tokens.thaw!
@tokens.get('blahblah')
# => 5

Note that after load, the store may be frozen; use thaw! if you need to add new tokens again.
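
Putting freeze!, save and load together, a minimal training/prediction sketch could look like the following; the file name and the toy training data are made up for illustration.

require 'tokkens'

# Training stage: the store accepts new tokens
tokens = Tokkens::Tokens.new
train_numbers = ['foo bar', 'bar baz'].map { |s| s.split.map { |w| tokens.get(w) } }
tokens.save('train.tokens')   # hypothetical file name

# Prediction stage (possibly a later run): only known tokens map to numbers
tokens = Tokkens::Tokens.new
tokens.load('train.tokens')
tokens.freeze!
numbers = 'foo unseen'.split.map { |w| tokens.get(w) }.compact
# 'unseen' yields nil and is dropped by compact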

Tokenizer

When processing sentences or other text bodies, Tokenizer provides a way to map them to an array of numbers (using Tokens).

@tokenizer = Tokkens::Tokenizer.new
@tokenizer.get('hi from example')
# => [1, 2, 3]
@tokenizer.tokens.find(3)
# => "example"

The prefix keyword argument also works here.

@tokenizer.get('from example', prefix: 'X:')
# => [4, 5]
@tokenizer.tokens.find(5)
# => "X:example"

One can specify a minimum length (default 2) and stop words for tokenizing.

@tokenizer = Tokkens::Tokenizer.new(min_length: 3, stop_words: %w(and the))
@tokenizer.get('the cat and a bird').map {|i| @tokenizer.tokens.find(i)}
# => ["cat", "bird"]

Example

A basic text classification example using liblinear can be found in examples/classify.rb. Run it as follows:

$ gem install liblinear-ruby
$ ruby examples/classify.rb
How many students are in for the exams today? -> students exams -> school
The forest has large trees, while the field has its flowers. -> trees field flowers -> nature
Can we park our cars inside that building to go shopping? -> cars building shopping -> city

The classifier was trained using three training sentences for each class. The output shows a prediction for three test sentences. Each test sentence is printed, followed by the tokens, followed by the predicted class.
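
As a rough sketch of how Tokenizer output can be turned into classifier features, one might count token numbers per sentence; the sparse {number => count} representation below is an assumption for illustration, not something the gem provides.

require 'tokkens'

tokenizer = Tokkens::Tokenizer.new(stop_words: %w(the a an and))
sentences = ['the forest has large trees', 'park our cars inside that building']

# Build one sparse {token number => count} hash per sentence
features = sentences.map do |sentence|
  counts = Hash.new(0)
  tokenizer.get(sentence).each { |number| counts[number] += 1 }
  counts
end
# These hashes (or dense arrays derived from them) can then be passed to a
# linear classifier, as examples/classify.rb does with liblinear-ruby.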

MIT license

Contributing

  1. Fork it ( https://github.com/[my-github-username]/tokkens/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Make sure the tests are green (rspec)
  5. Push to the branch (git push origin my-new-feature)
  6. Create a new Pull Request
