Trie Substring Search
Installation
Implementation of text syntax analyzer, which represents text as non symetric trie of chars or substrings by dictionary or list of keywords. Similar to Aho-Corasick algorithm, but with modifications, there is removed suffixes and added recovering text from the trie.
Add this line to your application's Gemfile:
gem 'trie-substring-search'
And then execute:
$ bundle
Or install it yourself as:
$ gem install trie-substring-search
Usage
dictionary = %w[he she her his him he they their she]
tss = TSS::Trie.new(dictionary, :full)
tss.parse('he their them height have then their shelter')
tss.extend_dictionary(%w[our it them])
vertex = tss.root.get_child('s').get_child('h').get_child('e')
vertex.end_indexes
tss.backtrace_to_word(vertex)
Index of word in dictionary can be used to get relations with additional array with external data(or collection, or can be easily replaced by foreign key in the future). If dictionary contain duplicates, then you will get few indexes in result.
Benchmark
dictionary: 100000 words
number of executions with uniq text: 44555
results of benchmark:
user system total real
1.749733 0.116561 1.866294 ( 1.876876)
Development
Contributing
TBD Improvements:
- TODO: Mode providing ability to return whole words, that contain substring from dictionary
- TODO: Suffix references is not implemented. Just now it contains full syntax tree, so it is less memory efficient then original Aho-Corasick algorithm.
- TODO: Maybe, will be good to implement next features for trie build stage:
- post optimization of vertex connections, creating suffixes
- some parallelism
- pre optimization of a dictionary.