Triez
Pragmatic tries for Ruby, spelled in lolcat.
It is fast, memory efficient, unicode aware, prefix searchable, and enchanced with prefix/suffix/substring keys.
The backend of triez is a cache oblivious data structure: the HAT trie (In fact it is a modified version for improved functionality). HAT trie is generally faster and more memory efficient than double array or burst trie.
Requirement
- CRuby 1.9 / 2.0
g++
or clang
Install
gem ins triez
Synopsis
require 'triez'
t = Triez.new
t['foo']
t = Triez.new value_type: :int64
t = Triez.new value_type: :object
t.value_type
t = Triez.new value_type: :object, default: 'hello'
t['key'] = 100
t << 'key'
t.change_all(:suffix, 'key') {|old_value| ...calculate new value }
t.change_all(:prefix, 'key') {|old_value| ...calculate new value }
t.change_all(:substring, 'key') {|old_value| ...calculate new value }
t.size
t.has_key? 'key'
t['key']
t.search_with_prefix(prefix, limit: 10, sort: true) do |suffix, value|
...
end
t.search_with_prefix('prefix')
t.each do |key, value|
...
end
t.walk string do |k, v|
...
end
* Note: By default, triez store signed integers within 64bits, you can use them as weights, counts or database IDs. In case you need to store arbitrary object in a node, use value_type: :object
:
t = Triez.new value_type: :object
t['Tom'] = {name: 'Tom', sex: 'Female'}
t['Tree'] = [:leaf, :trunk, :root]
Examples
Prefix based autocompletion:
require 'triez'
words = %w[readme, rot, red, rah, rasterization]
t = Triez.new
words.each do |word|
t[word] = 1
end
t.search_with_prefix 're' do |suffix|
puts "candidate: re#{suffix}"
end
The output:
candidate: readme
candidate: red
Efficient full text search with a suffix tree:
require 'triez'
sequences = {
'ACTGAAAAAAACTG' => 1,
'ATACGGTCCA' => 2,
'GCTTGTACGT' => 3
}
t = Triez.new
sequences.each do |seq, id|
t.change_all(:suffix, seq){id}
end
t.search_with_prefix 'CGGT' do |_, id|
puts id
end
The searching time is linear to the length of the substring. You may also be interested in the example of a simple full text search server with triez.
Solve the longest common substring problem:
require 'triez'
sentences = %w[
万塘路一锅鸡
去文二路一锅鸡吃饭
来一锅鸡顶盒
一锅鸡胗
]
t = Triez.new value_type: :object, default: 0
sentences.each_with_index do |sentence, i|
elem = 1 << i
t.change_all :substring, sentence do |v|
v | elem
end
end
# longest common substring
lcs = ''
# find the key tagged with universe
universe = (1 << sentences.size) - 1
t.each do |k, v|
lcs = k if k.size > lcs.size and v == universe
end
puts lcs #=> 一锅鸡
Benchmark
Here's a benchmark on
ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.1]
2.3 GHz Intel Core i7
The test data are 3 milion titles of wikipedia articles (from http://dumps.wikimedia.org/enwiki/20121101/)
thing/backend | memory | insertion time | 3 M query
------------------------|---------|----------------|----------
hash/linked hash | 340.2 M | 4.369 s | 0.2800 s
fast_trie/double array* | 155.6 M | 130.7 s | 0.4359 s
triez/HAT trie | 121.7 M | 3.872 s | 0.3472 s
Note: fast_trie/double array
-> https://github.com/tyler/trie
Caveats
- The
sort
option in prefixed search orders keys with binary collation, but string comparison in Ruby is with unicode codepoint collation. - For some rare case of many threads modifying the same trie, you may need a mutex.
- If you still feel memory not enough, you may consider MARISA-trie (note that MARISA is immutable), or a database.
Development
git clone git://github.com/luikore/triez.git
cd triez
rake glob_src
rake
To update vendor lib and re-compile:
rake glob_src
rake
Note
Although HAT trie uses MurMurHash3 instead of SipHash in Ruby, It is still safe under hashDoS because bucket size is limited.