Hanny
Hanny is a Hash-based Approximate Nearest Neighbor (ANN) search library in Ruby.
Hash-based ANN converts vector data into binary codes and builds a hash table by using the binary codes as hash keys.
To build the hash table, Hanny uses Locality Sensitive Hashing (LSH) of approximating cosine similarity.
It is known that if the code length is sufficiently long (ex. greater than 128-bit), LSH can obtain high search performance.
In the experiment, Hanny achieved about twenty times faster search speed than the brute-force search by Euclidean distance.
Installation
Add this line to your application's Gemfile:
gem 'hanny'
And then execute:
$ bundle
Or install it yourself as:
$ gem install hanny
Documentation
Usage
require 'hanny'
targets = Numo::DFloat.new(5000, 512).rand
queries = Numo::DFloat.new(10, 512).rand
index = Hanny::LSHIndex.new(code_length: 256)
index.build_index(targets)
candidates = index.search_knn(queries, n_neighbors: 10)
candidates = index.search_radius(queries, radius: 4)
query_id = 0
distances = Hanny::Utils.euclidean_distance(queries[query_id, true], targets[candidates[query_id], true])
appended_data_ids = index.append_data(new_data)
removed_data_ids = index.remove_data([0, 1, 2])
File.open('index.dat', 'wb') { |f| f.write(Marshal.dump(index)) }
index = Marshal.load(File.binread('index.dat'))
Experiment
I confirmed the search speed of Hanny's LSH with MNIST data set.
The experiment is carried out on MacBook Early 2016 (Core m3 1.1 GHz CPU and 8 GB memory).
Code:
require 'benchmark'
require 'rumale'
require 'hanny'
samples, labels = Rumale::Dataset.load_libsvm_file('mnist')
queries = samples[0..5, true]
targets = samples[6..-1, true]
qlabels = labels[0..5]
tlabels = labels[6..-1]
index = Hanny::LSHIndex.new(code_length: 128, random_seed: 1)
index.build_index(targets)
n_queries = queries.shape[0]
n_neighbors = 5
Benchmark.bm 50 do |r|
r.report 'LSH' do
candidates = index.search_knn(queries, n_neighbors: n_neighbors)
n_queries.times do |m|
STDERR.write("\nquery label: %d, neighbors label: " % qlabels[m])
candidates[m].each { |n| STDERR.write("%d, " % tlabels[n]) }
end
STDERR.write("\n")
end
r.report 'Brute-force' do
distance_mat = Hanny::Utils.euclidean_distance(queries, targets)
candidates = Array.new(n_queries) do |n|
distance_mat[n, true].to_a.map.with_index.sort_by(&:first).map(&:last)[0...n_neighbors]
end
n_queries.times do |m|
STDERR.write("\nquery label: %d, neighbors label: " % qlabels[m])
candidates[m].each { |n| STDERR.write("%d, " % tlabels[n]) }
end
STDERR.write("\n")
end
end
Result:
user system total real
LSH
query label: 5, neighbors label: 5, 5, 5, 5, 5,
query label: 0, neighbors label: 0, 0, 0, 0, 0,
query label: 4, neighbors label: 4, 4, 4, 4, 4,
query label: 1, neighbors label: 1, 1, 1, 1, 1,
query label: 9, neighbors label: 9, 9, 9, 9, 9,
query label: 2, neighbors label: 2, 2, 2, 2, 2,
0.290000 0.010000 0.300000 ( 0.307445)
Brute-force
query label: 5, neighbors label: 5, 5, 5, 3, 5,
query label: 0, neighbors label: 0, 0, 0, 0, 0,
query label: 4, neighbors label: 4, 4, 4, 4, 4,
query label: 1, neighbors label: 1, 1, 1, 1, 1,
query label: 9, neighbors label: 9, 9, 9, 9, 9,
query label: 2, neighbors label: 2, 2, 2, 2, 2,
6.350000 0.280000 6.630000 ( 6.682365)
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/yoshoku/Hanny. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
License
The gem is available as open source under the terms of the BSD 2-clause License.
Code of Conduct
Everyone interacting in the Hanny project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.