Security News
Opengrep Emerges as Open Source Alternative Amid Semgrep Licensing Controversy
Opengrep forks Semgrep to preserve open source SAST in response to controversial licensing changes.
hnswlib-wasm
Advanced tools
This is a WebAssembly (Wasm) version of the hnswlib index library written in C++. This Wasm port was created by @ShravanSunder using the emcc Wasm compiler, see the repository here.
Inspired by the library hnswlib-node. @yoshoku has provided some wonderful documentation for hnswlib-node. Thanks, @yoshoku!
Note: This library is still in its early days! It is being built for a use case that requires running hnswlib in the browser.
hnswlib-wasm
provides wasm bindings for Hnswlib that implements approximate nearest-neighbor search based on hierarchical navigable small world graphs. It works in browsers and is compiled with Emscripten.
$ yarn add hnswlib-wasm
See the npm package here.
indexedDB
(in browser) and uses FS from Emscripten to save and load the index via the virtual file system and IDBFS.hnswlib-node
for more details by @yoshoku changelog.First, create a runtime instance of the library:
import { loadHnswlib } from 'hnswlib-wasm';
const lib = await loadHnswlib();
Here is a full example of loading a index if it exists or creating a new index if it doesn't exist:
const filename = 'ghost.dat';
const { loadHnswlib } = await import('hnswlib-wasm');
this.hnswlib = await loadHnswlib();
this.hnswlib.EmscriptenFileSystemManager.setDebugLogs(true);
this.vectorHnswIndex = new this.hnswlib.HierarchicalNSW('cosine', 1536);
await syncFileSystem('read');
const exists = this.hnswlib.EmscriptenFileSystemManager.checkFileExists(filename);
if (!exists) {
this.vectorHnswIndex.initIndex(100000, 48, 128, 100, true);
this.vectorHnswIndex.setEfSearch(32);
this.vectorHnswIndex.writeIndex('ghost.dat');
} else {
this.vectorHnswIndex.readIndex(filename, 100000, true);
this.vectorHnswIndex.setEfSearch(32);
}
You can create the index and use it like so.
// Here you're creating a new index with the L2 distance metric and 1000 as the max number of elements
const hnswIndex = lib.HierarchicalNSW('l2', 100);
// Initialize the index with the dimensions (1536), m, efConstruction. See the section below on parameters for more details. These cannot be changed after the index is created.
index.initIndex(1536, 36, 16, 200, true);
// Set efSearch parameters. This can be changed after the index is created.
index.setEfSearch(efSearch);
// Now you can add items to the index, labels are returned as an array for the vectors. It will reuse deleted labels if possible based on the second parameter.
const labels = index.addItems(vectors, true);
// Now you can search the index
const result1 = index.searchKnn(vectors[10], 10, undefined);
// You can also search the index with a label filter
const labelFilter = (label: number) => {
return label >= 10 && label < 20;
}
const result2 = index.searchKnn(testVectorData.vectors[10], 10, labelFilter);
More usage examples to be added.
For now, see the
HierarchicalNSW.test.ts
file in the tests folder and refer to the hnswlib-node API Documentation.
The hnswlib-wasm
library provides extended support for IndexedDB (IDBFS) to store and manage the search index in the browser. This allows you to save and load the search index easily in a web environment and you don't have to move data from the emcc file system to javascript memory. It uses the FS from Emscripten to save and load the index via the virtual file system. The virtual file system is synchronized with IDBFS.
To save the search index, use the writeIndex
method:
await index.writeIndex('savedIndex');
To load a previously saved search index, use the readIndex
method:
await index.readIndex('savedIndex', false);
The syncFs
method is used to synchronize the Emscripten file system with the persistent storage IDBFS. You can use this method to save or read data from the file system's persistent source.
await lib.EmscriptenFileSystemManager.syncFS(true, emscripten::val::undefined()); // Read data from the persistent source
await lib.EmscriptenFileSystemManager.syncFS(false, emscripten::val::undefined()); // Save data to the persistent source
This section will provide an overview of the HNSW algorithm parameters and their impact on performance when using the hnswlib-wasm library. HNSW (Hierarchical Navigable Small World) is a graph-based index structure for efficient similarity search in high-dimensional spaces.
Image from pinecone.io
It has several parameters that can be tuned to control the trade-off between search quality and index size or construction time. Here are some of the key parameters.
efSearch is the size of the dynamic list for the nearest neighbors used during the search. Higher efSearch values lead to more accurate but slower searches. efSearch cannot be set lower than the number of queried nearest neighbors k and can be any value between k and the size of the dataset.
M is the number of bi-directional links created for every new element during index construction. A reasonable range for M is 2-100. Higher M values work better on datasets with high intrinsic dimensionality and/or high recall, while lower M values work better for datasets with low intrinsic dimensionality and/or low recall. The parameter also determines the algorithms memory consumption, which is roughly M * 8-10 bytes per stored element.
efConstruction controls the index construction time and accuracy. Bigger efConstruction values lead to longer construction times but better index quality. At some point, increasing efConstruction does not improve the quality of the index. To check if the selected efConstruction value is appropriate, measure recall for M nearest neighbor search when efSearch = efConstruction. If the recall is lower than 0.9, there is room for improvement.
When using hnswlib-wasm, it is essential to choose appropriate values for M, efSearch, and efConstruction based on your datasets size and dimensionality. Since hnswlib-wasm is running in the browser, you should consider the available memory and performance limitations. Here are some recommendations:
Choose a value in the range of 12-48, as it works well for most use cases. You may need to experiment to find the optimal value for your specific dataset.
Start with a value close to M and adjust it based on your desired trade-off between search speed and accuracy. Lower values will be faster but less accurate, while higher values will be more accurate but slower.
Set this value considering the expected query volume. If you anticipate low query volume, you can set a higher value for efConstruction to improve recall with minimal impact on search time, especially when using lower M values.
Remember that higher M values will increase the memory usage of the index, so you should balance performance and memory constraints when choosing your parameters for hnswlib-wasm.
Images from pinecone.io
hnswlib-wasm is available as open source under the terms of the Apache-2.0 License.
To build
yarn install
make rebuild
yarn build
To test
yarn test
Contact @ShravanSunder first!
FAQs
typescript and wasm bindings for Hnswlib
We found that hnswlib-wasm demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Opengrep forks Semgrep to preserve open source SAST in response to controversial licensing changes.
Security News
Critics call the Node.js EOL CVE a misuse of the system, sparking debate over CVE standards and the growing noise in vulnerability databases.
Security News
cURL and Go security teams are publicly rejecting CVSS as flawed for assessing vulnerabilities and are calling for more accurate, context-aware approaches.