Fast ternary string set
A fast, space-efficient, serializable string set based on ternary search trees, with both exact and approximate membership tests.
API Docs
Features
- Drop-in replacement for nearly any use of a JavaScript
Set
of strings. - Search and iteration methods return elements in ascending sorted (lexicographic) order.
- Set relations (equality, subset, superset) and operations (union, intersection, difference, symmetric difference).
- Several approximate matching methods:
- List strings that complete a prefix.
- List strings that can be made from a list of letters.
- List strings that match a pattern including "don't care" letters (as
.
in a regular expression). - List strings within a certain Hamming distance of a pattern.
- Serialize sets to and from an
ArrayBuffer
. - Time and space efficient:
- Leverages common JS engine optimizations under the hood.
- Elements share tree nodes and do not retain references to the original strings.
- Read-only sets can be compacted to save even more space.
- Well-documented TypeScript source, targeting modern JavaScript by default.
- Backed by extensive test suites.
- Use as a standalone/ECMAScript module or as a Node.js/CommonJS module.
- No other dependencies.
Installation
To install the latest stable version with npm
:
npm install fast-ternary-string-set
Or, if using yarn
:
yarn add fast-ternary-string-set
To use it without Node.js, you can simply copy the main source file (src/index.ts
) into any TypeScript project, rename to something sensible, and then import
it into your code as usual.
Examples
Complete API docs are available. Here are some examples of common tasks to get you started:
Loading the module:
const { TernaryStringSet } = require("fast-ternary-string-set");
import { TernaryStringSet } from "fast-ternary-string-set";
Create a new set and add some strings:
const set = new TernaryStringSet();
set.add("dog").add("cat").add("eagle");
set.addAll(["aardvark", "beaver", "dog", "fish", "hamster"]);
set.has("cat");
set.delete("cat");
set.has("cat");
set.has(123.456);
Create a new string set from any Iterable<string>
:
let otherSet = new Set(["fish", "hippo"]);
let set = new TernaryStringSet(otherSet);
set.has("hippo");
Iterate over all elements in sorted order:
for (const el of set) {
console.log(el);
}
set.forEach((el) => console.log(el));
Get all elements that start with "sha"
:
set.getCompletionsOf("sha");
Get all elements that can be made from the letters of "taco"
:
set.getArrangementsOf("taco");
Get all elements within Hamming distance 1 of "cat"
:
set.getWithinHammingDistanceOf("cat", 1);
Get all elements that match "b.t"
:
set.getPartialMatchesOf("b.t");
Compare sets:
let s1 = new TernaryStringSet(["a", "b", "c"]);
let s2 = new TernaryStringSet(s1);
s1.equals(s2);
s1.isSubsetOf(s2);
s2.add("d");
s1.equals(s2);
s1.isSubsetOf(s2);
s1.isSupersetOf(s2);
Serialize to or from a buffer:
const buff = set.toBuffer();
const set = TernaryStringSet.fromBuffer(buff);
A simple spelling checker:
const words = fs.readFileSync("./wordlist.txt", "utf8").split("\n");
const dict = new TernaryStringSet(words).add("");
dict.balance();
dict.compact();
fs.writeFileSync("./dict.bin", dict.toBuffer());
const dict = TernaryStringSet.fromBuffer(fs.readFileSync("./dict.bin"));
const toCheck = "In a time long past lived a cat named Captain Peanut."
toCheck.toLowerCase().split(/\W+/).forEach((word) => {
if (!dict.has(word)) {
console.log(`misspelled: ${word}`);
}
});
Usage notes
Differences from standard JS Set
TernaryStringSet
supports a superset of the standard Set
interface, but it is not a subclass of Set
.
JavaScript Set
objects guarantee that they iterate over elements in the order that they are added.
TernaryStringSet
s always return results in sorted order.
TernaryStringSet
s can contain the empty string, but cannot contain non-strings. Not even null
or undefined
.
Tree quality
Adding strings in sorted order produces a worst-case tree structure. This can be avoided by adding strings all at once using the constructor or addAll()
. Given sorted input, both of these methods will produce an optimal tree structure. If this is not practical, adding strings in random order usually yields a near-optimal tree. Calling balance()
will rebuild the tree in optimal form, but it can be expensive.
Similarly, after deleting a large number of strings, future search performance may be improved by calling balance()
.
Since most TernaryStringSet
methods are recursive, extremely unbalanced trees can provoke "maximum call stack size exceeded" errors.
Matching by code point
Some Unicode code points span two characters (char codes) in a JavaScript string. For example, the musical symbol 𝄞, code point U+1D11E, can be assigned to a JavaScript string as follows:
const clefG = "\ud834\udd1e";
Even though it represents a single symbol, the above string has a length of two. To avoid surprises, TernaryStringSet
matches by code point, not by char code. For example, since the above string is one code point, it would match getPartialMatchesOf(".")
and not getPartialMatchesOf("..")
.
Compaction
Calling compact()
can significantly reduce a set's memory footprint. For large sets of typical strings, typical results are a 50–80% reduction in size. However, no new strings can be added or deleted without undoing the compaction. Compaction is expensive, but can be a one-time or even ahead-of-time step for many use cases.
Serialization
A common use case is to match user input against a fixed set of strings. For example, checking input against a spelling dictionary or suggesting completions for partial input. In such cases it is often desirable to build a set ahead of time, serialize it to a buffer, and then save the buffer data on a server where it can be downloaded as needed. Recreating a set directly from buffer data is much faster than downloading a file containing the strings and inserting them into a new set on the client.
The following steps will make such ahead-of-time sets as small as possible:
- Create a set and insert the desired strings using
add()
or addAll()
. - Minimize the tree size by calling
balance()
followed by compact()
. - Create the buffer with
toBuffer()
and write the result to a file. - Optionally, compress the result and configure the server to serve the compressed version where supported by the browser.
To recreate the set, download or otherwise obtain the buffer data, then use TernaryTreeSet.fromBuffer(data)
.
Developing
Building from source requires Node.js. Clone the repository, then install development dependencies:
npm install
The TypeScript source is found under src
. Compiled output is written to lib
. To build the project:
npm run build
The included tsconfig.json
targets ES2015 (ES6). To target old JavaScript engines or browsers you will need to modify this configuration and/or use a tool like Babel.
The project includes an extensive suite of tests under src/tests
. To run all tests:
npm test
HTML documentation can be prepared automatically using TypeDoc:
npm run doc
Before submitting a pull request, format, lint, and run all tests:
npm run format && npm run lint && npm test