What is cspell-trie-lib?
The cspell-trie-lib package provides utilities for working with Trie data structures, which are particularly useful for spell checking, auto-completion, and other text processing tasks. It allows for efficient storage and retrieval of words and prefixes.
What are cspell-trie-lib's main functionalities?
Creating a Trie
This feature allows you to create a Trie and insert words into it. The `has` method checks if a word exists in the Trie.
const { Trie } = require('cspell-trie-lib');
const trie = new Trie();
trie.insert('hello');
trie.insert('world');
console.log(trie.has('hello')); // true
console.log(trie.has('world')); // true
console.log(trie.has('hell')); // false
Finding Words with a Given Prefix
This feature allows you to find all words in the Trie that start with a given prefix. The `findWithPrefix` method returns an array of words that match the prefix.
const { Trie } = require('cspell-trie-lib');
const trie = new Trie();
trie.insert('hello');
trie.insert('hell');
trie.insert('heaven');
trie.insert('heavy');
const wordsWithHe = trie.findWithPrefix('he');
console.log(wordsWithHe); // ['hello', 'hell', 'heaven', 'heavy']
Removing a Word from the Trie
This feature allows you to remove a word from the Trie. The `remove` method deletes the specified word, and the `has` method can be used to verify its removal.
const { Trie } = require('cspell-trie-lib');
const trie = new Trie();
trie.insert('hello');
trie.insert('world');
trie.remove('hello');
console.log(trie.has('hello')); // false
console.log(trie.has('world')); // true
Other packages similar to cspell-trie-lib
trie-prefix-tree
The trie-prefix-tree package provides similar functionality for creating and managing Trie data structures. It supports insertion, deletion, and searching for words and prefixes. Compared to cspell-trie-lib, trie-prefix-tree offers a more straightforward API but may lack some advanced features.
dawg-lookup
The dawg-lookup package implements a Directed Acyclic Word Graph (DAWG), which is a more space-efficient alternative to a Trie. It supports similar operations like insertion, deletion, and prefix searching. DAWG structures are generally more memory-efficient but can be more complex to implement and manage compared to traditional Tries.
radix-tree
The radix-tree package provides a compact and efficient implementation of a Radix Tree, which is a space-optimized version of a Trie. It supports similar operations such as insertion, deletion, and prefix searching. Radix Trees are generally more memory-efficient and can be faster for certain operations compared to traditional Tries.
cspell-trie
Trie library for use with cspell
This library makes it easy to build a Trie from a word list.
The resulting trie can then be compressed into a DAFSA (also known as a DAWG).
Installation
npm install -S cspell-trie-lib
File Format V3
TrieXv3
base=10
# Comments
__DATA__
The header has two parts.
- TrieXv3 -- the format identifier.
- base -- references are stored using this base; 10, 16, and 32 are common.
The higher the base, the smaller the file. The maximum is 36.
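The header layout above can be read with a few lines of code. The following is a minimal sketch, not the library's own importer: the function name `parseHeaderV3` is hypothetical, and it assumes that lines starting with `#` are comments and that everything after the `__DATA__` line is the data stream.

```javascript
// Hypothetical helper: split a TrieXv3 file into its base and its data stream.
// Assumes '#' starts a comment line and '__DATA__' ends the header.
function parseHeaderV3(text) {
    const lines = text.split('\n');
    const dataLine = lines.indexOf('__DATA__');
    if (dataLine < 0) throw new Error('Missing __DATA__ marker');

    // Keep only meaningful header lines (drop blanks and comments).
    const header = lines.slice(0, dataLine).filter((line) => line !== '' && !line.startsWith('#'));
    if (header[0] !== 'TrieXv3') throw new Error('Not a TrieXv3 file');

    const baseLine = header.find((line) => line.startsWith('base='));
    if (baseLine === undefined) throw new Error('Missing base');
    const base = Number.parseInt(baseLine.slice('base='.length), 10);
    if (!(base >= 2 && base <= 36)) throw new Error('Base out of range');

    return { base, data: lines.slice(dataLine + 1).join('\n') };
}

const parsed = parseHeaderV3('TrieXv3\nbase=10\n# Comments\n__DATA__\njoust$er$');
console.log(parsed.base); // 10
console.log(parsed.data); // joust$er$

// A larger base shortens references: node number 120 in base 10, 16, and 32.
console.log((120).toString(10), (120).toString(16), (120).toString(32)); // 120 78 3o
```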
Data
The data is a stream of characters and operators. Each character represents a node in the Trie. The operators adjust the position in the Trie.
Conceptual Format
Given a sorted list of words:
joust
jouster
jousting
joy
joyful
joyfuller
joyfullest
It is possible to think of the same list stored as a series of operations.
| op  | Meaning             |
| --- | ------------------- |
| <   | remove 1 character  |
| <<  | remove 2 characters |
| <<< | remove 3 characters |
| <2  | remove 2 characters |
| <3  | remove 3 characters |
| $   | end of word         |
| _   | visual place holder |
joust$
_____er$
_____<<
_____ing$
__<<<<<<
__y$
___ful$
______ler$
________<
________st$
Becomes:
joust$er$<2ing$<6y$ful$ler$<st$
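The step from the sorted word list to the operation string can be sketched as follows. This is an illustration of the conceptual format only, not the library's serializer: the function names are hypothetical, and the sketch assumes the words are sorted, unique, and made of plain letters (no escaping is done).

```javascript
// Number of leading characters two strings share.
function commonPrefixLength(a, b) {
    let i = 0;
    while (i < a.length && i < b.length && a[i] === b[i]) i += 1;
    return i;
}

// Emit '<' for one character, '<n' for 2-9, and repeat for larger counts.
function removeOps(count) {
    let ops = '';
    while (count > 0) {
        const n = Math.min(count, 9);
        ops += n === 1 ? '<' : `<${n}`;
        count -= n;
    }
    return ops;
}

// Hypothetical encoder: back up to the shared prefix, then append the rest.
function encodeConceptual(sortedWords) {
    let previous = '';
    let out = '';
    for (const word of sortedWords) {
        const keep = commonPrefixLength(previous, word);
        out += removeOps(previous.length - keep) + word.slice(keep) + '$';
        previous = word;
    }
    return out;
}

const words = ['joust', 'jouster', 'jousting', 'joy', 'joyful', 'joyfuller', 'joyfullest'];
console.log(encodeConceptual(words)); // joust$er$<2ing$<6y$ful$ler$<st$
```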
Trie:
j─o┬u─s─t┬$
   │     ├e─r─$
   │     └i─n─g─$
   └y┬$
     └f─u─l┬$
           └l─e┬r─$
               └s─t─$
Data Format
| op  | Meaning |
| --- | ------- |
| <   | remove 1 character |
| <n  | remove n characters, where n is [2-9]; to remove 12 characters use <9<3 |
| $   | end of word |
| \   | escape next character. All characters can be escaped: \\ -> \, \# -> #, \a -> a |
| #n; | reference to an already imported trie node, where n is the node number |
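To make the operators concrete, here is a minimal decoding sketch. It is not the library's importer: the name `decodeOps` is hypothetical, it covers only `<`, `<n`, `$`, and `\` as described in the table, and it rejects `#n;` references instead of resolving them, so it is not a loader for real files.

```javascript
// Hypothetical decoder for the remove, end-of-word, and escape operators.
function decodeOps(data) {
    const words = [];
    let current = '';
    for (let i = 0; i < data.length; i += 1) {
        const ch = data[i];
        if (ch === '\\') {
            // Escape: take the next character literally.
            i += 1;
            if (i >= data.length) throw new Error('Dangling escape');
            current += data[i];
        } else if (ch === '$') {
            // End of word: record the current path.
            words.push(current);
        } else if (ch === '<') {
            // '<' removes one character; '<n' (n in 2-9) removes n.
            const next = data[i + 1];
            let count = 1;
            if (next >= '2' && next <= '9') {
                count = Number(next);
                i += 1;
            }
            current = current.slice(0, Math.max(0, current.length - count));
        } else if (ch === '#') {
            throw new Error('References (#n;) are not supported in this sketch');
        } else {
            current += ch;
        }
    }
    return words;
}

console.log(decodeOps('joust$er$<2ing$<6y$'));
// [ 'joust', 'jouster', 'jousting', 'joy' ]
console.log(decodeOps('ref \\#$')); // [ 'ref #' ]
```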
Sample Data
Big Apple$8races\: \{\}\[\]\(\)$9<5
New York$7umbers \0\1\2\3\4\5\6\7\8\9$9<9
ap#6;<rrow \<$7
big a#5;<4urned$r$2ing$3s$$4
chalk#56;<3u#54;<3
eol \\n$3w \$$4scape \\\$8
fun journey$7wal#27;<7
journalism$tic$2$3s$$2eyer$2man$2e#103;<2$4ste#101;<i#58;<$3vialit#85;<2$4wly$$2yfuller$st$4ness$4$3lessn#120;<$4ou#125;<2ridde#103;<2er$$i#58;<3od#8;<3
stic#27;<4$3
lift#56;<3ong w#86;<6
ref \#$5
t#61;<
wa#62;<2
File Format V1
TrieXv1
base=10
The header has two parts.
- TrieXv1 -- the format identifier.
- base -- offsets are stored using this base; 10, 16, and 32 are common.
The higher the base, the smaller the file. The maximum is 36.
Data
The first line of data is always a *
Each line is a node in the Trie.
The format of each line is:
star [char index [, char index]*]
- star - the presence of a star indicates that the node is the ending of a word.
- char - a character that can be appended to the word followed by the node at index.
- index - the offset in the list of nodes to continue appending
In other words, each line has an optional *
followed by 0 or more (char, index) pairs.
A missing index implies an index of 0, which is the end of word flag.
Example Line: *s1,e
-- The word can stop here, or add an s and continue at node 1, or add an e
Example:
Word List:
- walk
- walked
- walker
- walking
- walks
- talk
- talks
- talked
- talker
- talking
becomes
Output: (Offsets are added for clarity, but do not exist in output)
Offset Output
------- --------
TrieXv1
base=10
0 *
1 d,r
2 g
3 n2
4 *e1,i3,s
5 k4
6 l5
7 a6
8 t7,w7
The root of the trie is the last offset, 8.
It is designed for the entire trie to be in memory, which is why the root is at the end.
This allows for efficiently building the trie as the file loads line by line, because
each line can only refer to previous lines.
How to walk the data to see if "talks" is in it.
- Start with the root at offset 8.
- t found goto 7
- a found goto 6
- l found goto 5
- k found goto 4
- s found stop (goto 0 is stop).
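The walk above can be sketched in code. This is an illustration of the V1 layout as described here, not the library's importer: `parseTrieV1` and `hasWord` are hypothetical names, and the parser assumes no word character collides with the `*` and `,` syntax.

```javascript
// Hypothetical parser: turn TrieXv1 text into an array of nodes.
function parseTrieV1(text) {
    const lines = text.split('\n').filter((line) => line !== '');
    if (lines[0] !== 'TrieXv1') throw new Error('Not a TrieXv1 file');
    if (!lines[1].startsWith('base=')) throw new Error('Missing base');
    const base = Number.parseInt(lines[1].slice('base='.length), 10);

    return lines.slice(2).map((line) => {
        // A leading star marks the node as the end of a word.
        const isWord = line.startsWith('*');
        const body = isWord ? line.slice(1) : line;
        const children = new Map();
        for (const pair of body === '' ? [] : body.split(',')) {
            // First character is the letter; the rest is the offset (0 if missing).
            const index = pair.length > 1 ? Number.parseInt(pair.slice(1), base) : 0;
            children.set(pair[0], index);
        }
        return { isWord, children };
    });
}

// Walk from the root (the last node) one character at a time.
function hasWord(nodes, word) {
    let node = nodes[nodes.length - 1];
    for (const ch of word) {
        const next = node.children.get(ch);
        if (next === undefined) return false;
        node = nodes[next];
    }
    return node.isWord;
}

const sample = ['TrieXv1', 'base=10', '*', 'd,r', 'g', 'n2', '*e1,i3,s', 'k4', 'l5', 'a6', 't7,w7'].join('\n');
const nodes = parseTrieV1(sample);
console.log(hasWord(nodes, 'talks')); // true
console.log(hasWord(nodes, 'tal')); // false
```

Because each line refers only to earlier offsets, the node array can be filled in a single pass as the file is read.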