Security News
GitHub Removes Malicious Pull Requests Targeting Open Source Repositories
GitHub removed 27 malicious pull requests attempting to inject harmful code across multiple open source repositories, in another round of low-effort attacks.
FTSet, or Full Text Set, is an array-like data structure comparable to the javascript Set
object but designed exclusively for storing strings and providing ultra fast full-text searching.
FTSet is something in between a Set and an Array, it is 100% compatible with the Set API and most of the Array API.
Recently i found out that matching lines using linux's native grep
command on a text file is much faster than doing the same thing with the entire file loaded in memory as an array of lines, which essentially makes caching this kind of data not only useless but it actually made my entire application slower.
FTSet provides fast string matching features by storing its entire dataset as a single contiguous string block which makes the process of string matching an order of magnitude faster than looping over arrays, like an in-memory grep.
FTSet excels at storing a large number of strings with the purpose of quickly searching through them but may not be indicated for heavy read/write usage, check the benchmarks at the end of this readme.
Loading the full list of 500k+ catalogued asteroids in our solar system, downloaded from https://www.astro.com/ftp/swisseph/ephe/seasnam.txt;
Basic preparations:
npm install ftset
const { performance } = require("perf_hooks");
const { readFileSync } = require("fs");
const { execSync } = require("child_process");
const FTSet = require("ftset");
// approx. 10mb and 500k lines
let file = readFileSync("./seasnam.txt", "utf8");
// define "\n" as the delimiter, enabling the file to be used as is, no conversion or parsing needed
let fts = new FTSet(file, "\n");
// create an array of 500k items to compare with
let array = file.split("\n");
Case sensitive search for a single item:
time = performance.now();
result = array.find(n => n.includes("Eris"));
console.log(performance.now() - time);
console.log(result);
// => 10.449599981307983
// => 136199 Eris
time = performance.now();
result = fts.find("Eris");
console.log(performance.now() - time);
console.log(result);
// => 0.38879990577697754
// => 136199 Eris
// ~ 30x faster
Case insensitive regex global search:
// using array.filter() with regex match
time = performance.now();
result = array.filter(n => n.match(/eris/i));
console.log(performance.now() - time);
console.log(result);
/*
=> 30.59500002861023
=> [
'000871 Amneris',
'028688 Diannerister',
'084225 Verish',
'121633 Ronperison',
'136199 Eris',
'237845 Neris'
]
*/
// micro-optimized arrays
reg = /eris/i;
r = [];
time = performance.now();
for(let i = 0, x = 0, b = array.length; i < b; i++) {
if(array[i].match(reg)) {
r[x++] = array[i];
}
};
console.log(performance.now() - time);
console.log(r);
/*
=> 21.258599996566772
=> [
'000871 Amneris',
'028688 Diannerister',
'084225 Verish',
'121633 Ronperison',
'136199 Eris',
'237845 Neris'
]
*/
// about 30% performance gains but still slower than grep
// using grep for comparison
time = performance.now();
result = child_process.execSync("LC_ALL=C grep -i -F 'eris' seasnam.txt", {encoding:"utf8"});
console.log(performance.now() - time);
console.log(result);
/*
=> 13.034613132476807
=> 000871 Amneris
028688 Diannerister
084225 Verish
121633 Ronperison
136199 Eris
237845 Neris
*/
// much faster than any iteration method
// using FTSet.matchAll()
time = performance.now();
result = fts.matchAll(/eris/i);
console.log(performance.now() - time);
console.log(result);
/*
=> 3.0929999351501465
=> [
'000871 Amneris',
'028688 Diannerister',
'084225 Verish',
'121633 Ronperison',
'136199 Eris',
'237845 Neris'
]
*/
// ~ 8x faster than arrays
// ~ 4x faster than grep
new FTSet(data, delimiter)
-> FTSetCreate a FTSet instance.
data
-> A string, array or iterable to fill the instance with. Defaults to empty string.delimiter
-> A string to act as a delimiter. Defaults to the EOT character "\u0004"..delimiter
-> StringGetter for the current delimiter in use. Also a setter for redefining the delimiter across the entire dataset.
.size
-> NumberGetter for the number of items stored in the dataset. Updated on insert and delete operations.
.length
-> NumberGetter for the total string length of the entire dataset including delimiters.
.valueOf()
-> NumberAlias for .length
. Called automatically when coerced to number.
.toString()
-> StringReturns the entire dataset as a string including delimiters. Called automatically when coerced to string.
.toArray()
-> ArrayReturns the entire dataset as an array.
.first()
-> StringReturns the first item stored in the dataset.
.last()
-> StringReturns the last item stored in the dataset.
.has(string)
-> BooleanChecks if an exact match exists in the dataset. Returns a Boolean
indicating if the item exists.
string
-> A non-empty string to check..delete(string)
-> BooleanDelete an exact match from the dataset. Returns a boolean indicating if the delete was successful.
string
-> A non-empty string to delete from the dataset..clear()
-> VoidDeletes all items from the dataset.
.random()
-> StringReturns a random item from the dataset.
.find(query)
-> StringReturns the first item that includes the query. Case sensitive.
query
-> A string to search for..findAll(query)
-> ArrayReturns all items that include the query. Case sensitive.
query
-> A string to search for..match(regex)
-> StringReturns the first match found. Case sensitive unless the i
flag is used.
regex
-> A regular expression to search for..matchAll(regex)
-> ArrayReturns all matches found. Case sensitive unless the i
flag is used.
regex
-> A regular expression to search for..add(string)
-> NumberAlias of .push()
for compatibility with the javascript Set
object.
.push(string)
-> NumberInsert an item at the end of the dataset. Returns the total number of items after inserting.
string
-> A non-empty string to add to the dataset..unshift(string)
-> NumberInsert an item at the beginning of the dataset. Returns the total number of items after inserting.
string
-> A non-empty string to add to the dataset..pop()
-> StringRemove an item from the end of the dataset. Returns the removed item.
.shift()
-> StringRemove an item from the beginning of the dataset. Returns the removed item.
.concat(data)
-> thisAppends a number of items to the end of the dataset. Returns the current instance after appending.
data
-> The data to append, can be a String
, an Array
, an instance of FTSet
or an Iterable
..clone()
-> FTSetReturns a new instance of FTSet containing a full copy of the entire dataset.
.forEach(fn)
-> voidExecutes a function on each item of the dataset.
fn(item)
-> The function to execute. First argument is the current item being iterated..map(fn)
-> FTSetExecutes a function on each item of the dataset and returns a new instance of FTSet
from the results.
fn(item)
-> The function to execute. First argument is the current item being iterated..cursor(string)
-> ObjectCreates a cursor for navigating the dataset. If a string argument is given, the cursor will be set to the first item found using the .find()
method, otherwise it will start from the beginning. Returns a cursor object containing three functions, current()
(returns the current item), previous()
(moves the cursor to the previous item and returns it) and next()
(moves the cursor to the next item and returns it). The latter two functions will return null
once the end is reached.
string
-> An optional string to define as the initial item..values()
-> IterableReturns an Iterable
for compatibility with standard javascript classes.
.keys()
-> IterableAlias of .values()
for compatibility with standard javascript classes.
.entries()
-> IterableAlias of .values()
but calls back with [value,value]
instead of value
. For compatibility with standard javascript classes.
.reverse()
-> IterableReturns an Iterable
in reverse order, starting from the last item and iterating backwards until the first item.
[Symbol.iterator]
-> IteratorFTSet implements the Iterable
and Iterator
protocols so you can loop or iterate over an instance of FTSet directly.
Sample test run using the script in bench/bench.js
on Node.js 15.3.0. Your milleage may vary depending on your CPU, your node.js version and several other factors so you are advised to perform your own tests.
Insert operations are surprisingly on pair with arrays, except for array.unshift which is absolutely terrible with a large number of items. Arrays are not designed for adding items to the beginning because the entire array needs to be reindexed, FTSet on the other hand has no problems with it. Sets are unordered and are slower because all items are hashed to check for uniqueness on every insert.
99999 x 10 char | 9999 x 1000 char | |
---|---|---|
array.push() | 53.147 | 1571.854 |
set.add() | 71.045 | 1911.027 |
FTSet.push() | 54.402 | 1595.230 |
array.unshift() | 17773.719 | 2077.918 |
FTSet.unshift() | 52.290 | 1519.934 |
Search operations are what FTSet is designed for, delivering much faster performance accross the board.
99999 x 10 char | 9999 x 1000 char | |
---|---|---|
array.find(includes) near beginning | 0.081 | 0.061 |
FTSet.find() near beginning | 0.017 | 0.012 |
array.find(includes) near end | 6.319 | 159.263 |
FTSet.find() near end | 0.565 | 2.734 |
array + Math.random | 0.046 | 0.009 |
set + Array.from + Math.random | 0.681 | 0.051 |
FTSet.random() | 0.078 | 0.009 |
array.filter(includes) | 6.594 | 4.273 |
FTSet.findAll() | 0.367 | 3.142 |
array.filter(regex) | 6.615 | 8.152 |
FTSet.matchAll() | 0.943 | 7.665 |
Delete operations have mixed results, with Sets being the fastest at deleting, Arrays at popping and FTSet at shifting.
99999 x 10 char | 9999 x 1000 char | |
---|---|---|
array.splice() | 0.344 | 0.787 |
set.delete() | 0.051 | 0.006 |
FTSet.delete() | 0.188 | 0.365 |
array.pop() | 3.361 | 2.282 |
FTSet.pop() | 13.276 | 17.775 |
array.shift() | 5133.597 | 3.788 |
FTSet.shift() | 8.616 | 3.083 |
Iterating over the entire data set is faster with arrays as expected, with some strange behavior coming from Sets.
99999 x 10 char | 9999 x 1000 char | |
---|---|---|
for(let i of array) | 4.210 | 2.130 |
for(let i of set) | 5.485 | 169.628 |
for(let i of FTSet) | 9.509 | 3.165 |
array.forEach() | 3.875 | 1.251 |
set.forEach() | 3.085 | 1.106 |
FTSet.forEach() | 4.746 | 3.225 |
array.map() | 4.540 | 1.533 |
FTSet.map() | 5.516 | 4.597 |
javascript string concatenation makes use of v8's ConsString class which improves performance by not concatenating immediately and instead batch-processing several concats at once when needed. This may cause the first read operation in js to be much slower than normal when performed after a write operation because the read will force any pending concatenations in v8 to be processed before the string becomes readable. This slowness scales with data size, therefore FTSet may not be recommended for general purpose read/write usage on large datasets, it works much better for read-only or read-many/write-few use cases in those situations.
Example of the delayed first read using the above asteroids dataset:
time = performance.now();
result = fts.find("Eris");
console.log(performance.now() - time, result);
// => 0.39865487397210759 136199 Eris
// write some data
fts.push("abc");
time = performance.now();
result = fts.find("Eris");
console.log(performance.now() - time, result);
// => 6.963500022888184 136199 Eris
// first read up to 20x slower due to lazy concatenation
// generally should still faster than array.find()
time = performance.now();
result = fts.find("Eris");
console.log(performance.now() - time, result);
// => 0.38657467654108987 136199 Eris
// performance is back to normal from second read onwards until the next write
FAQs
A Set-like data structure designed for ultra fast full-text searching
We found that ftset demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
GitHub removed 27 malicious pull requests attempting to inject harmful code across multiple open source repositories, in another round of low-effort attacks.
Security News
RubyGems.org has added a new "maintainer" role that allows for publishing new versions of gems. This new permission type is aimed at improving security for gem owners and the service overall.
Security News
Node.js will be enforcing stricter semver-major PR policies a month before major releases to enhance stability and ensure reliable release candidates.