Research
Security News
Malicious npm Packages Inject SSH Backdoors via Typosquatted Libraries
Socket’s threat research team has detected six malicious npm packages typosquatting popular libraries to insert SSH backdoors.
com.illucit:instatrie
Advanced tools
Trie Data Structure for high-performance String search and filtering
High-performance prefix index and search word highlighter, implemented with trie data structures.
To use InstaTrie
in your Maven project, you can simply add instatrie
as dependency:
<dependency>
<groupId>com.illucit</groupId>
<artifactId>instatrie</artifactId>
<version>2.0.0/version>
</dependency>
The only dependency of instatrie
is the library lucene-utils
for String normalization, which has no further dependencies itself.
A "Trie" is a tree-like data structure, which has a character on each edge and where each node represents a prefix. It is known to be extremely efficient for storing and retrieving all words with a common prefix. For further theoretical information about tries, you can look in the Wikipedia.
The interface com.illucit.instatrie.trie.PrefixDictionary<T>
and class com.illucit.instatrie.trie.Trie<T>
provide an implementation of this data structure, which can hold a set of stored Strings and can retrieve all matching Strings with a given prefix.
For each inserted String you can also add a payload object of the generic type <T>
.
Although the Trie
class was basically implemented for the Prefix Index functionality of this library, you can use it as its own.
The Trie
class does not implement java.util.Collection
, but provides the most common methods to manipulate data. See the public
interface com.illucit.instatrie.trie.PrefixDictionary<T>
for a full list of supported operations.
The "Prefix Index" data structure resembles a generic collection of models of a specific type, which can be filtered by search queries.
First of all, you define a function to extract a searchable String from each model, which will be indexed. This function is called a "Word Splitter" and
has to be realized by a class that implements.com.illucit.instatrie.splitter.WordSplitter<T>
.
E.g. if your model type is a class Person
with properties firstName
and lastName
, your Word Splitter should take an instance of Person
and return a Set containing the first name and last name of the person.
You can also use the more convenient class com.illucit.instatrie.splitter.StringWordSplitter<T>
,
for which you only have to prove a function get one single String from your model and the class will provide all "indexable" text fragments by a regular expression (per default all
connected alphanumeric substrings) from this String representation. This also has the advantage that you don't need to provide an individual splitting into different words (e.g. if your person has multiple first names), but you only give the regular expression a String to work on.
E.g. for your example with the Person
class:
WordSplitter<Person> splitter;
splitter = new StringWordSplitter<>(p -> p.getFirstName() + " " + p.getLastName());
So if you have a Person
with firstName = "John Walter"
and lastName = "Doe"
, your Splitter will provide a set with "John"
, "Walter"
and `"Doe"``.
Note: If your model contains values with HTML tags which you want to make searchable (and therefore include in the splitter), make sure you remove the tags before (e.g. with an HTML parser utility like JSoup).
If you actually use the StringWordSplitter
class, the given String can be normalized, i.e. special unicode characters are replaced by their "basic" characters where possible,
e.g. "ö"
-> "o"
, "Â"
-> "A"
or "©"
-> "c"
.
The interface com.illucit.instatrie.index.PrefixIndex<T>
and class com.illucit.instatrie.index.TriePrefixIndex<T>
provide an implementation
for the "Prefix Index" collection type. The collection basically consists of an (ordered) list of models of the generic type "<T>"
, but provides additional
methods to query the list of models for a search String and get back a (reduced, still ordered) list of models that contain this search query.
You can create and instance of the com.illucit.instatrie.index.TriePrefixIndex<T>
class with an empty index by providing a Splitter for "<T>"
.
Once the prefix index instance is created, you can add data to it by calling the TriePrefixIndex.createIndex(Collection<T> models)
method with a new list of models, which will overwrite the previous index and build a new index with the given collection and the fixed word splitter.
TriePrefixIndex<Person> index = new TriePrefixIndex<>(wordSplitter);
List<Person> personList = ...
index.createIndex(personList);
The interface com.illucit.instatrie.index.PrefixSearc<T>
describes the basic operations to query the search index for data.
The underlying implementation provides the queried data as Java8 streams, but there are also return function as Lists.
In order to use functional composition of the search data, the PrefixSearch
interface has filter
and map
methods,
which return a dependent PrefixSearch
instance coupled to the index of the original PrefixSearch
but applying the
given mapping- or filter-function to each results.
The "Prefix Index" collection provides two basic query operations: search
and searchExact
. While the operation search
will retrieve all models which
match all given search terms as prefix, the operation searchExact
will get you all models which contain all given search terms as complete words. Searching is
alsways case-insensititve, so the case of the model search terms and the query String won't matter.
First of all, the search query will itself always be splitted into individual words with the help of a StringSearchSplitter
. So if you search for multiple words at once,
it is always considered as "AND" search, meaning all searched terms must be found for a hit. There is currently no support for query operations, as
the "Prefix Index" is built for filtering and narrowing something down, not as full-fledged search engine.
"DOE"
will only search for the word "doe"
."doe, john"
, "john doe"
and "......(joHN]] DOe?"
will all result in the search for the two terms "john"
and "doe"
.When one of the search methods is called, the search words are compared with the splitted words in the index for each model individually,
so a model in the index can either match the criteria or not.
Given the Person
model mentioned above, the model has the index terms "john"
, "walter"
and "doe"
.
"doe"
the model will match, as the word "doe"
is fully contained in the set of index terms"walt"
the model will not match, as the word "walt"
is only a prefix of the index term "walter"
, but not fully contained in the index terms"walt"
the model will match, because "walt"
is a prefix of the index term "walter"
"jane"
and "doe"
the model will not match, because "jane"
is not a prefix of an index term (although "doe"
is)Note:If a model matches an exact query, it always also matches the non-exact query, because each word is considered a prefix of itself
The PrefixSearch
interface does not directly extend a class from the Java Collection API. It provides, however, a method decorateAsList
which will return a proxy object which implements java.util.List<T>
and is still connected to the original PrefixSearch
instance.
All write-operations of the List
-proxy will result in UnsupportedOperationException
, and all read-operations will be delegated to the
internal list of models contained in the current index of the PrefixSearch
.
The returned proxy is permanently connected to the original TriePrefixIndex
instance, so when you rebuild the index on one of the instances,
it will also change the index of the other instance. All Trie query operations on both instances will return exactly the same result.
The class TriePrefixIndex
is designed for concurrent access. An instance of TriePrefixIndex
always takes the most recent available index data.
If you call the createIndex
method, until the method is finished all method call querying the prefix index will still be going to the current index.
Once the index calculation is done, it will be set as most recent index for the TriePrefixIndex
in an atomic operation and all future query calls will use the new index.
You can use a TriePrefixIndex
instance from multiple threads simultaniously without gettings "concurrent modification" problems, as each query call will use the index data
from the beginning of the call during the whole execution of the query.
If multiple Threads share an instance of TriePrefixIndex
and one thread refreshes the index, all Threads will get the new index from the moment it is calculated completely.
In Java8 it is possible to use "Lambda Expressions" to provide implementations of interfaces with only 1 unimplemented method (like java.util.Function
).
Instead of providing a WordSplitter
object when creating a TriePrefixIndex
, you can also directly pass a lambda expression for a StringWordSplitter
in the constructor:
TriePrefixIndex<Person> index = new TriePrefixIndex<>(p -> p.getFirstName() + " " + p.getLastName());
The PrefixIndex
also provides a method searchStream
which behaves exactly like the search
method, but instead of returning a List
of models, it will return
a Java8 java.util.stream.Stream<T>
containing the models.
When you use the PrefixIndex
to filter down elements in a graphical user interface (GUI), especially on a web page, it is helpful to see where a model matched the search terms.
For this, the InstaTrie library also provides String highlighting capabilities.
The interface com.illucit.instatrie.highlight.SubwordHighlighter
contains methods to generate a String segmentation of a given model String property,
in which the highlighted parts can be identified. The default implementation of SubwordHighlighter
is the class StringWordSplitter
, also used in splitting.
The class TriePrefixIndex
also implements the interface SubwordHighlighter
and delegates all its calls to the StringWordSplitter
used
for splitting the search queries.
There are two different modi in which the highlighting can run: "HTML mode" and "Text mode". The "HTML mode" is designed for the cases in which the searchable
model properties can contain simple HTML markup, e.g. bold, underlined, subscript etc. Because InstaTrie has no bundled HTML parser, only simple tags without attributes
and with distinct closing tags (i.e. no "empty" tags) are supported. These tags are ignored when calculating the search highlights, even if the highlights span across multiple tags.
In "Text mode" the given model String is taken "as-is". In both cases the matching is ignoring case and also ignoring unicode specialities (unless deactivated
in the StringWordSplitter
), so that e.g. "DØĚ"
will match "doe"
.
To use the highlighting, you can call the methods highlightSubwordPrefixes(value, queryWords)
or highlightSubwordPrefixesWithHtml(value, queryWords)
on a SubwordHighlighter
instance. The first parameter is the model String value that you want to render with highlighting and the second parameter is a Set
with the
search termn that were used for search, as returned by the WordSplitter
.
The TriePrefixIndex
class also provides the methods getHighlighted(modelValue, query)
and getHighlightedHtml(modelValue, query)
which do the splitting
of the query String query
into the queryWords
set for you with the internal query splitter.
The returned value is an instance of the class com.illucit.instatrie.highlight.HighlightedString
which contains a method getSegments()
returning a list of all
highlighted and non-highligted segmenty, you can use for rendering.
E.g.
Person john = new Person("John Walter", "Doe");
Person jane = new Person("Jane", "Doe");
TriePrefixIndex<Person> index = ...
String search = "walt do";
HighlightedString highlighted;
highlighted = index.getHighlighted(john.getFirstName(), search);
for (HighlightSegment segment : highlighted.getSegments()) {
if (segment.isHighlighted()) {
System.out.print("[[" + segment.getValue() + "]]");
} else {
System.out.print(segment.getValue());
}
}
System.out.println();
// Prints: John [[Walt]]er
highlighted = index.getHighlighted(john.getLastName(), search);
for (HighlightSegment segment : highlighted.getSegments()) {
if (segment.isHighlighted()) {
System.out.print("[[" + segment.getValue() + "]]");
} else {
System.out.print(segment.getValue());
}
}
System.out.println();
// Prints: [[Do]]e
highlighted = index.getHighlighted(jane.getFirstName(), search);
for (HighlightSegment segment : highlighted.getSegments()) {
if (segment.isHighlighted()) {
System.out.print("[[" + segment.getValue() + "]]");
} else {
System.out.print(segment.getValue());
}
}
System.out.println();
// Prints: Jane
There are some feature ideas which didn't make it into the current implementation:
FAQs
Trie Data Structure for high-performance String search and filtering
We found that com.illucit:instatrie demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 0 open source maintainers collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
Security News
Socket’s threat research team has detected six malicious npm packages typosquatting popular libraries to insert SSH backdoors.
Security News
MITRE's 2024 CWE Top 25 highlights critical software vulnerabilities like XSS, SQL Injection, and CSRF, reflecting shifts due to a refined ranking methodology.
Security News
In this segment of the Risky Business podcast, Feross Aboukhadijeh and Patrick Gray discuss the challenges of tracking malware discovered in open source softare.