eclipsesearch
A quick, well-featured full-text search engine that works in-browser and in Node.js.
EclipseSearch is a full-text search engine written entirely in JavaScript. This means it can run on the back-end (e.g. in Node.js) and on the front-end (in the browser).
View the online demos here!
You can easily search your database with queries that give different weights to different fields in structured documents. This is useful because users typically want matches in titles to be prioritized over matches in, say, descriptions.
Partial word search is also supported without the need for more expensive fuzzy search.
Fuzzy search allows for fault-tolerance (particularly spelling mistakes) within queries to the database, allowing applications to have a better user experience.
The parameters that the fuzzy search relies on can be changed, so you can tune fuzzy matching to your needs.
You can easily fine-tune the constants in the BM25F-based ranking algorithm (in fuzzy and regular search), allowing it to be configured to your specific purpose.
Both the b-value (per field in a structured document) and the k-value can be adjusted, allowing you to create an optimal ranking algorithm for your purpose.
Structured search can be used to search for different terms in specified fields, or terms in only specified fields, allowing for more accurate and specific search.
Operators can also be used to combine queries, to create even more detailed and advanced queries. Multiple operators exist including "OR" and "AND".
Multiple languages can be specified and used to parse documents and queries, meaning many languages can be potentially supported.
Each different language can have a different stemmer, list of stop words, word splitter and punctuation remover.
Searching with EclipseSearch requires starting a model controller, which is provided with a document schema outlining the design of the document. Each model controller is isolated from other model controllers. Searching can then be performed on these controllers.
When documents are provided to a model controller, they are first parsed with the chosen language. This means passing them through a pipeline consisting of a punctuation remover, a word splitter, a stop word remover and a stemmer. Each word is then put through a tokenizer, so partial word matching can be performed. Each document must have a unique document ID. The processed document is then stored in indexes. If configured to do so, the engine will also store selected parts of the documents to return to the user on a search. If not, only the IDs of the documents will be returned.
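The pipeline described above can be sketched as follows. This is a minimal illustration, not EclipseSearch's implementation: the function names, the punctuation list, the stop words and the stem table are all hypothetical stand-ins.

```javascript
// Hypothetical language settings (not EclipseSearch's real data).
const punctuation = [".", ",", "!", "?", ";", ":"];
const stopWords = ["a", "the", "is", "of"];
const stemTable = { cakes: "cake", baking: "bake" };

// 1. Remove punctuation characters from the raw text.
function removePunctuation(text, marks) {
  return [...text].filter((ch) => !marks.includes(ch)).join("");
}

// 2. Split the text into words along whitespace.
function splitWords(text) {
  return text.split(/\s+/).filter((word) => word.length > 0);
}

// 3. Drop stop words.
function filterStopWords(words, stops) {
  return words.filter((word) => !stops.includes(word));
}

// 4. Stem words through a lookup table (word -> stem).
function stemWords(words, table) {
  return words.map((word) =>
    Object.prototype.hasOwnProperty.call(table, word) ? table[word] : word
  );
}

// Run the four stages in order on lowercased text.
function processText(text) {
  const clean = removePunctuation(text.toLowerCase(), punctuation);
  const words = filterStopWords(splitWords(clean), stopWords);
  return stemWords(words, stemTable);
}

console.log(processText("The cakes, baking is fun!")); // [ 'cake', 'bake', 'fun' ]
```

Each stage takes the previous stage's output, which is why any one of them can be swapped for a language-specific version without touching the others.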
Searches can then be performed. A ranking algorithm, detailed below, is applied to each document to determine the order in which documents are returned and, if a limit is applied, which documents are returned. This ranking algorithm can be disabled. Different fields in the document can have different weights (the significance of terms found there) and whole documents can have different boosts (increasing the score of the entire document if search terms are found in it).
Given a query $Q$, which contains the terms $q_1, \dots, q_n$, the engine first fetches every indexed document that contains any of the terms, ranks each one with the ranking algorithm, and then returns the documents with the highest scores. EclipseSearch uses a ranking algorithm based on BM25F, a modified version of BM25 that allows different fields to have different weights. In the standard BM25F form, the score of a document $D$ is:

$$\text{score}(D, Q) = d \cdot \sum_{i=1}^{n} \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right) \cdot \frac{\tilde{f}(q_i, D)}{k + \tilde{f}(q_i, D)}$$

where $\tilde{f}(q_i, D)$ is defined as the following:

$$\tilde{f}(q_i, D) = \sum_{c} \frac{w_c \cdot f(q_i, D_c)}{(1 - b_c) + b_c \cdot \frac{l_c}{\text{avgl}_c}}$$

where $d$ is the boost applied to the document, $k$ is a constant, which has to be greater than 0 and is set to 2 by default, $N$ is the total number of documents, $n(q_i)$ is the number of documents in which $q_i$ occurs, $f(q_i, D_c)$ is the number of occurrences of the term $q_i$ in the field $c$ of the document $D$, $w_c$ is the weight applied to the field $c$, $b_c$ is a constant for the field $c$, which is set between 0 and 1, $l_c$ is the number of terms in the field $c$ of the document and $\text{avgl}_c$ is the average length of the field $c$ across all documents.
As said above, $b$ is set as $0 \le b \le 1$. The larger the value of $b$ for a field, the more aggressively the search normalizes the frequency of each term by field length. For example, in the movies demo, if you search for "d" with a $b$ value of 1, it will return "Addicted" first, as that has the highest density of the search term relative to the number of words. If you search for "d" with a $b$ value of 0, it returns "Hi Diddle Diddle", as that has the highest count of the search term regardless of the number of words.
As said above, $k$ has to be $k > 0$. A lower value of $k$ means the algorithm gives a high score to documents which cover many of the terms, whereas a high value of $k$ means the algorithm gives a high score to the documents with the highest term frequency.
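The ranking described above can be sketched in a few lines. This is a minimal illustration that assumes the standard BM25F form with a Lucene-style IDF; the function names, the document shape (`fields`, `boost`) and the exact IDF expression are assumptions, not EclipseSearch's actual code.

```javascript
// Field-weighted, length-normalized frequency of one term in one document.
function weightedTf(term, doc, fields, avgLen) {
  let total = 0;
  for (const [name, { weight, b }] of Object.entries(fields)) {
    const words = doc.fields[name] || [];
    const f = words.filter((w) => w === term).length;
    // Guard against an empty field across the whole corpus.
    const ratio = avgLen[name] > 0 ? words.length / avgLen[name] : 0;
    total += (weight * f) / (1 - b + b * ratio);
  }
  return total;
}

// BM25F-style score of one document for a list of query terms.
function score(queryTerms, doc, corpus, fields, k = 2) {
  const N = corpus.length;
  const avgLen = {};
  for (const name of Object.keys(fields)) {
    const sum = corpus.reduce((acc, d) => acc + (d.fields[name] || []).length, 0);
    avgLen[name] = sum / N;
  }
  let total = 0;
  for (const term of queryTerms) {
    // n: number of documents containing the term in any field.
    const n = corpus.filter((d) =>
      Object.values(d.fields).some((words) => words.includes(term))
    ).length;
    const idf = Math.log((N - n + 0.5) / (n + 0.5) + 1);
    const tf = weightedTf(term, doc, fields, avgLen);
    total += idf * (tf / (k + tf)); // k controls how quickly frequency saturates
  }
  return (doc.boost ?? 1) * total;
}

// Hypothetical two-document corpus; "name" outweighs "description".
const rankFields = { name: { weight: 10, b: 0.5 }, description: { weight: 1, b: 0.5 } };
const rankCorpus = [
  { id: 1, boost: 1, fields: { name: ["cake"], description: ["a", "very", "nice", "cake"] } },
  { id: 2, boost: 1, fields: { name: ["bread"], description: ["goes", "well", "with", "cake"] } },
];
const cakeScores = rankCorpus.map((doc) => score(["cake"], doc, rankCorpus, rankFields));
console.log(cakeScores[0] > cakeScores[1]); // true: the match in "name" ranks document 1 first
```

Because the field weights are applied before the saturation step, a single match in a heavily weighted field can outrank several matches in a lightly weighted one.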
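The fuzzy search ranking algorithm is slightly different:

$$\text{score}(D, Q) = d \cdot \sum_{i=1}^{n} \sum_{t \in L(q_i)} pw^{\,\text{lev}(q_i, t)} \cdot \ln\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right) \cdot \frac{\tilde{f}(t, D)}{k + \tilde{f}(t, D)}$$

All notation is the same as in the above equation, apart from the following additions: $L(q_i)$ returns the set of possible words that are within the allowed Levenshtein distance of the term $q_i$, $\text{lev}(a, b)$ returns the Levenshtein distance between the two provided arguments, and $pw$ is a constant that determines how much a term's Levenshtein distance matters for ranking. The only logical values are $0 < pw \le 1$, where a value closer to 0 means a more aggressive adjustment for distance.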
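The distance-based adjustment can be sketched as follows. The Levenshtein function is the standard dynamic-programming algorithm; the `fuzzyCandidates` helper and the use of `pw` raised to the power of the distance as a penalty factor are assumptions made for illustration, not a confirmed part of EclipseSearch.

```javascript
// Standard Levenshtein distance, computed row by row.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: indexed words within searchDepth of the query term,
// each with an assumed penalty factor of pw ** distance.
function fuzzyCandidates(term, vocabulary, searchDepth = 2, pw = 0.3) {
  return vocabulary
    .map((word) => ({ word, distance: levenshtein(term, word) }))
    .filter(({ distance }) => distance <= searchDepth)
    .map(({ word, distance }) => ({ word, distance, factor: pw ** distance }));
}

console.log(fuzzyCandidates("cak", ["cake", "bake", "bread"]));
// keeps "cake" (distance 1) and "bake" (distance 2); "bread" (distance 4) is dropped
```

With `pw` at 1 every candidate counts equally, and as `pw` approaches 0 only near-exact matches contribute meaningfully.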
If you want to use EclipseSearch with Node.js, you can install it through npm, like so:
$ npm i eclipsesearch
And then require it in a file:
const EclipseSearch = require("eclipsesearch");
Alternatively, you can use it in the browser, by adding the following script before (in the HTML) any other script referencing EclipseSearch:
<script src="https://unpkg.com/eclipsesearch/dist/EclipseSearch.min.js"></script>
Basic usage could then follow like so:
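// Create the engine, then a model controller from a document schema.
const engine = new EclipseSearch();
const model = engine.model({
    name: {
        type: "text",
        weight: 10, // matches in the name count ten times as much
        language: "en",
        return: true,
    },
    description: {
        type: "text",
        return: true,
    },
    price: {
        index: false, // returned with results, but not searchable
        return: true,
    },
    id: "id", // the unique document ID field
});

// createDocument and search are awaited, so run them inside an async function.
(async () => {
    await model.createDocument({
        name: "Cake",
        description: "A very nice cake",
        id: 1,
        price: 5,
    });
    let results = await model.search("cake");
})();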
Note: the options provided to EclipseSearch cascade down to each ModelController, and from there to each function call. Any option supplied at a later stage overrides the inherited value.
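The override order in the note above behaves like a chain of object spreads. The sketch below is only an illustration of that idea; `resolveOptions` is a hypothetical helper, not part of EclipseSearch.

```javascript
// Later sources win: engine defaults < controller options < per-call options.
function resolveOptions(engineOptions, controllerOptions = {}, callOptions = {}) {
  return { ...engineOptions, ...controllerOptions, ...callOptions };
}

const resolvedOptions = resolveOptions(
  { k: 2, limit: 20 }, // given to the engine
  { k: 1.2 }, // given to one model controller
  { limit: 5 } // given to a single search call
);
console.log(resolvedOptions); // { k: 1.2, limit: 5 }
```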
This is the entry point to the module. It takes one parameter, a dictionary of options which will be passed to any spawned model controllers, to serve as default options.
const engine = new EclipseSearch(options);
This registers a new language model with the main engine instance, which can be shared across all model controllers. It takes two parameters: languageId and model. languageId must be a unique identifier for the language; the default languages use the ISO 639-1 code for each language. The model must be a dictionary of options containing the following arguments:
| Name | Optional | Description | Default Value |
| --- | --- | --- | --- |
| stemWords | true | Controls whether the model actually stems the words provided. | true |
| filterStopWords | true | Controls whether the model actually filters out the stop words. | true |
| punctuation | true | Punctuation to remove. | A list of common Latin-alphabet punctuation. |
| stopWords | false | List of stop words to filter. | - |
| stemFilter | false | Dictionary with keys of words to stem, and words to stem to as values. | - |
| splitter | true | Function to split text into words. Function is provided with text and must return an array of words. | Defaults to in-built function which splits words along whitespace. |
| filter | true | Function to filter stop words. Function is provided with a list of words to remove stop words from, and a list of stop words, and must return the list of words without stop words. | Defaults to in-built function. |
| stem | true | Function to stem words. Function is given a list of words and the value stored under stemFilter and must return the array of words with any relevant words stemmed. | Defaults to in-built function. |
| removePunctuation | true | Function to remove punctuation. Function is provided with text and a list of punctuation and must return text without punctuation. | Defaults to in-built function. |
Two languages are already registered at the moment, English (en) and German (de).
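As an illustration of the options in the table above, a language model dictionary might look like the sketch below. The option names come from the table; the word lists and the hyphen-aware splitter are hypothetical, and the call that registers the dictionary is left out because its name is not given here.

```javascript
// Hypothetical language model options using the documented option names.
const customLanguage = {
  stemWords: true,
  filterStopWords: true,
  stopWords: ["the", "a", "of"],
  stemFilter: { running: "run", cakes: "cake" }, // word -> stem
  // Custom splitter: receives text, returns an array of words.
  // Splits on hyphens as well as whitespace.
  splitter: (text) => text.split(/[\s-]+/).filter((word) => word.length > 0),
};

console.log(customLanguage.splitter("gluten-free cakes")); // [ 'gluten', 'free', 'cakes' ]
```

Options left out of the dictionary (such as filter, stem and removePunctuation) fall back to the in-built functions listed in the table.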
The getLanguage function takes in an ISO 639-1 code for a language and returns the language model registered for that language, if there is one. This allows you to edit the properties of the language model, to change how that language is processed by EclipseSearch.
const EnglishModel = EclipseSearch.getLanguage("en");
The model function takes in a model schema for a document, and a set of options, and returns a functioning model controller for that document design.
A schema must be provided to this function. A schema is a dictionary, with the keys being the names of indexes (i.e. attributes in the documents that will be provided) and the values being either a string denoting the type of index, or a dictionary of options.
Each index can be provided with the following options:
| Name | Optional | Description | Default Value |
| --- | --- | --- | --- |
| type | false | Specifies the type of index for the content that will be provided. The only options currently supported are "text" and "id". If the field is "id", it must be a unique identifier for the document. Every schema must have exactly one ID field. | - |
| weight | true | Controls the weight for that index, which is used by the ranking algorithm. | 1 |
| language | true | Specifies a language for the field, which will be used when processing content provided for this field. This overrides the language for the whole document, just for this specific field. | Language for whole document. |
| index | true | Whether to actually index the provided content. If true, this index is searchable. | true |
| return | true | Determines whether this field in the relevant document will be returned when a document is found in a query. | Defaults to option provided to whole controller. |
| b | true | Controls the b-value for the field, as used by the ranking algorithm. | Defaults to option provided to whole controller. |
The model function can also be provided with a dictionary of options, as follows:
| Name | Optional | Description | Default Value |
| --- | --- | --- | --- |
| pw | true | Specifies the value of pw, as used by the ranking algorithm. | 0.3 |
| k | true | Specifies the value of k, as used by the ranking algorithm. | 2 |
| defaultBValue | true | Value of b for an index, if no more specific value is provided for that index. | 0.5 |
| returnByDefault | true | Whether to return a field by default, if no more specific option is provided for said field. | false |
| defaultLanguage | true | ID of the language to default to, if no more specific language is provided for an index. | en |
| indexingStyle | true | Determines how each word is tokenized for partial-word search. Options are "aggressive", which returns tokens for every possible consecutive set of characters in a word; "start", which only returns consecutive sets of characters that include the first letter; "end", which does the same as "start" but for sets that include the last letter; "start-end", which combines the behaviour of "start" and "end"; and "whole", which does not add extra tokens to the words. The more tokens the tokenizer returns for each word, the worse the performance. | aggressive |
| lowercase | true | Whether to convert all provided content to lowercase, allowing case-insensitive search. | true |
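The indexingStyle options can be illustrated with a small tokenizer. This is a sketch of the behaviour described in the table, written from that description; the `tokenize` function is hypothetical and is not EclipseSearch's internal tokenizer.

```javascript
// Returns the partial-word tokens for one word under a given indexing style.
function tokenize(word, style = "aggressive") {
  const tokens = new Set();
  const n = word.length;
  for (let start = 0; start < n; start++) {
    for (let end = start + 1; end <= n; end++) {
      const keep =
        style === "aggressive" || // every consecutive run of characters
        (style === "start" && start === 0) || // runs containing the first letter
        (style === "end" && end === n) || // runs containing the last letter
        (style === "start-end" && (start === 0 || end === n)) ||
        (style === "whole" && start === 0 && end === n); // the word itself only
      if (keep) tokens.add(word.slice(start, end));
    }
  }
  return [...tokens];
}

console.log(tokenize("cake", "start")); // [ 'c', 'ca', 'cak', 'cake' ]
console.log(tokenize("cake", "end")); // [ 'cake', 'ake', 'ke', 'e' ]
console.log(tokenize("cake", "whole")); // [ 'cake' ]
console.log(tokenize("cake", "aggressive").length); // 10
```

The token counts show the trade-off: for a four-letter word, "aggressive" stores 10 tokens, "start-end" 7, "start" or "end" 4, and "whole" 1.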
The createDocument function takes in a document, in the form of a dictionary, with the keys being the index names and the values being the relevant content for each index. A unique ID for the document being created must be provided in the ID field.
Deleting a document takes in the ID of the document, and the document itself, and removes it from the indexes, so that it is no longer returned in search results. If not every indexed field is also being returned, you must provide the document again so it can be properly removed from the engine.
The search function takes in a structured query and a set of options. The query can be constructed out of search terms and operators.
A search term can either be a single string, causing all indexes to be searched, or a dictionary with keys being the name of specific indexes and the values being the data to search for in that index.
These search terms can then be combined with operators and operands. An operator is a dictionary that contains the operator (and or or) and a list of operands, which are either operators or search terms.
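The nesting of operators and search terms can be sketched with a toy evaluator. The query shape used here (`operator` and `operands` keys), the substring matching and the sample documents are all assumptions made for illustration; the exact key names EclipseSearch expects are not shown in this section.

```javascript
// Hypothetical sample documents.
const sampleDocs = [
  { id: 1, name: "cake", description: "a very nice cake" },
  { id: 2, name: "bread", description: "goes well with cake" },
  { id: 3, name: "pie", description: "a very nice pie" },
];

// A search term is a string (search every field) or a dictionary of field -> value.
function matchTerm(term, doc) {
  if (typeof term === "string") {
    return Object.entries(doc).some(
      ([key, value]) => key !== "id" && String(value).includes(term)
    );
  }
  return Object.entries(term).every(([field, value]) =>
    String(doc[field] ?? "").includes(value)
  );
}

// An operator node combines its operands, which may themselves be operators.
function matches(query, doc) {
  if (typeof query === "object" && query !== null && Array.isArray(query.operands)) {
    const results = query.operands.map((operand) => matches(operand, doc));
    return query.operator === "and" ? results.every(Boolean) : results.some(Boolean);
  }
  return matchTerm(query, doc);
}

// name contains "pie", OR ("nice" anywhere AND name contains "cake")
const sampleQuery = {
  operator: "or",
  operands: [{ name: "pie" }, { operator: "and", operands: ["nice", { name: "cake" }] }],
};
const matchedIds = sampleDocs.filter((doc) => matches(sampleQuery, doc)).map((doc) => doc.id);
console.log(matchedIds); // [ 1, 3 ]
```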
The options that can be provided to the search function are as follows:
| Name | Optional | Description | Default Value |
| --- | --- | --- | --- |
| language | true | This overrides the default processing language for the search query when processing general, single search terms. | Default processing language for the controller. |
| limit | true | Limits the number of results that can be returned. | 20 |
| fuzzySearch | true | Whether to use fuzzy search or not. | false |
| searchDepth | true | Maximum Levenshtein distance query terms can be from search terms. | 2 |
| rank | true | Whether to rank documents with ranking algorithm. | true |
Author: Tom
The npm package eclipsesearch receives a total of 0 weekly downloads. As such, eclipsesearch popularity was classified as not popular.
We found that eclipsesearch demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.