    eclipsesearch

A quick, well-featured full-text search engine that works in-browser and in Node.js.


Eclipse Search

Quick search. Great features. Zero Dependencies. Works in-browser. Permissive License.

Features | Design | Credit | License

EclipseSearch is a full-text search engine written entirely in JavaScript. This means it can run on the back-end (e.g. in Node.js) and on the front-end (in the browser).

View the online demos here!

Features

Partial Word Query and Weighted Fields

You can search your database with queries that weight different fields of a structured document differently. This is useful because users typically want matches in titles prioritised over matches in, say, descriptions.

Partial word search is also supported without the need for more expensive fuzzy search.


Fuzzy Search

Fuzzy search makes queries tolerant of errors, particularly spelling mistakes, giving applications a better user experience.

The parameters that fuzzy search relies on, such as the maximum Levenshtein distance, can be changed to suit your application.


Fine-tune ranking algorithm

You can easily fine-tune the constants in the BM25F-based ranking algorithm (in fuzzy and regular search), allowing it to be configured to your specific purpose.

Both the b-value (per field in a structured document) and the k-value can be adjusted, so the ranking algorithm can be tuned to your data.


Structured Search and Operators

Structured search can be used to search for different terms in specified fields, or terms in only specified fields, allowing for more accurate and specific search.

Operators can also be used to combine queries, to create even more detailed and advanced queries. Multiple operators exist including "OR" and "AND".


Multiple Languages Supported

Multiple languages can be specified and used to parse documents and queries, meaning many languages can be potentially supported.

Each different language can have a different stemmer, list of stop words, word splitter and punctuation remover.


Overview

Searching with EclipseSearch requires starting a model controller, which is provided with a document schema outlining the design of the document. Each model controller is isolated from other model controllers. Searching can then be performed on these controllers.

When documents are provided to a model controller, they are first parsed with the chosen language. This consists of passing them through a pipeline that includes a punctuation remover, a word splitter, a stop word remover and a stemmer. Each word is then put through a tokenizer, so that partial word matching can be performed. Each document must have a unique document ID. The processed document is then stored in indexes. If configured to do so, the engine will also store selected parts of the documents to return to the user on a search. If not, only the IDs of the documents will be returned.
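The parsing pipeline described above can be sketched as follows. This is a minimal illustration only: the function names, the punctuation list, the stop words and the stem dictionary are all assumptions made for the example, not EclipseSearch's internal code.

```javascript
// Hypothetical sketch of the parsing pipeline; none of these names
// or word lists are taken from EclipseSearch itself.
const PUNCTUATION = [".", ",", "!", "?", ";", ":", "\"", "'", "(", ")"];
const STOP_WORDS = ["a", "an", "the", "is", "of"];
const STEM_FILTER = { cakes: "cake", baking: "bake" }; // word -> stem

// Drop every character that appears in the punctuation list.
function removePunctuation(text, punctuation) {
  return [...text].filter((ch) => !punctuation.includes(ch)).join("");
}

// Split on whitespace and discard empty strings.
function splitWords(text) {
  return text.split(/\s+/).filter((word) => word.length > 0);
}

// Remove any word found in the stop word list.
function filterStopWords(words, stopWords) {
  return words.filter((word) => !stopWords.includes(word));
}

// Replace a word with its stem when the dictionary has an entry for it.
function stemWords(words, stemFilter) {
  return words.map((word) =>
    Object.prototype.hasOwnProperty.call(stemFilter, word) ? stemFilter[word] : word
  );
}

// Run the four stages in the order given in the text.
function parse(text) {
  const cleaned = removePunctuation(text.toLowerCase(), PUNCTUATION);
  const words = splitWords(cleaned);
  return stemWords(filterStopWords(words, STOP_WORDS), STEM_FILTER);
}

console.log(parse("The cakes, baking!")); // [ 'cake', 'bake' ]
```

Each surviving word would then be handed to the tokenizer for partial-word indexing.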

Searches can then be performed. A ranking algorithm, as detailed below, is applied to each document to determine the order in which to return them and, if a limit is applied, which documents to return. This ranking algorithm can be disabled. Different fields in the document can have different weights (the significance of terms found there) and whole documents can have different boosts (increasing the score for the entire document if search terms are found in it).

Ranking Algorithm

Given a query $Q$, which contains the terms $q_1, \dots, q_n$, the engine first fetches any indexed documents that contain any of the terms, ranks each with the ranking algorithm, and then returns the documents with the highest scores. EclipseSearch uses a ranking algorithm based on BM25F, a modified version of BM25 that allows different fields to have different weights, as seen below:

where $\text{IDF}(q_i)$ is defined as the following:

where $B_D$ is the boost applied to the document $D$, $k$ is a constant that has to be greater than 0 and is set to 2 by default, $N$ is the total number of documents, $n(q_i)$ is the number of documents in which $q_i$ occurs, $f(q_i, D, c)$ is the number of times the term $q_i$ occurs in the field $c$ of the document $D$, $w_c$ is the weight applied to the field $c$, $b_c$ is a constant for the field $c$, which is set between 0 and 1, $l_c$ is the number of terms in the field $c$ of the document and $avgl_c$ is the average length of the field $c$ across all documents.
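For reference, the textbook BM25F form that matches these definitions (document boost $B_D$, constant $k$, field weights $w_c$, field constants $b_c$, field lengths $l_c$ and average lengths $avgl_c$) can be written as below. This is a sketch under that assumption: the exact expression used by EclipseSearch may differ, for example in its IDF variant or in where the boost is applied.

```latex
\[
\text{score}(D, Q) = B_D \sum_{i=1}^{n} \text{IDF}(q_i) \cdot
  \frac{\tilde{f}(q_i, D)}{k + \tilde{f}(q_i, D)},
\qquad
\tilde{f}(q_i, D) = \sum_{c} \frac{w_c \, f(q_i, D, c)}{1 - b_c + b_c \, \frac{l_c}{avgl_c}}
\]
\[
\text{IDF}(q_i) = \ln\!\left( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \right)
\]
```

In this form, $b_c = 0$ switches length normalisation off for the field $c$, and a small $k$ makes the per-term contribution saturate quickly.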

As said above, $b_c$ is set as $0 \le b_c \le 1$. The larger the value of $b_c$ for a field, the more aggressively the search normalizes the frequency of each term by the length of the field. For example, in the movies demo, if you search for "d" with a $b$ value of 1, it will return "Addicted" first, as that has the highest density of the search term relative to its number of words. If you search for "d" with a $b$ value of 0, it returns "Hi Diddle Diddle", as that has the highest number of occurrences of the search term, regardless of the number of words.

As said above, $k$ has to be $k > 0$. A lower value of $k$ means the algorithm gives a high score to documents that cover many of the terms, whereas a high value of $k$ means the algorithm gives a high score to the documents with the highest term frequency.
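The effect of the b-value can be checked with a small sketch. It assumes the textbook BM25F form (an IDF factor multiplied by a saturating, length-normalised weighted term frequency); the function names and the numbers are illustrative and are not taken from EclipseSearch.

```javascript
// Textbook BM25F pieces; an assumed form, not EclipseSearch's internals.
function idf(totalDocs, docsWithTerm) {
  return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// Weighted, length-normalised frequency of one term across a document's fields.
function weightedFrequency(fields) {
  return fields.reduce((sum, { count, weight, b, length, avgLength }) => {
    return sum + (weight * count) / (1 - b + b * (length / avgLength));
  }, 0);
}

// Score contribution of a single query term for one document.
function scoreTerm(fields, { k, totalDocs, docsWithTerm, boost = 1 }) {
  const tf = weightedFrequency(fields);
  return boost * idf(totalDocs, docsWithTerm) * (tf / (k + tf));
}

const corpus = { k: 2, totalDocs: 100, docsWithTerm: 10 };
// Helper building a one-field document with an average field length of 2 words.
const title = (count, length, b) => [{ count, weight: 1, b, length, avgLength: 2 }];

// Short title: 3 matches in 1 word. Long title: 6 matches in 3 words.
const shortB1 = scoreTerm(title(3, 1, 1), corpus);
const longB1 = scoreTerm(title(6, 3, 1), corpus);
const shortB0 = scoreTerm(title(3, 1, 0), corpus);
const longB0 = scoreTerm(title(6, 3, 0), corpus);

console.log(shortB1 > longB1); // true: b = 1 rewards density
console.log(longB0 > shortB0); // true: b = 0 rewards raw counts
```

With b = 1 the short, dense title ranks first, and with b = 0 the longer title with more raw matches does, which mirrors the "Addicted" and "Hi Diddle Diddle" behaviour described above.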

The fuzzy search ranking algorithm is slightly different:

All notation is the same as in the above equation, apart from the following additions: $\text{fuzzy}(q_i)$ returns the set of possible words that are within the maximum Levenshtein distance of the term $q_i$, $\text{lev}(a, b)$ returns the Levenshtein distance between the two provided arguments, and $pw$ is a constant that determines how much a term's Levenshtein distance matters for ranking. The only logical values for $pw$ lie between 0 and 1, with a value closer to 0 meaning a more aggressive adjustment for distance.
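A minimal sketch of the fuzzy matching step, assuming that a candidate word's contribution is scaled by pw raised to the power of its Levenshtein distance. That scaling is an assumption consistent with the description above, not a confirmed detail of the engine, and `fuzzyCandidates` is a hypothetical helper name.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping one row at a time.
function levenshtein(a, b) {
  // prev[j] is the distance between the processed prefix of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: keep vocabulary words within searchDepth edits of the
// term and attach the assumed ranking factor pw^distance.
function fuzzyCandidates(term, vocabulary, searchDepth, pw) {
  return vocabulary
    .map((word) => ({ word, distance: levenshtein(term, word) }))
    .filter(({ distance }) => distance <= searchDepth)
    .map(({ word, distance }) => ({ word, distance, factor: Math.pow(pw, distance) }));
}

const candidates = fuzzyCandidates("cale", ["cake", "cakes", "bread"], 2, 0.3);
console.log(candidates.map(({ word, distance }) => `${word}:${distance}`));
// [ 'cake:1', 'cakes:2' ]
```

Under this assumption an exact match keeps a factor of 1, and each additional edit multiplies the factor by pw, so a smaller pw penalises distant matches more heavily.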

Usage

If you want to use EclipseSearch with Node.js, you can install it through npm, like so:

$ npm i eclipsesearch

And then require it in a file:

const EclipseSearch = require("eclipsesearch");

Alternatively, you can use it in the browser, by adding the following script before (in the HTML) any other script referencing EclipseSearch:

<script src="https://unpkg.com/eclipsesearch/dist/EclipseSearch.min.js"></script>

Basic usage could then follow like so:

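const engine = new EclipseSearch();

// Create a model controller from a document schema.
const model = engine.model({
  name: {
    type: "text",
    weight: 10,
    language: "en",
    return: true,
  },
  description: {
    type: "text",
    return: true,
  },
  price: {
    index: false,
    return: true,
  },
  id: "id",
});

// await is only valid inside an async function in CommonJS and classic scripts.
async function main() {
  // Index a document, then search for it.
  await model.createDocument({
    name: "Cake",
    description: "A very nice cake",
    id: 1,
    price: 5,
  });

  const results = await model.search("cake");
  console.log(results);
}

main();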

Documentation

Note: the options provided to EclipseSearch waterfall to each ModelController, and from there to each function. Any alternative provided at a later stage of this waterfall overrides the inherited option.
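The waterfall can be pictured as a merge in which later stages win. The sketch below is an assumption about the behaviour described in the note, not the library's implementation; `resolveOptions` is a hypothetical helper, while `k` and `limit` are option names documented further down.

```javascript
// Hypothetical helper: later arguments override earlier ones, key by key.
function resolveOptions(engineOptions, controllerOptions, callOptions) {
  return { ...engineOptions, ...controllerOptions, ...callOptions };
}

const resolved = resolveOptions(
  { k: 2, limit: 20 }, // defaults given to the engine
  { k: 1.2 },          // override given to one model controller
  { limit: 5 }         // override given to a single search call
);

console.log(resolved); // { k: 1.2, limit: 5 }
```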

EclipseSearch

This is the entry point to the module. It takes one parameter, a dictionary of options which will be passed to any spawned model controllers, to serve as default options.

const engine = new EclipseSearch(options);

EclipseSearch.registerLanguage

This registers a new language model with the main engine instance, which can be shared across all model controllers. It takes two parameters: languageId and model. languageId must be a unique identifier for the language; the default languages use the ISO 639-1 code for each language. The model must be a dictionary of options containing the following arguments:

| Name | Optional | Description | Default Value |
| --- | --- | --- | --- |
| stemWords | true | Controls whether the model actually stems the words provided. | true |
| filterStopWords | true | Controls whether the model actually filters out the stop words. | true |
| punctuation | true | Punctuation to remove. | A list of common Latin-alphabet punctuation. |
| stopWords | false | List of stop words to filter. | - |
| stemFilter | false | Dictionary with keys of words to stem, and words to stem to as values. | - |
| splitter | true | Function to split text into words. Function is provided with text and must return an array of words. | Defaults to in-built function which splits words along whitespace. |
| filter | true | Function to filter stop words. Function is provided with a list of words to remove stop words from, and a list of stop words, and must return the list of words without stop words. | Defaults to in-built function. |
| stem | true | Function to stem words. Function is given a list of words and the value stored under stemFilter and must return the array of words with any relevant words stemmed. | Defaults to in-built function. |
| removePunctuation | true | Function to remove punctuation. Function is provided with text and a list of punctuation and must return text without punctuation. | Defaults to in-built function. |

Two languages are already registered at the moment, English (en) and German (de).
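As an illustration, a language model for a third language could be assembled from the options in the table above. The sketch below uses only documented option names, but the French word lists are made up for the example, and the final registration call is left as a comment because its exact call shape is assumed from the description rather than verified.

```javascript
// Illustrative language model; the word lists are examples, not a real French setup.
const frenchModel = {
  stopWords: ["le", "la", "les", "un", "une", "de"],
  stemFilter: { "gâteaux": "gâteau" },
  punctuation: [".", ",", "!", "?", "'"],
  // Receives text and returns an array of words.
  splitter: (text) => text.split(/\s+/).filter((word) => word.length > 0),
  // Receives a list of words and a list of stop words, returns the words that remain.
  filter: (words, stopWords) => words.filter((word) => !stopWords.includes(word)),
};

// The custom functions can be checked on their own before registering the model.
const words = frenchModel.splitter("un  gâteau de chocolat");
console.log(frenchModel.filter(words, frenchModel.stopWords)); // [ 'gâteau', 'chocolat' ]

// Assumed call shape, based on the description above:
// engine.registerLanguage("fr", frenchModel);
```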

EclipseSearch.getLanguage

This function takes in an ISO 639-1 code for a language and returns the language model registered for that language, if there is one. This allows you to edit the properties of the language model and so change how that language is processed by EclipseSearch.

const EnglishModel = EclipseSearch.getLanguage("en");

EclipseSearch.model

This function takes in a model schema for a document, and a set of options, and returns a functioning model controller for that document design.

A schema must be provided to this function. A schema is a dictionary, with the keys being the names of indexes (e.g. attributes in the documents that will be provided) and the values either a string denoting the type of index, or a dictionary of options.

Each index can be provided with the following options:

| Name | Optional | Description | Default Value |
| --- | --- | --- | --- |
| type | false | Specifies the type of index for the content that will be provided. The only options currently supported are "text" and "id". If the field is "id", it must be a unique identifier for the document. Every schema must have an ID field. There can only be one ID field. | - |
| weight | true | Controls the weight for that index, which is used by the ranking algorithm. | 1 |
| language | true | Specifies a language for the field, which will be used when processing content provided for this field. This can override the language for the whole document, just for the specific field. | Language for whole document. |
| index | true | Whether to actually index the provided content. If true, this index is searchable. | true |
| return | true | Determines whether this field in the relevant document will be returned when a document is found in a query. | Defaults to option provided to whole controller. |
| b | true | Controls the b-value for the field, as used by the ranking algorithm. | Defaults to option provided to whole controller. |

The model function can also be provided with a dictionary of options, as follows:

| Name | Optional | Description | Default Value |
| --- | --- | --- | --- |
| pw | true | Specifies the value of pw, as used by the ranking algorithm. | 0.3 |
| k | true | Specifies the value of k, as used by the ranking algorithm. | 2 |
| defaultBValue | true | Value of b for an index, if no more specific value is provided for that index. | 0.5 |
| returnByDefault | true | Whether to return a field by default, if no more specific option is provided for said field. | false |
| defaultLanguage | true | ID of the language to default to, if no more specific language is provided for an index. | en |
| indexingStyle | true | Determines how each word is tokenized for partial-word search. Options are "aggressive", which returns tokens for every possible consecutive set of characters in a word; "start", which only returns consecutive sets of characters that include the first letter; "end", which does the same as "start" but for sets that include the last letter; "start-end", which combines the behaviour of "start" and "end"; and "whole", which does not add extra tokens to the words. The more tokens the tokenizer returns for each word, the worse the performance. | aggressive |
| lowercase | true | Whether to convert all provided content to lowercase, allowing case-insensitive search. | true |
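The indexingStyle options can be illustrated with a small tokenizer sketch. This follows the descriptions in the table rather than the engine's source, and `tokenize` is a hypothetical function name.

```javascript
// Hypothetical tokenizer following the indexingStyle descriptions above.
function tokenize(word, style) {
  const tokens = new Set();
  for (let start = 0; start < word.length; start++) {
    for (let end = start + 1; end <= word.length; end++) {
      const fromStart = start === 0;       // substring includes the first letter
      const toEnd = end === word.length;   // substring includes the last letter
      const keep =
        style === "aggressive" ||
        (style === "start" && fromStart) ||
        (style === "end" && toEnd) ||
        (style === "start-end" && (fromStart || toEnd)) ||
        (style === "whole" && fromStart && toEnd);
      if (keep) tokens.add(word.slice(start, end));
    }
  }
  return [...tokens];
}

console.log(tokenize("cake", "start")); // [ 'c', 'ca', 'cak', 'cake' ]
console.log(tokenize("cake", "end"));   // [ 'cake', 'ake', 'ke', 'e' ]
console.log(tokenize("cake", "whole")); // [ 'cake' ]
console.log(tokenize("cake", "aggressive").length); // 10
```

A four-letter word yields 10 tokens under "aggressive" but only 4 under "start", which is why the more aggressive styles cost more in index size and performance.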

ModelController.createDocument

createDocument takes in a document in the form of a dictionary, with the keys being the index names and the values being the relevant content for each index. A unique ID for the document being created must be provided for the ID field.

ModelController.delete

Takes in the ID of a document, and the document itself, and deletes it from the indexes so that it is no longer returned in search results. If not all of the fields that are being indexed are being returned, you must provide the document again so that it can be properly removed from the engine.

ModelController.search

The search function takes in a structured query and a set of options. The query can be constructed out of search terms and operators.

A search term can either be a single string, causing all indexes to be searched, or a dictionary with keys being the name of specific indexes and the values being the data to search for in that index.

These search terms can then be combined with operators and operands. An operator is a dictionary that contains the operator ("and" or "or") and a list of operands, each of which is either another operator or a search term.
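To make the nesting concrete, the sketch below evaluates such a query tree against a toy inverted index. The key names `operator` and `operands`, the index data, and the choice to treat a multi-field search term as "any listed field matches" are all assumptions made for illustration; they are not confirmed by the documentation above.

```javascript
// Toy inverted index: field -> term -> document IDs (illustrative data only).
const index = {
  name: { cake: [1, 2], bread: [3] },
  description: { chocolate: [2], nice: [1, 3] },
};

// IDs of documents containing `term` in `field`, or in any field when field is null.
function lookup(field, term) {
  const fields = field === null ? Object.keys(index) : [field];
  return new Set(fields.flatMap((name) => (index[name] && index[name][term]) || []));
}

function union(sets) {
  return new Set(sets.flatMap((set) => [...set]));
}

function intersection(sets) {
  if (sets.length === 0) return new Set();
  return sets.reduce((acc, set) => new Set([...acc].filter((id) => set.has(id))));
}

// Hypothetical query shape: a string, a { field: term } dictionary,
// or { operator: "and" | "or", operands: [...] }.
function evaluate(query) {
  if (typeof query === "string") return lookup(null, query);
  if (Array.isArray(query.operands)) {
    const results = query.operands.map(evaluate);
    return query.operator === "or" ? union(results) : intersection(results);
  }
  // Field-specific search term, e.g. { name: "cake" }.
  return union(Object.entries(query).map(([field, term]) => lookup(field, term)));
}

const query = {
  operator: "and",
  operands: [{ name: "cake" }, { operator: "or", operands: ["chocolate", "nice"] }],
};

console.log([...evaluate(query)].sort()); // [ 1, 2 ]
```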

The options that can be provided to the search function are as follows:

| Name | Optional | Description | Default Value |
| --- | --- | --- | --- |
| language | true | This overrides the default processing language for the search query when processing general, single search terms. | Default processing language for the controller. |
| limit | true | Limits the number of results that can be returned. | 20 |
| fuzzySearch | true | Whether to use fuzzy search or not. | false |
| searchDepth | true | Maximum Levenshtein distance query terms can be from search terms. | 2 |
| rank | true | Whether to rank documents with the ranking algorithm. | true |

Credit

Author: Tom

License

MIT

Last updated on 23 Jan 2022