New Research: Supply Chain Attack on Axios Pulls Malicious Dependency from npm.Details →
Socket
Book a DemoSign in
Socket

minhash

Package Overview
Dependencies
Maintainers
1
Versions
6
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

minhash

Find similar documents quickly

latest
npmnpm
Version
0.0.9
Version published
Weekly downloads
3.2K
-0.75%
Maintainers
1
Weekly downloads
 
Created
Source
                            _           _                     _             _
                _ __ ___   (_)  _ __   | |__     __ _   ___  | |__         (_)  ___
               | '_ ` _ \  | | | '_ \  | '_ \   / _` | / __| | '_ \        | | / __|
               | | | | | | | | | | | | | | | | | (_| | \__ \ | | | |  _    | | \__ \
               |_| |_| |_| |_| |_| |_| |_| |_|  \__,_| |___/ |_| |_| (_)  _/ | |___/
                                                                         |__/

Build Status

Minhashing is an efficient similarity estimation technique that is often used to identify near-duplicate documents in large text collections. This package offers a JavaScript implementation of the minhash algorithm and an efficient Locality Sensitive Hashing Index for finding similar minhashes in Node.js or web applications.

Installation

To get started with Minhash.js, you can install the package with npm:

npm install minhash --save

If you prefer, you can instead load the package directly in a browser:

<script src='https://rawgit.com/duhaime/minhash/master/minhash.min.js' />

Minhash Usage

Minhashes are hash representations of the contents within a set. The following example minhashes and then estimates the Jaccard similarity between two sets:

import { Minhash } from 'minhash'; // If using Node.js

var s1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets'];
var s2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents'];

// create a hash for each set of words to compare
var m1 = new Minhash();
var m2 = new Minhash();

// update each hash
s1.map(function(w) { m1.update(w) });
s2.map(function(w) { m2.update(w) });

// estimate the jaccard similarity between two minhashes
m1.jaccard(m2);

LshIndex Usage

While one can compare the Jaccard similarity between a minhash and all others in a collection, the complexity of doing so is O(n), as one needs to compare the query set to every other set.

To estimate the results of the same comparison in sub-linear time, one can instead build a Locality Sensitive Hash Index, which maps hash sequences from a minhash signature to the list of document identifiers that contain the given hash sequence. Using this indexing technique, one can effectively find sets similar to a query set:

import { Minhash, LshIndex } from 'minhash'; // If using Node.js

var s1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets'];
var s2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents'];
var s3 = ['cats', 'are', 'tall', 'and', 'have', 'been',
        'known', 'to', 'sing', 'quite', 'loudly'];

// generate a hash for each list of words
var m1 = new Minhash();
var m2 = new Minhash();
var m3 = new Minhash();

// update each hash
s1.map(function(w) { m1.update(w) });
s2.map(function(w) { m2.update(w) });
s3.map(function(w) { m3.update(w) });

// add each document to a Locality Sensitive Hashing index
var index = new LshIndex();
index.insert('m1', m1);
index.insert('m2', m2);
index.insert('m3', m3);

// query for documents that appear similar to a query document
var matches = index.query(m1);
console.log('Jaccard similarity >= 0.5 to m1:', matches);

Example

The sample application uses minhash.js to compute the similarity between several sample documents:

app preview

There is also a sample Node.js script that can be run with node examples/index.js.

Development

To run the test suite — npm run test.
To compile and minify minhash.min.js — npm run build.

Keywords

minhash

FAQs

Package last updated on 02 Jun 2018

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts