Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

github.com/mfonda/simhash

Package Overview
Dependencies
Alerts
File Explorer
Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

github.com/mfonda/simhash

  • v0.0.0-20151007195837-79f94a1100d6
  • Source
  • Go
  • Socket score

Version published
Created
Source

simhash

simhash is a Go implementation of Charikar's simhash algorithm.

simhash is a hash with the useful property that similar documents produce similar hashes. Therefore, if two documents are similar, the Hamming-distance between the simhash of the documents will be small.

This package currently just implements the simhash algorithm. Future work will make use of this package to enable quickly identifying near-duplicate documents within a large collection of documents.

Installation

go get github.com/mfonda/simhash

Usage

Using simhash first requires tokenizing a document into a set of features (done through the FeatureSet interface). This package provides an implementation, WordFeatureSet, which breaks tokenizes the document into individual words. Better results are possible here, and future work will go towards this.

Example usage:

package main

import (
	"fmt"
	"github.com/mfonda/simhash"
)

func main() {
	var docs = [][]byte{
		[]byte("this is a test phrase"),
		[]byte("this is a test phrass"),
		[]byte("foo bar"),
	}

	hashes := make([]uint64, len(docs))
	for i, d := range docs {
		hashes[i] = simhash.Simhash(simhash.NewWordFeatureSet(d))
		fmt.Printf("Simhash of %s: %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
}

Output:

Simhash of this is a test phrase: 8c3a5f7e9ecb3f35
Simhash of this is a test phrass: 8c3a5f7e9ecb3f21
Simhash of foo bar: d8dbe7186bad3db3
Comparison of `this is a test phrase` and `this is a test phrass`: 2
Comparison of `this is a test phrase` and `foo bar`: 29

FAQs

Package last updated on 07 Oct 2015

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc