github.com/evanleleux/simhash


Source: Go Modules
Version: v0.0.0-20260222191741-5756091b4592

simhash

Small, production-quality Go module for fast 64-bit SimHash on HTML documents.

It includes:

  • Core streaming SimHash hasher (Hasher64)
  • Visible-text HTML tokenization (TokenizeHTMLText)
  • DOM-structure tokenization (TokenizeHTMLDOM)
  • Comparison helpers (Hamming64, Similarity64)

Install

go get github.com/evanleleux/simhash

Quickstart

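package main

import (
	"fmt"
	"log"

	"github.com/evanleleux/simhash"
)

func main() {
	a := []byte(`<html><body><h1>Checkout</h1><p>Pay securely</p></body></html>`)
	b := []byte(`<html><body><h1>Checkout</h1><p>Pay securely today</p></body></html>`)

	// Fingerprint the visible text of each document.
	h1, err := simhash.FingerprintHTMLText64(a)
	if err != nil {
		log.Fatalf("fingerprint a: %v", err)
	}
	h2, err := simhash.FingerprintHTMLText64(b)
	if err != nil {
		log.Fatalf("fingerprint b: %v", err)
	}

	fmt.Printf("h1=0x%016x h2=0x%016x\n", h1, h2)
	// A small Hamming distance (high similarity) suggests near-duplicate text.
	fmt.Printf("hamming=%d similarity=%.4f\n", simhash.Hamming64(h1, h2), simhash.Similarity64(h1, h2))
}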

API

  • FingerprintTokens64(tokens TokenStream, opts ...Option) (uint64, error)
  • FingerprintHTMLText64(html []byte, opts ...Option) (uint64, error)
  • FingerprintHTMLDOM64(html []byte, opts ...Option) (uint64, error)
  • Hamming64(a, b uint64) int
  • Similarity64(a, b uint64) float64
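To make the two comparison helpers concrete, here is a minimal standard-library sketch of what such functions typically compute. The lowercase names are hypothetical stand-ins, and the linear normalization by 64 in similarity64 is an assumption; the package's Similarity64 may be defined differently.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming64 counts the bit positions at which a and b differ.
func hamming64(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarity64 maps the distance onto [0, 1]; 1 means identical fingerprints.
// Assumption: linear normalization by 64, which may not match the package.
func similarity64(a, b uint64) float64 {
	return 1 - float64(hamming64(a, b))/64
}

func main() {
	a, b := uint64(0xF0F0), uint64(0xF0FF)
	// The two values differ only in their lowest four bits.
	fmt.Printf("hamming=%d similarity=%.4f\n", hamming64(a, b), similarity64(a, b))
	// prints "hamming=4 similarity=0.9375"
}
```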

Hasher64 supports streaming accumulation without building token slices:

h := simhash.NewHasher64()
h.AddStringToken("checkout", 1)
h.AddStringToken("payment", 1)
fp := h.Sum64()
_ = fp
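For readers new to SimHash, the sketch below shows the general accumulation technique that a streaming hasher of this kind relies on: one signed counter per output bit, with each token voting on every bit. It is not the package's implementation. The sketchHasher type is hypothetical, and it uses FNV-1a from the standard library as a stand-in for the package's default xxhash.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sketchHasher keeps one signed counter per output bit (hypothetical type).
type sketchHasher struct {
	counts [64]int64
}

// addToken hashes the token and lets it vote on each bit with the given weight.
func (s *sketchHasher) addToken(token string, weight int64) {
	h := fnv.New64a() // stand-in hash; the package defaults to xxhash
	h.Write([]byte(token))
	sum := h.Sum64()
	for i := 0; i < 64; i++ {
		if sum&(uint64(1)<<uint(i)) != 0 {
			s.counts[i] += weight
		} else {
			s.counts[i] -= weight
		}
	}
}

// sum64 sets bit i when the weighted votes for that bit are positive.
func (s *sketchHasher) sum64() uint64 {
	var fp uint64
	for i, c := range s.counts {
		if c > 0 {
			fp |= uint64(1) << uint(i)
		}
	}
	return fp
}

func main() {
	var s sketchHasher
	for _, tok := range []string{"checkout", "payment"} {
		s.addToken(tok, 1)
	}
	fmt.Printf("fingerprint=0x%016x\n", s.sum64())
}
```

Because only the counters are kept, memory use stays constant no matter how many tokens are added, and the result does not depend on token order.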

HTML Tokenization Behavior

Visible text

  • Parses with golang.org/x/net/html (no regex parsing)
  • Ignores text under <script> and <style>
  • Best-effort hidden filtering (hidden, inline display:none, visibility:hidden) by default
  • Collapses whitespace by tokenizing into word tokens
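The hidden-node filtering and word tokenization described above can be approximated as follows. This is a standard-library sketch, not the package's code: looksHidden and wordTokens are hypothetical helpers, and the attribute map is an assumed representation of an element's attributes rather than the golang.org/x/net/html node type.

```go
package main

import (
	"fmt"
	"strings"
)

// looksHidden reports whether an element's attributes suggest it is not rendered.
// attrs maps attribute names to raw values (hypothetical representation).
func looksHidden(attrs map[string]string) bool {
	if _, ok := attrs["hidden"]; ok {
		return true
	}
	// Normalize the inline style so "display: none" and "DISPLAY:NONE" both match.
	style := strings.ToLower(attrs["style"])
	style = strings.Join(strings.Fields(style), "")
	return strings.Contains(style, "display:none") ||
		strings.Contains(style, "visibility:hidden")
}

// wordTokens splits text on runs of whitespace, which also collapses them.
func wordTokens(text string) []string {
	return strings.Fields(text)
}

func main() {
	fmt.Println(looksHidden(map[string]string{"style": "Display: none"})) // true
	fmt.Println(looksHidden(map[string]string{"class": "banner"}))        // false
	fmt.Println(wordTokens("  Pay \n securely\ttoday "))                  // [Pay securely today]
}
```

A check like this is best-effort by nature: it cannot see rules applied from external stylesheets or scripts.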

DOM structure

  • Emits path tokens like html/body/div/form/input
  • Ignores attributes by default (stable across changing classes/IDs)
  • Uses configurable max depth (default 8)
  • Can focus on form tags with WithDOMFormOnly(true)
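As an illustration of path tokens, the sketch below emits one slash-joined tag path per element and ignores attributes. It is an approximation under stated assumptions: it uses encoding/xml on well-formed markup instead of golang.org/x/net/html, the domPathTokens helper is hypothetical, and it assumes that elements deeper than the limit are skipped (the package may truncate paths instead).

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// domPathTokens walks well-formed markup and emits one tag path per element,
// skipping elements nested deeper than maxDepth. Attributes are ignored.
func domPathTokens(markup string, maxDepth int) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	var stack, tokens []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return tokens, nil
		}
		if err != nil {
			return nil, err
		}
		switch el := tok.(type) {
		case xml.StartElement:
			stack = append(stack, strings.ToLower(el.Name.Local))
			if len(stack) <= maxDepth {
				tokens = append(tokens, strings.Join(stack, "/"))
			}
		case xml.EndElement:
			// The decoder rejects unbalanced end tags, so the stack is non-empty here.
			stack = stack[:len(stack)-1]
		}
	}
}

func main() {
	page := `<html><body><div class="a"><form><input/></form></div></body></html>`
	tokens, err := domPathTokens(page, 8)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, t := range tokens {
		fmt.Println(t) // the last line printed is html/body/div/form/input
	}
}
```

Changing class="a" to any other value leaves the tokens untouched, which is why attribute-free paths stay stable across cosmetic markup changes.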

Options

  • WithHashFunc(HashFunc64) to override hashing function
  • WithWeightFunc(WeightFunc) to override token weighting
  • WithMaxTextBytes(n int) to cap visible text bytes processed
  • WithDOMMaxDepth(depth int) to cap emitted DOM depth
  • WithIgnoreHidden(enabled bool) to toggle hidden-node filtering
  • WithLowercaseTags(enabled bool) to toggle lowercasing tag names
  • WithDOMFormOnly(enabled bool) to emit only form-related DOM paths

Default token hash is github.com/cespare/xxhash/v2.
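The opts ...Option parameters follow Go's functional options pattern. The sketch below shows how that pattern usually works; the config struct and the lowercase option names are hypothetical and do not reflect the package's internal types. The defaults mirror the ones stated in this README (DOM depth 8, hidden filtering on).

```go
package main

import "fmt"

// config is a hypothetical stand-in for the package's internal settings.
type config struct {
	domMaxDepth  int
	ignoreHidden bool
	domFormOnly  bool
}

// option mutates a config; each With-style constructor returns one.
type option func(*config)

func withDOMMaxDepth(depth int) option {
	return func(c *config) { c.domMaxDepth = depth }
}

func withIgnoreHidden(enabled bool) option {
	return func(c *config) { c.ignoreHidden = enabled }
}

func withDOMFormOnly(enabled bool) option {
	return func(c *config) { c.domFormOnly = enabled }
}

// newConfig starts from defaults and applies options in order, so later options win.
func newConfig(opts ...option) config {
	c := config{domMaxDepth: 8, ignoreHidden: true}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	c := newConfig(withDOMMaxDepth(4), withDOMFormOnly(true))
	fmt.Printf("%+v\n", c) // {domMaxDepth:4 ignoreHidden:true domFormOnly:true}
}
```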

Threshold Guidance

For 64-bit SimHash, near-duplicate detection often starts around Hamming distance <= 3, but this is dataset-dependent. Tune thresholds on your corpus and objective (precision vs. recall).
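One way to tune a threshold is to label a sample of page pairs by hand and measure precision and recall at several candidate distances. The sketch below assumes a hypothetical labeledPair record, and the sample data is made up for illustration; it is not a measurement from any real corpus.

```go
package main

import "fmt"

// labeledPair is a hypothetical evaluation record: the Hamming distance between
// two fingerprints plus a human judgment of whether the pages are near-duplicates.
type labeledPair struct {
	distance  int
	duplicate bool
}

// precisionRecall treats every pair with distance <= threshold as a predicted duplicate.
func precisionRecall(pairs []labeledPair, threshold int) (precision, recall float64) {
	var tp, fp, fn int
	for _, p := range pairs {
		predicted := p.distance <= threshold
		switch {
		case predicted && p.duplicate:
			tp++
		case predicted && !p.duplicate:
			fp++
		case !predicted && p.duplicate:
			fn++
		}
	}
	if tp+fp > 0 {
		precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		recall = float64(tp) / float64(tp+fn)
	}
	return precision, recall
}

func main() {
	// Made-up sample data for illustration only.
	pairs := []labeledPair{{1, true}, {2, true}, {3, false}, {5, true}, {12, false}, {30, false}}
	for _, k := range []int{2, 3, 6} {
		p, r := precisionRecall(pairs, k)
		fmt.Printf("threshold=%d precision=%.2f recall=%.2f\n", k, p, r)
	}
}
```

Raising the threshold generally trades precision for recall, so pick the value that matches the cost of false matches versus missed duplicates in your application.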

Important Note

Do not shingle before SimHash in this workflow. Shingling is mainly useful for MinHash/Jaccard style similarity; this package is intended for direct token streams into SimHash.

Example Program

Run:

go run ./cmd/example [fileA.html fileB.html]

Without args, it compares:

  • https://evanleleux.dev/simhash/page-01 through https://evanleleux.dev/simhash/page-10

Default output includes per-page hashes plus adjacent-page similarity comparisons.

You can also pass two local file paths or URLs for direct pair comparison.


Package last updated on 22 Feb 2026
