
github.com/evanleleux/simhash
Small, production-quality Go module for fast 64-bit SimHash on HTML documents.
It includes:
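- A streaming hasher (Hasher64)
- An HTML visible-text tokenizer (TokenizeHTMLText)
- An HTML DOM tokenizer (TokenizeHTMLDOM)
- Distance and similarity helpers (Hamming64, Similarity64)

go get github.com/evanleleux/simhash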
package main

import (
	"fmt"
	"log"

	"github.com/evanleleux/simhash"
)

func main() {
	// Two near-identical pages: b adds one word to the paragraph.
	a := []byte(`<html><body><h1>Checkout</h1><p>Pay securely</p></body></html>`)
	b := []byte(`<html><body><h1>Checkout</h1><p>Pay securely today</p></body></html>`)
	h1, err := simhash.FingerprintHTMLText64(a)
	if err != nil {
		log.Fatal(err)
	}
	h2, err := simhash.FingerprintHTMLText64(b)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("h1=0x%016x h2=0x%016x\n", h1, h2)
	fmt.Printf("hamming=%d similarity=%.4f\n", simhash.Hamming64(h1, h2), simhash.Similarity64(h1, h2))
}
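The example above prints a Hamming distance and a similarity score for two fingerprints. As a rough sketch of what those two numbers mean, the standard-library version below counts differing bits and scales the result to the range 0 to 1. The lowercase function names are hypothetical stand-ins, and the formula similarity = 1 - distance/64 is an assumption; the package's Similarity64 may be defined differently.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming64 counts the bit positions where a and b differ.
func hamming64(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarity64 maps a Hamming distance onto [0, 1].
// Assumption: similarity = 1 - distance/64; the real package may differ.
func similarity64(a, b uint64) float64 {
	return 1 - float64(hamming64(a, b))/64
}

func main() {
	a := uint64(0xF0F0F0F0F0F0F0F0)
	b := uint64(0xF0F0F0F0F0F0F0F3) // differs from a in the two lowest bits
	fmt.Printf("hamming=%d similarity=%.4f\n", hamming64(a, b), similarity64(a, b))
}
```

Under this definition, identical fingerprints score 1 and fully inverted fingerprints score 0.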
- FingerprintTokens64(tokens TokenStream, opts ...Option) (uint64, error)
- FingerprintHTMLText64(html []byte, opts ...Option) (uint64, error)
- FingerprintHTMLDOM64(html []byte, opts ...Option) (uint64, error)
- Hamming64(a, b uint64) int
- Similarity64(a, b uint64) float64

Hasher64 supports streaming accumulation without building token slices:
h := simhash.NewHasher64()
h.AddStringToken("checkout", 1)
h.AddStringToken("payment", 1)
fp := h.Sum64()
_ = fp
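To make the streaming idea concrete, here is a minimal sketch of the general SimHash accumulation scheme: each token's hash votes on all 64 bit positions, and the fingerprint keeps the bits whose vote total is positive. It is not the package's implementation. The sketchHasher type and its methods are hypothetical, and it uses the standard library's FNV-1a hash in place of xxhash so that it runs without dependencies.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

// sketchHasher keeps one signed counter per output bit.
// Hypothetical type for illustration; not the package's Hasher64.
type sketchHasher struct {
	counts [64]int64
}

// addToken hashes the token and lets it vote on every bit:
// +weight where the token hash has a 1, -weight where it has a 0.
func (s *sketchHasher) addToken(token string, weight int64) {
	h := fnv.New64a() // swapped in for xxhash to stay stdlib-only
	h.Write([]byte(token))
	v := h.Sum64()
	for i := 0; i < 64; i++ {
		if v&(1<<uint(i)) != 0 {
			s.counts[i] += weight
		} else {
			s.counts[i] -= weight
		}
	}
}

// sum64 sets each output bit whose counter ended up positive.
func (s *sketchHasher) sum64() uint64 {
	var fp uint64
	for i, c := range s.counts {
		if c > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// fingerprint feeds tokens one at a time, each with weight 1.
func fingerprint(tokens ...string) uint64 {
	var s sketchHasher
	for _, t := range tokens {
		s.addToken(t, 1)
	}
	return s.sum64()
}

func main() {
	a := fingerprint("checkout", "pay", "securely")
	b := fingerprint("checkout", "pay", "securely", "today")
	fmt.Printf("a=0x%016x b=0x%016x hamming=%d\n", a, b, bits.OnesCount64(a^b))
}
```

Because only 64 counters are kept, memory use stays constant no matter how many tokens are added, and the result does not depend on token order.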
- Parsing uses golang.org/x/net/html (no regex parsing)
- <script> and <style> content is skipped
- Hidden nodes (hidden, inline display:none, visibility:hidden) are ignored by default
- DOM tokens are element paths such as html/body/div/form/input
- Emitted DOM depth is capped (default 8)
- Form-only DOM paths are available via WithDOMFormOnly(true)

Options:
- WithHashFunc(HashFunc64) to override hashing function
- WithWeightFunc(WeightFunc) to override token weighting
- WithMaxTextBytes(n int) to cap visible text bytes processed
- WithDOMMaxDepth(depth int) to cap emitted DOM depth
- WithIgnoreHidden(enabled bool) to toggle hidden-node filtering
- WithLowercaseTags(enabled bool) to toggle lowercasing tag names
- WithDOMFormOnly(enabled bool) to emit only form-related DOM paths

Default token hash is github.com/cespare/xxhash/v2.
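The With... options, together with the `opts ...Option` parameters in the function signatures, suggest Go's functional options pattern. The sketch below shows how that pattern typically works, using three of the options as models. The config struct, its field names, the lowercase helper names, and the default values are all assumptions for illustration, not the package's internals.

```go
package main

import "fmt"

// config holds hypothetical settings; the field names are assumptions.
type config struct {
	maxDepth     int
	ignoreHidden bool
	formOnly     bool
}

// option mutates a config; each with... helper returns one.
type option func(*config)

func withDOMMaxDepth(depth int) option {
	return func(c *config) { c.maxDepth = depth }
}

func withIgnoreHidden(enabled bool) option {
	return func(c *config) { c.ignoreHidden = enabled }
}

func withDOMFormOnly(enabled bool) option {
	return func(c *config) { c.formOnly = enabled }
}

// newConfig starts from defaults and applies options in order,
// so a later option overrides an earlier one for the same field.
func newConfig(opts ...option) config {
	c := config{maxDepth: 8, ignoreHidden: true} // assumed defaults
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newConfig())
	fmt.Printf("%+v\n", newConfig(withDOMMaxDepth(4), withDOMFormOnly(true)))
}
```

The appeal of this design is that callers who want the defaults pass nothing, and new options can be added later without changing existing call sites.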
For 64-bit SimHash, near-duplicate detection often starts around Hamming distance <= 3-5, but this is dataset-dependent. Tune thresholds on your corpus and objective (precision vs recall).
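One way to tune a threshold is to label a sample of page pairs by hand and measure precision and recall at each candidate cutoff. The sketch below assumes that workflow. The labeledPair type, the evaluate function, and the sample data are hypothetical and made up for illustration.

```go
package main

import "fmt"

// labeledPair is a hypothetical record: the Hamming distance between two
// fingerprints plus a human judgment of whether the pages are near-duplicates.
type labeledPair struct {
	distance  int
	duplicate bool
}

// evaluate returns precision and recall when every pair with
// distance <= threshold is predicted to be a near-duplicate.
func evaluate(pairs []labeledPair, threshold int) (precision, recall float64) {
	var tp, fp, fn int
	for _, p := range pairs {
		predicted := p.distance <= threshold
		switch {
		case predicted && p.duplicate:
			tp++
		case predicted && !p.duplicate:
			fp++
		case !predicted && p.duplicate:
			fn++
		}
	}
	if tp+fp > 0 {
		precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		recall = float64(tp) / float64(tp+fn)
	}
	return precision, recall
}

func main() {
	// Made-up sample data for illustration only.
	pairs := []labeledPair{
		{1, true}, {2, true}, {4, true}, {7, true},
		{3, false}, {12, false}, {30, false},
	}
	for _, t := range []int{2, 3, 5, 8} {
		p, r := evaluate(pairs, t)
		fmt.Printf("threshold=%d precision=%.2f recall=%.2f\n", t, p, r)
	}
}
```

Raising the threshold generally trades precision for recall, so the right cutoff depends on whether false matches or missed matches cost more in your application.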
Do not shingle before SimHash in this workflow. Shingling is mainly useful for MinHash/Jaccard style similarity; this package is intended for direct token streams into SimHash.
Run:
go run ./cmd/example [fileA.html fileB.html]
Without args, it compares:
https://evanleleux.dev/simhash/page-01 through https://evanleleux.dev/simhash/page-10

Default output includes per-page hashes plus adjacent-page similarity comparisons.
You can also pass two local file paths or URLs for direct pair comparison.