Package fingerprint provides functionality to calculate, compare and analyse acoustic fingerprints of raw audio data. According to Wikipedia, an acoustic fingerprint is a condensed digital summary, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate similar items in an audio database.

Installation

You should also install a package containing the implementation of a fingerprinting algorithm. Currently only bindings to the chromaprint library are supported.

Usage
Package static provides a handler for static file serving with cache control and automatic fingerprinting.
Package comet implements a BM25-based full-text search index.

WHAT IS BM25?

BM25 (Best Matching 25) is a probabilistic ranking function used to estimate the relevance of documents to a given search query. It is one of the most widely used ranking functions in information retrieval.

HOW BM25 WORKS:

For a given query Q with terms {t1, t2, ..., tn} and document D:

1. Tokenizes and normalizes both query and documents using UAX#29 word segmentation
2. For each query term, calculates:
3. Final score is the sum of (IDF × TF) for all query terms

KEY PARAMETERS:

TIME COMPLEXITY:

MEMORY REQUIREMENTS:

- Stores inverted index (term -> docIDs) using roaring bitmaps for compression
- Stores term frequencies (term -> docID -> count)
- Stores document lengths and tokens (not full text)
- Much more memory efficient than storing full document text

GUARANTEES & TRADE-OFFS:

✓ Pros:

✗ Cons:

WHEN TO USE:

Use BM25 index when:

1. You need full-text search with relevance ranking
2. You want fast keyword-based search
3. Memory efficiency is important (vs storing full text)
4. You have your own document store and just need search

Package comet provides a high-performance hybrid vector search library for Go.

Comet combines multiple indexing strategies and search modalities into a unified, efficient package. It supports semantic search (vector embeddings), full-text search (BM25), metadata filtering, and hybrid search with score fusion.

Comet is built for developers who want to understand how vector databases work from the inside out. It provides production-ready implementations of modern vector search algorithms with comprehensive documentation and examples.

Create a vector index and perform similarity search:

Comet provides five vector index implementations, each with different tradeoffs:

FlatIndex: Brute-force exhaustive search with 100% recall. Best for small datasets (<10K vectors) or when perfect accuracy is required.
HNSWIndex: Hierarchical graph-based search with 95-99% recall and O(log n) performance. Best for most production workloads (10K-10M vectors).

IVFIndex: Inverted file index using k-means clustering with 85-95% recall. Best for large datasets (>100K vectors) with moderate accuracy requirements.

PQIndex: Product quantization for massive memory compression (10-500x smaller) with 70-85% recall. Best for memory-constrained environments.

IVFPQIndex: Combines IVF and PQ for maximum scalability with 70-90% recall. Best for billion-scale datasets.

Three distance metrics are supported:

Euclidean (L2): Measures absolute spatial distance. Use when magnitude matters.

L2Squared: Squared Euclidean distance (faster, preserves ordering). Use for better performance when only relative distances matter.

Cosine: Measures angular similarity, independent of magnitude. Use for normalized vectors like text embeddings.

BM25-based full-text search with Unicode tokenization:

Fast filtering using Roaring Bitmaps and Bit-Sliced Indexes:

Combine vector, text, and metadata search with score fusion:

When combining results from multiple search modalities, different fusion strategies are available:

WeightedSumFusion: Linear combination with configurable weights

ReciprocalRankFusion: Rank-based fusion (scale-independent, recommended)

MaxFusion/MinFusion: Simple maximum or minimum across modalities

All indexes support persistence:

HNSW parameters for tuning search quality:

IVF parameters for tuning speed/accuracy:

All indexes are safe for concurrent use. Multiple goroutines can search simultaneously while one goroutine adds or removes vectors.

Document Search: Use vector embeddings for semantic search in documentation, knowledge bases, or content management systems.

Product Recommendations: Combine product image embeddings with metadata filters for personalized recommendations.

Question Answering: Use hybrid search (vector + BM25) for retrieval-augmented generation (RAG) systems.
Duplicate Detection: Use high-recall vector search to find near-duplicate documents or images.

Multi-modal Search: Combine text, image embeddings, and structured metadata for comprehensive search experiences.

Choose the right index type:

Use appropriate distance metrics:

Batch operations:

Training indexes:

Metadata filtering:

For detailed API documentation, see the godoc comments on each type and function. For more examples and use cases, visit: https://github.com/wizenheimer/comet

Package comet implements a k-Nearest Neighbors (kNN) flat index for similarity search.

WHAT IS A FLAT INDEX?

A flat index is the most naive and simple approach to similarity search. The term "flat" indicates that vectors are stored without any compression or transformation - they are stored "as-is" in their original form. This is also known as brute-force or exhaustive search.

HOW kNN WORKS:

For a given query vector Q, the algorithm:

1. Calculates the distance from Q to EVERY vector in the dataset
2. Sorts all distances
3. Returns the k vectors with the smallest distances

TIME COMPLEXITY:

MEMORY REQUIREMENTS:

- 4 bytes per float32 component
- Total per vector: 4 * d bytes (where d is the dimensionality)
- No compression, so memory scales linearly with dataset size

GUARANTEES & TRADE-OFFS:

✓ Pros:

✗ Cons:

WHEN TO USE:

Use flat index only when:

1. Dataset size or embedding dimensionality is relatively small
2. You MUST have 100% accuracy (e.g., fingerprint matching, security applications)
3. Speed is not a critical concern

Package comet implements HNSW (Hierarchical Navigable Small World).

WHAT IS HNSW?

HNSW is a state-of-the-art graph-based algorithm for approximate nearest neighbor search. It builds a multi-layered graph where search is O(log n) - incredibly fast!
Layer 2: Few nodes, long-range connections (highways)
Layer 1: More nodes, medium-range connections (state roads)
Layer 0: All nodes, short-range connections (local streets)

Search starts at the top layer and descends, getting more refined at each level!

PERFORMANCE:

TIME COMPLEXITY:

Package comet implements a hybrid search index that combines vector, text, and metadata search.

WHAT IS HYBRIDSEARCHINDEX?

HybridSearchIndex is a facade that provides a unified interface over three specialized indexes:

1. VectorIndex: For semantic similarity search using vector embeddings
2. TextIndex: For keyword-based BM25 full-text search
3. MetadataIndex: For filtering by structured metadata attributes

HOW IT WORKS:

The index maintains three separate indexes internally and coordinates search across them. When searching, it follows this flow:

1. Apply metadata filters first (if any) to get candidate document IDs
2. Pass candidate IDs to vector and/or text search for relevance ranking
3. Combine results from multiple search modes using score aggregation

SEARCH MODES:

- Vector-only: Semantic similarity search using embeddings
- Text-only: Keyword-based BM25 search
- Metadata-only: Pure filtering without ranking
- Hybrid: Combine any or all of the above with score aggregation

WHEN TO USE:

Use HybridSearchIndex when:

1. You need to combine multiple search modalities
2. You want to filter by metadata before expensive vector search
3. You need both semantic and keyword-based search
4. You want a simple unified API instead of managing multiple indexes

Package comet implements a k-Nearest Neighbors (kNN) IVF index for similarity search.

WHAT IS AN IVF INDEX?

IVF (Inverted File Index) is a partitioning-based approximate nearest neighbor search algorithm. It divides the vector space into Voronoi cells using k-means clustering, then searches only the nearest cells instead of scanning all vectors.

HOW IVF WORKS:

Training Phase:

1. Run k-means on training vectors to learn nlist cluster centroids
2. These centroids define Voronoi partitions of the vector space

Indexing Phase:

1. For each vector, find its nearest centroid
2. Add the vector to that centroid's inverted list

Search Phase:

1. Find the nprobe nearest centroids to the query vector
2. Search only the vectors in those nprobe inverted lists
3. Return the top-k nearest neighbors from candidates

TIME COMPLEXITY:

MEMORY REQUIREMENTS:

- Vectors: 4 × n × dim bytes (stored as-is)
- Centroids: 4 × nlist × dim bytes
- Lists: negligible overhead (just pointers)
- Total: ~4 × (n + nlist) × dim bytes

ACCURACY VS SPEED TRADEOFF:

- nprobe = 1: Fastest, lowest recall (~30-50%)
- nprobe = sqrt(nlist): Good balance (~70-90% recall)
- nprobe = nlist: Same as flat search (100% recall)

CHOOSING NLIST:

Rule of thumb: nlist = sqrt(n) or nlist = 4*sqrt(n)

- For 1M vectors: nlist = 1,000 to 4,000
- For 100K vectors: nlist = 316 to 1,264
- For 10K vectors: nlist = 100 to 400

WHEN TO USE IVF:

Use IVF when:

1. Dataset is large (>10K vectors)
2. You can tolerate ~90-95% recall (not 100%)
3. You want 10-100x speedup over flat search
4. Memory usage is not a primary concern

DON'T use IVF when:

1. Dataset is small (<10K vectors) - use flat index
2. You need 100% recall - use flat index
3. Memory is very limited - use PQ or IVFPQ

Package comet implements IVFPQ (Inverted File with Product Quantization).

WHAT IS IVFPQ?

IVFPQ combines IVF (scope reduction) with PQ (compression) to create one of the most powerful similarity search algorithms. It's the workhorse of large-scale vector search systems.

RESIDUAL VECTORS

IVFPQ encodes RESIDUALS (vector - centroid) instead of original vectors. This dramatically improves compression quality because:

PERFORMANCE:

TIME COMPLEXITY:

Package comet implements a metadata filtering index for vector search.

WHAT IS A METADATA INDEX?

A metadata index enables fast filtering of documents based on structured metadata attributes before performing expensive vector similarity searches.
This dramatically improves search performance by reducing the candidate set.

HOW IT WORKS:

The index uses two specialized data structures:

1. Roaring Bitmaps: For categorical fields (strings, booleans)
2. Bit-Sliced Index (BSI): For numeric fields (integers, floats)

QUERY TYPES:

- Equality: field = value
- Inequality: field != value
- Comparisons: field > value, field >= value, field < value, field <= value
- Range: field BETWEEN min AND max
- Set membership: field IN (val1, val2, val3)
- Set exclusion: field NOT IN (val1, val2)
- Existence: field EXISTS, field NOT EXISTS

TIME COMPLEXITY:

MEMORY REQUIREMENTS:

- Roaring bitmaps: Highly compressed, typically 1-10% of uncompressed size
- BSI: ~64 bits per numeric value (compressed with roaring)
- Much more efficient than traditional B-tree indexes for high-cardinality data

GUARANTEES & TRADE-OFFS:

✓ Pros:

✗ Cons:

WHEN TO USE:

Use metadata index when:

1. Pre-filtering documents before vector search
2. Need to filter by structured attributes (price, date, category, etc.)
3. Working with large datasets (100K+ documents)
4. Need sub-millisecond filter performance

Package comet implements Product Quantization (PQ) for similarity search.

WHAT IS PRODUCT QUANTIZATION?

PQ is a lossy compression technique that dramatically reduces memory usage for vector storage while enabling approximate similarity search. It achieves compression ratios of 10-500x by dividing vectors into subspaces and quantizing each independently.

THE CORE IDEA - DIVIDE AND COMPRESS:

Instead of storing full high-dimensional vectors:

1. Divide each vector into M equal-sized subvectors (subspaces)
2. Learn a codebook of K centroids for each subspace via k-means
3. Encode each subvector with the ID of its nearest centroid
4. Store only these compact codes instead of original vectors

COMPRESSION EXAMPLE:

Original: 768 dims × 4 bytes = 3,072 bytes
PQ (M=8, K=256): 8 subspaces × 1 byte = 8 bytes
Compression: 384x smaller!
TIME COMPLEXITY:

WHEN TO USE PQ:

- Dataset too large for RAM
- Can tolerate 85-95% recall
- L2 or inner product metric
- Want massive compression
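The divide-and-compress idea above can be sketched in a few lines of Go. This is a minimal illustration, not comet's actual API: the codebooks are assumed to be already trained by k-means, and the `codebooks[m][k]` layout is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// encodePQ compresses a vector into M one-byte codes, one per subspace.
// codebooks[m] holds up to 256 centroids for subspace m, each of
// length len(vec)/M. Each subvector is replaced by the index of its
// nearest centroid, so an entire subvector collapses into one byte.
func encodePQ(vec []float32, codebooks [][][]float32) []byte {
	m := len(codebooks)
	sub := len(vec) / m
	codes := make([]byte, m)
	for i := 0; i < m; i++ {
		subvec := vec[i*sub : (i+1)*sub]
		best, bestDist := 0, math.MaxFloat64
		for k, centroid := range codebooks[i] {
			var d float64
			for j := range subvec {
				diff := float64(subvec[j] - centroid[j])
				d += diff * diff
			}
			if d < bestDist {
				best, bestDist = k, d
			}
		}
		codes[i] = byte(best)
	}
	return codes
}

func main() {
	// Toy example: 4-dim vectors, M=2 subspaces, 2 centroids per subspace.
	codebooks := [][][]float32{
		{{0, 0}, {1, 1}}, // centroids for dims 0-1
		{{0, 0}, {5, 5}}, // centroids for dims 2-3
	}
	codes := encodePQ([]float32{0.9, 1.1, 4.8, 5.2}, codebooks)
	fmt.Println(codes) // [1 1]: both subvectors snap to centroid 1
}
```

With M=8 and K=256, the same loop turns a 3,072-byte vector into the 8-byte code shown in the compression example above.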
Package podfingerprint computes the fingerprint of a set of pods. "Pods" is meant in the kubernetes sense: https://kubernetes.io/docs/concepts/workloads/pods/ but for the purposes of this package, a Pod is identified by just its namespace + name pair. A "fingerprint" is a compact, unique representation of this set of pods. Any given unordered set of pods with the same elements will yield the same fingerprint, regardless of the order in which the pods are enumerated. The fingerprint is not actually unique, because it is implemented using a hash function, but collisions are expected to be extremely rare. Note that this package will *NOT* restrict itself to cryptographically secure hash functions, so you should NOT use the fingerprint in security-sensitive contexts.
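One simple way to get such an order-independent fingerprint (a sketch of the idea, not necessarily how this package computes it) is to sort the namespace/name pairs before feeding them to a non-cryptographic hash:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint returns an order-independent hash of a set of pods, each
// identified by its "namespace/name" pair. Sorting first makes the
// result independent of enumeration order; FNV-1a is fast but NOT
// cryptographically secure, matching the package's caveat.
func fingerprint(pods []string) uint64 {
	sorted := append([]string(nil), pods...)
	sort.Strings(sorted)
	h := fnv.New64a()
	for _, p := range sorted {
		h.Write([]byte(p))
		h.Write([]byte{0}) // separator avoids ambiguous concatenations
	}
	return h.Sum64()
}

func main() {
	a := fingerprint([]string{"default/web", "kube-system/dns"})
	b := fingerprint([]string{"kube-system/dns", "default/web"})
	fmt.Println(a == b) // true: enumeration order does not matter
}
```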
Package ja3 provides JA3 Client Fingerprinting for the Go language by looking at the TLS Client Hello packets.

Basic Usage

ja3 takes in TCP payload data as a []byte and computes the corresponding JA3 string and digest.
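The JA3 digest itself is an MD5 over a comma-separated string of five Client Hello fields (TLS version, cipher suites, extensions, elliptic curves, point formats, each list joined with dashes). The following sketch shows only that final step; parsing the fields out of the raw TCP payload, which is what this package actually does, is omitted, and `ja3Digest` is a hypothetical helper name.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"strconv"
	"strings"
)

// ja3Digest builds the JA3 string from already-parsed Client Hello
// fields and returns its MD5 hex digest.
func ja3Digest(version uint16, ciphers, extensions, curves, pointFormats []uint16) string {
	join := func(vals []uint16) string {
		parts := make([]string, len(vals))
		for i, v := range vals {
			parts[i] = strconv.Itoa(int(v))
		}
		return strings.Join(parts, "-")
	}
	// Field order is fixed by the JA3 specification.
	ja3 := fmt.Sprintf("%d,%s,%s,%s,%s",
		version, join(ciphers), join(extensions), join(curves), join(pointFormats))
	return fmt.Sprintf("%x", md5.Sum([]byte(ja3)))
}

func main() {
	// 771 = TLS 1.2; the other values are illustrative only.
	digest := ja3Digest(771,
		[]uint16{4865, 4866}, []uint16{0, 10}, []uint16{29, 23}, []uint16{0})
	fmt.Println(digest) // 32-character hex JA3 fingerprint
}
```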
Package chunker implements Content Defined Chunking (CDC) based on a rolling Rabin Checksum.

The function RandomPolynomial() returns a new random polynomial of degree 53 for use with the chunker. The degree 53 is chosen because it is the largest prime below 64-8 = 56, so that the top 8 bits of an uint64 can be used for optimising calculations in the chunker.

A random polynomial is chosen by selecting 64 random bits, masking away bits 64..54 and setting bit 53 to one (otherwise the polynomial is not of the desired degree) and bit 0 to one (otherwise the polynomial is trivially reducible), so that 51 bits are chosen at random. This process is repeated until Irreducible() returns true, then this polynomial is returned. If this doesn't happen after 1 million tries, the function returns an error. The probability of selecting an irreducible polynomial at random is about 7.5% ( (2^53-2)/53 / 2^51 ), so the probability that no irreducible polynomial has been found after 100 tries is lower than 0.04%.

During development the results have been verified using the computational discrete algebra system GAP, which can be obtained from the website at http://www.gap-system.org/. For filtering a given list of polynomials in hexadecimal coefficient notation, the following script can be used: All irreducible polynomials from the list are written to the output.

An introduction to Rabin Fingerprints/Checksums can be found in the following articles:

Michael O. Rabin (1981): "Fingerprinting by Random Polynomials" http://www.xmailserver.org/rabin.pdf

Ross N. Williams (1993): "A Painless Guide to CRC Error Detection Algorithms" http://www.zlib.net/crc_v3.txt

Andrei Z. Broder (1993): "Some Applications of Rabin's Fingerprinting Method" http://www.xmailserver.org/rabin_apps.pdf

Shuhong Gao and Daniel Panario (1997): "Tests and Constructions of Irreducible Polynomials over Finite Fields" http://www.math.clemson.edu/~sgao/papers/GP97a.pdf

Andrew Kadatch, Bob Jenkins (2007): "Everything we know about CRC but afraid to forget" http://crcutil.googlecode.com/files/crc-doc.1.0.pdf
Package fingerprint provides methods for working with SVG files produced by the aaronland/fingerprint tool.
Package meekserver is the server transport plugin for the meek pluggable transport. It acts as an HTTP server, keeps track of session ids, and forwards received data to a local OR port.

Sample usage in torrc:

Using your own TLS certificate:

Plain HTTP usage:

The server runs in HTTPS mode by default, getting certificates from Let's Encrypt automatically. The server opens an auxiliary ACME listener on port 80 in order for the automatic certificates to work. If you have your own certificate, use the --cert and --key options. Use the --disable-tls option to run with plain HTTP.

Package meekserver provides an implementation of the Meek circumvention protocol. Only a client implementation is provided, and no effort is made to normalize the TLS fingerprint. It borrows quite liberally from the real meek-client code.
Package synapse is a wrapper library for the Synapse API (https://docs.synapsefi.com)

Instantiate client

Enable logging & turn off developer mode (developer mode is true by default)

Register Fingerprint

Set an `IDEMPOTENCY_KEY` (for `POST` requests only)

Submit optional query parameters
Package simhash implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document. Simhash fingerprints have the property that similar documents will have similar fingerprints; therefore, the hamming distance between two fingerprints will be small if the documents are similar. For a standalone test, change the package to `main` and the next func def to func main() {
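The similarity comparison mentioned above reduces to the Hamming distance between two 64-bit fingerprints, which is a popcount of their XOR. A minimal sketch, independent of this package's actual API:

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bit positions at which two 64-bit simhash
// fingerprints differ. XOR leaves a 1 exactly where the bits disagree,
// and OnesCount64 counts those 1s. Small distances mean similar documents.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	a := uint64(0b1011_0010)
	b := uint64(0b1011_0110)
	fmt.Println(hammingDistance(a, b)) // 1: the fingerprints differ in one bit
}
```

A typical duplicate detector would then treat fingerprints within some small threshold (e.g. distance <= 3) as near-duplicates.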
Package virgil is the pure Go implementation of a Virgil Security compatible SDK. Right now it supports only ed25519 keys and signatures and curve25519 key exchange. For symmetric crypto it uses AES256-GCM. The hashes used are SHA-384 for signatures and SHA-256 for fingerprints.
Package meeklite provides an implementation of the Meek circumvention protocol. Only a client implementation is provided, and no effort is made to normalize the TLS fingerprint. It borrows quite liberally from the real meek-client code.