min-mphash (npm package, latest version 0.6.0)

MinMPHash & MinMPLookup & MinMPHFilter

Mini Minimal Perfect Hash & Mini Minimal Perfect Lookup

中文 README (Chinese version)

A minimal perfect hash and lookup implementation for TypeScript/JavaScript.

MinMPHash maps a set of n strings to the integer range [0, n-1] with no collisions. If only the hash function is used, the package is smaller than 3 KB (gzipped).

MinMPLookup is a minimal perfect lookup table built on top of MinMPHash.

It stores maps of the form { key1: [value1, value2, ...], key2: [value3, value4, ...] } in minimized form, so that the key corresponding to a given value can be looked up.

Compared to storing the map as-is, the minimized lookup table can shrink it to less than 10% of the original size (the exact ratio depends on the information entropy of the values in the dataset: the higher the entropy, the better the compression).

What is Minimal Perfect Hash?

Hash functions map data into a range, producing a fixed-length "fingerprint". Ordinary hash functions, however, have two common problems:

  • Space wasted by sparsity: the hash range is usually much larger than the actual amount of data, so the hash table ends up very sparse.
  • Collisions: different inputs may map to the same hash value. Reducing the collision rate usually requires longer hash values, which again wastes space.

Minimal Perfect Hash Function (MPHF) is a special class of hash functions:

  • It guarantees no collisions for a given set of n distinct inputs;
  • The output range is exactly [0, n-1], making space utilization optimal.

In other words, if you have n different strings, MPHF will map them one-to-one to the integer range [0, n-1], with each string corresponding to a unique index.

    text set               Hash              Hash Table (Sparse)
  +----------+        +------------+        +-------------------+
  |  Apple   | ---->  |    h(x)    | ---->  | 0: [ Apple ]      |
  |  Banana  |        +------------+        | 1:                | <--- Gap
  |  Cherry  |                              | 2: [ Banana ]     |
  +----------+                              | 3:                | <--- Gap
                                            |       ...         | <--- Gap
                                            | 9: [ Cherry ]     |
                                            +-------------------+
                                              (Waste of Space)


    text set      🤩 Minimal Perfect Hash      Hash Table (Compact)
  +----------+        +--------------+        +-------------------+
  |  Apple   | ---->  |   mmph(x)    | ---->  | 0: [ Apple ]      |
  |  Banana  |        +--------------+        | 1: [ Banana ]     |
  |  Cherry  |                                | 2: [ Cherry ]     |
  +----------+                                +-------------------+
                                             ( 100% Space Utilization )
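The idea above can be made concrete with a toy sketch: brute-force a seed for an ordinary hash until the mapping happens to be a bijection on the set. This is for illustration only and is not min-mphash's construction algorithm (brute force does not scale to large sets), but it shows in code what "minimal perfect" means.

```javascript
// Toy illustration only: find a seed s such that fnv1a(key, s) % n
// maps every key in the set to a distinct slot in [0, n-1].
function fnv1a(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function findPerfectSeed(keys) {
  const n = keys.length;
  for (let seed = 0; seed < 1e6; seed++) {
    const slots = new Set(keys.map((k) => fnv1a(k, seed) % n));
    if (slots.size === n) return seed; // every key got its own slot
  }
  throw new Error("no perfect seed found");
}

const keys = ["Apple", "Banana", "Cherry"];
const seed = findPerfectSeed(keys);
const indices = keys.map((k) => fnv1a(k, seed) % keys.length);
console.log(indices); // a permutation of [0, 1, 2]
```

For a tiny set a suitable seed is found almost immediately; real MPHF algorithms achieve the same guarantee for millions of keys with a few bits of overhead per key.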

When to use Minimal Perfect Hash?

Minimal perfect hashing suits scenarios where a fixed set of keys must be mapped to compact integer indices. Compared with an ordinary mapping table, an MPHF avoids storing the complete keys: only the integer index of each key needs to be stored to provide the same mapping.

In other words, as long as the keys in the dataset are fixed (they will not change), a key of any length can be stored and uniquely identified, collision-free, by a number no larger than the dataset size.

This is exactly what general-purpose hashes (such as MD5 or SHA-1) cannot do: the values they produce are long (16 bytes for MD5, 20 bytes for SHA-1), whereas an MPHF can choose the smallest integer range that fits the dataset, achieving far higher space utilization.

For example, suppose you have a list of font names and want to map each font name to its font family name. The usual approach is a mapping table:

let FontMap = {
  "Source Sans": "Source Sans",
  "Source Sans Black": "Source Sans",
  "Source Sans Bold": "Source Sans",
  "思源黑体 CN ExtraLight": "Source Sans",
  "思源黑体 TW Light": "Source Sans",
  // ... 6000+
};

let query = "思源黑体 TW Light";
let found = FontMap[query]; // 'Source Sans'

Such a mapping table must store every key (here, every font name), which takes a lot of space when there are many keys. With a minimal perfect hash, we can store only the index (hash value) of each font name instead of the full name:

// Create a set containing all font names
let values = [
  "Source Sans",
  "Source Sans Black",
  "Source Sans Bold",
  "思源黑体 CN ExtraLight",
  "思源黑体 TW Light",
  // ... 6000+
];

// Create minimal perfect hash dictionary based on values
let dict = createMinMPHashDict(values);
// Create hash function instance based on dictionary
let minHash = new MinMPHash(dict);

// Then we use hash values to replace full font names:
let FontMapWithHash = {
  "Source Sans": [1, 2, 3, 21 /* ... */],
  Heiti: [12, 12 /* ... */],
  "JetBrains Mono": [32, 112 /* ... */],
  // ...
};

// When querying, first calculate the hash value, then find the corresponding font family name through the hash value
let query = "思源黑体 TW Light";
let query_hash = minHash.hash(query); // e.g. 42

let found = Object.entries(FontMapWithHash).find(([family, hashes]) =>
  hashes.includes(query_hash)
)[0]; // 'Source Sans'

This significantly reduces storage space, because the full key text no longer needs to be saved, only a much shorter integer index.

You might think that hash functions like MD5 or SHA-1 can also generate identifiers, but their values are long (16 bytes for MD5, 20 bytes for SHA-1). FNV-1a can go as low as 4 bytes in some scenarios, but its collision rate is correspondingly high. A minimal perfect hash chooses the minimum range for the dataset size, guaranteeing no collisions with optimal space utilization.
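A back-of-envelope calculation illustrates the saving for the font example above. The average key length of 24 bytes is an assumption made up for this sketch, not a measured figure:

```javascript
// Illustrative comparison: storing 6000 font-name keys as text
// vs storing MPHF indices. avgKeyBytes is an assumed figure.
const n = 6000;
const avgKeyBytes = 24; // assumed average UTF-8 length of a font name

const rawKeyBytes = n * avgKeyBytes;

// An MPHF index for n items needs ceil(log2(n)) ≈ 13 bits,
// i.e. at most 2 bytes per key when byte-aligned.
const indexBytes = n * Math.ceil(Math.log2(n) / 8);

console.log(rawKeyBytes, indexBytes); // 144000 12000
```

Under these assumptions the index table is about one twelfth the size of the raw keys, before even considering the compact dictionary encoding.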

Usage

Installation

npm install min-mphash

This package provides ESM only; with a build tool, tree-shaking can effectively reduce the bundle size.

// hash build & hash function
min-mphash/index.js         34.7 kB
min-mphash/index.min.js     14.6 kB   4.9 kB(Gzip)


// hash function only
min-mphash/runtime.js       18.3 kB
min-mphash/runtime.min.js   7.9 kB    2.8 kB(Gzip)

MinMPHash Usage

This is the core function, used to map a set of strings to integers in [0, n-1].

Step 1: Create Dictionary (Build Time)

Generate the hash dictionary in your build script or server-side code.

import { createMinMPHashDict } from "min-mphash";
import * as fs from "fs";

// Example string set
const mySet = ["Apple", "Banana", "Cherry", "Date", "Elderberry"];

// Create dictionary binary data
// outputBinary: true returns Uint8Array, suitable for storage or network transmission
const dictBuffer = createMinMPHashDict(mySet, {
  outputBinary: true,
  level: 5, // Optimization level [1-10], higher is smaller but slower to build
});

fs.writeFileSync("mph-dict.bin", dictBuffer);

Step 2: Use Dictionary to Generate Hash (Runtime)

Load the dictionary and perform hash queries in your application (e.g., browser side).

import { MinMPHash } from "min-mphash";

// Assume you have already loaded the binary data
const dictBuffer = await fetch("mph-dict.bin").then((res) => res.arrayBuffer());
const dict = new Uint8Array(dictBuffer);

const mph = new MinMPHash(dict);

console.log(mph.hash("Apple")); // 0 (or other unique value between 0-4)
console.log(mph.hash("Banana")); // 2
console.log(mph.hash("Cherry")); // 4

// Note: For strings not in the set, it will also return a value in [0, n-1] (this is a property of MPHF),
// unless you enable **Validation Mode** (see below).
console.log(mph.hash("sdfsd94jx#*")); // May return 1

Validation Mode onlySet

Standard minimal perfect hash functions also return an in-range index for inputs that are not in the set (this is inherent to MPHF). If your application needs to detect out-of-set queries as "misses", enable validation mode onlySet when creating the dictionary.

onlySet stores a fingerprint of each key at the cost of extra space. At query time the fingerprint is verified; if verification fails, -1 is returned to indicate a miss.

let dict = createMinMPHashDict(mySet, { onlySet: "8" });
| onlySet | Space Usage (per key) | False Positive Rate |
| ------- | --------------------- | ------------------- |
| 2       | 0.25 byte             | ~25%                |
| 4       | 0.5 byte              | ~6.25%              |
| 8       | 1 byte                | ~0.39%              |
| 16      | 2 bytes               | ~0.0015%            |
| 32      | 4 bytes               | ~0.00000002%        |

Note: "False Positive Rate" in the table refers to the probability that an input not in the set is incorrectly judged to be in the set; if the key is indeed in the set, the verification always succeeds.
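The rates in the table follow directly from the fingerprint width: a k-bit fingerprint matches an out-of-set input by chance with probability 2^-k. A quick sketch:

```javascript
// False positive rate of a k-bit fingerprint, as a percentage:
// an out-of-set input matches a stored k-bit value with
// probability 1 in 2^k.
const fprPercent = (bits) => 100 / 2 ** bits;

console.log(fprPercent(2)); // 25
console.log(fprPercent(4)); // 6.25
console.log(fprPercent(8)); // 0.390625
console.log(fprPercent(16)); // 0.00152587890625 (~0.0015%)
```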

Dictionary Format: JSON/CBOR/CBOR.Gzip

createMinMPHashDict can output dictionaries in multiple formats:

  • Binary ({ outputBinary: true }): returns a CBOR-encoded Uint8Array
  • Compressed binary ({ outputBinary: true, enableCompression: true }): a Gzip-compressed CBOR Uint8Array
  • JSON (default): a plain JavaScript object, convenient for debugging and inspection

In general, the compressed binary format is recommended, as it is the smallest. JSON is more convenient during development, and if your server/CDN applies transparent compression you can ship JSON directly; the final size difference is small.

MinMPLookup Usage

If you need value -> key lookup (for example: font family name from font file name, or country from city) over a large dataset, you can use MinMPLookup. It combines MPHF with differential encoding to compress the mapping dramatically.
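The differential-encoding half of this can be sketched in a few lines: once values are mapped to sorted integer indices by the MPHF, only the small gaps between consecutive indices need storing. This is a generic sketch of the idea; min-mphash's actual on-disk layout is not documented here and may differ.

```javascript
// Generic sketch of differential (delta) encoding: a sorted list
// of hash indices is stored as gaps instead of absolute values,
// so each entry is a small number that compresses well.
function deltaEncode(sortedInts) {
  let prev = 0;
  return sortedInts.map((v) => {
    const gap = v - prev;
    prev = v;
    return gap;
  });
}

function deltaDecode(gaps) {
  let acc = 0;
  return gaps.map((g) => (acc += g));
}

const indices = [3, 7, 8, 15, 42];
const gaps = deltaEncode(indices); // [3, 4, 1, 7, 27]
console.log(deltaDecode(gaps)); // [3, 7, 8, 15, 42]
```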

Scenario

Suppose you have the following mapping:

const lookupMap = {
  China: ["Beijing", "Shanghai", "Guangzhou"],
  USA: ["New York", "Los Angeles"],
  Japan: ["Tokyo"],
};
// Goal: Input "Shanghai" -> Get "China"

Create Lookup Dictionary

import { createMinMPLookupDict } from "min-mphash";

const lookupMap = {
  China: ["Beijing", "Shanghai", "Guangzhou"],
  USA: ["New York", "Los Angeles"],
  Japan: ["Tokyo"],
};

// Generate compressed binary dictionary
const lookupDictBin = createMinMPLookupDict(lookupMap, {
  outputBinary: true,
  enableCompression: true, // Enable built-in Gzip compression (Node/Bun environment or browsers supporting CompressionStream only)
});

// Save to file
// fs.writeFileSync("lookup.bin", lookupDictBin);

Query

import { MinMPLookup } from "min-mphash";

// Load dictionary
const lookup = await MinMPLookup.fromCompressed(lookupDictBin);
// If enableCompression is not enabled, use MinMPLookup.fromBinary(bin)

console.log(lookup.query("Shanghai")); // "China"
console.log(lookup.queryAll("New York")); // ["USA"]
console.log(lookup.query("Unknown City")); // null
console.log(lookup.keys()); // ["China", "USA", "Japan"]

Validation Mode onlySet

Standard minimal perfect hash functions also return an in-range index for inputs not in the set. To address this, MinMPLookup supports the same built-in onlySet validation mode, ensuring that out-of-set lookups return null.

const lookupDictBin = createMinMPLookupDict(lookupMap, {
  onlySet: "8", // Enable 8-bit validation mode
});

MinMPHFilter Usage

MinMPHFilter is a static set query tool similar to a Bloom filter. It can efficiently determine whether an element exists in a set with a configurable false positive rate.

Background

If you have a set of strings, need to query whether a given string belongs to it, and the set changes rarely (every change requires regenerating the dictionary file), you can use MinMPHFilter. It stores the set with maximal space savings, at a size close to the information-theoretic limit.

Its working principle is similar to a Bloom filter, but it is implemented with Golomb-coded sets (GCS), providing higher space efficiency and a configurable false positive rate.
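As a rough sketch of the idea, Golomb-coded sets store sorted hash values as Rice-coded gaps: each gap g is written as the unary quotient g >> k followed by k literal remainder bits. This is a generic illustration of the encoding, not min-mphash's exact format.

```javascript
// Rice coding, the core primitive of Golomb-coded sets (GCS).
// Generic illustration; min-mphash's actual format may differ.
function riceEncode(gap, k) {
  const q = gap >> k; // quotient, written in unary
  const r = gap & ((1 << k) - 1); // remainder, written in k bits
  return "1".repeat(q) + "0" + r.toString(2).padStart(k, "0");
}

function riceDecode(bits, k) {
  let q = 0;
  let i = 0;
  while (bits[i] === "1") { q++; i++; }
  i++; // skip the unary terminator "0"
  const r = parseInt(bits.slice(i, i + k), 2);
  return (q << k) | r;
}

const code = riceEncode(13, 3); // quotient 1, remainder 5
console.log(code, riceDecode(code, 3)); // "10101" 13
```

Small gaps (the common case when hashes are dense and sorted) produce short codes, which is what brings GCS close to the information-theoretic minimum.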

Usage

import { createMinMPHFilterDict, MinMPHFilter } from "min-mphash";
// Create filter dictionary
const filterDict = createMinMPHFilterDict(mySet, {
  //  False Positive Rate:
  //  *  - 6 : ~1.56%
  //  *  - 7 : ~0.78%
  //  *  - 8 : ~0.39%
  //  *  - 9 : ~0.20%
  //  *  - 10 : ~0.10%
  //  *  - 11 : ~0.05%
  //  *  - 12 : ~0.02%
  //  *  - 13 : ~0.01%
  //  *  - 14 : ~0.005%
  //  *  - 15 : ~0.0025%
  //  *  - 16 : ~0.0012%
  bitKey: "8",
  outputBinary: true, // Since the generated binary data has very high entropy, compression is basically unnecessary
});

// Use filter
const filter = new MinMPHFilter(filterDict);
console.log(filter.has("Apple")); // true

Benchmark

MinMPHash Dictionary Size Benchmark


=== MinMPHash Big Dataset Size Benchmark ===
Generating dataset of size 1000000...
Dataset size: 1000000 items

Dataset json size:        41836.25 KB
Dataset json gzip size:   6473.48 KB

➤ Optimization Level 5
  ┌──────────┬──────────────┬──────────────────────┬──────────────────────┬──────────────┬──────────────┐
  │ OnlySet  │ JSON Size    │ Binary Size (Ratio)  │ Gzip Size (Ratio)    │ vs None      │ Build Time   │
  ├──────────┼──────────────┼──────────────────────┼──────────────────────┼──────────────┼──────────────┤
  │ none     │    979.18 KB │  341.37 KB ( 35%)    │  268.36 KB ( 27%)    │      100.0 % │   2502.35 ms │
  │ 2        │   1849.51 KB │  585.38 KB ( 32%)    │  512.86 KB ( 28%)    │      171.5 % │   2981.46 ms │
  │ 4        │   2721.75 KB │  829.94 KB ( 30%)    │  757.49 KB ( 28%)    │      243.1 % │   3109.38 ms │
  │ 8        │   4465.27 KB │ 1318.06 KB ( 30%)    │ 1245.64 KB ( 28%)    │      386.1 % │   3132.11 ms │
  │ 16       │   6672.22 KB │ 2293.96 KB ( 34%)    │ 2222.43 KB ( 33%)    │      672.0 % │   3559.02 ms │
  │ 32       │  11468.63 KB │ 4247.06 KB ( 37%)    │ 4176.07 KB ( 36%)    │     1244.1 % │   2900.32 ms │
  └──────────┴──────────────┴──────────────────────┴──────────────────────┴──────────────┴──────────────┘

MinMPLookup Dictionary Size Benchmark


=== MinMPLookup Big Dataset Size Benchmark ===
Generating dataset of size 100000 values...
Dataset stats: 5000 keys, 100000 values

Dataset json size:        7577.04 KB
Dataset json gzip size:   5141.74 KB

➤ Optimization Level 5
  ┌──────────┬──────────────┬──────────────────────┬──────────────────────┬──────────────┬──────────────┐
  │ OnlySet  │ JSON Size    │ Binary Size (Ratio)  │ Gzip Size (Ratio)    │ vs None      │ Build Time   │
  ├──────────┼──────────────┼──────────────────────┼──────────────────────┼──────────────┼──────────────┤
  │ none     │    709.67 KB │  254.85 KB ( 36%)    │  199.59 KB ( 28%)    │      100.0 % │    412.65 ms │
  │ 2        │    797.00 KB │  279.23 KB ( 35%)    │  225.14 KB ( 28%)    │      109.6 % │    393.94 ms │
  │ 4        │    884.32 KB │  303.63 KB ( 34%)    │  248.92 KB ( 28%)    │      119.1 % │    408.93 ms │
  │ 8        │   1058.92 KB │  352.48 KB ( 33%)    │  297.58 KB ( 28%)    │      138.3 % │    477.32 ms │
  │ 16       │   1406.98 KB │  450.21 KB ( 32%)    │  395.21 KB ( 28%)    │      176.7 % │    421.70 ms │
  │ 32       │   2104.73 KB │  645.45 KB ( 31%)    │  591.02 KB ( 28%)    │      253.3 % │    374.06 ms │
  └──────────┴──────────────┴──────────────────────┴──────────────────────┴──────────────┴──────────────┘

MinMPHFilter Benchmark

=== MinMPHFilter Big Dataset Benchmark ===
Generating dataset of size 100000...
Dataset size: 100000 items

Dataset json size:        11024.31 KB
Dataset json gzip size:   7479.80 KB

  ┌────────┬──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
  │ BitKey │ Binary Size  │ Gzip Size    │ Build (ms)   │ Query 1 (ms) │ FPR (%)      │
  ├────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┤
  │ 6      │     95.67 KB │     94.94 KB │      65.2 ms │    0.0035 ms │      1.6140% │
  │ 8      │    120.80 KB │    120.06 KB │      57.3 ms │    0.0034 ms │      0.4230% │
  │ 10     │    145.22 KB │    144.54 KB │      57.2 ms │    0.0034 ms │      0.1110% │
  │ 12     │    169.63 KB │    169.20 KB │      58.2 ms │    0.0037 ms │      0.0250% │
  │ 14     │    194.41 KB │    193.74 KB │      60.1 ms │    0.0038 ms │      0.0070% │
  │ 16     │    219.21 KB │    218.67 KB │      56.4 ms │    0.0040 ms │      0.0020% │
  └────────┴──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘

Package last updated on 29 Dec 2025