text-similarity-node

High-performance and memory efficient native C++ text similarity algorithms for Node.js

Version: 1.2.0

Version published: 4 months ago

Weekly downloads: 101

Maintainers: 1

Created: 11 months ago

High-performance, memory-efficient native C++ text similarity algorithms for Node.js with full Unicode support. text-similarity-node provides a suite of production-ready algorithms that, in the project's benchmarks, outperform pure JavaScript alternatives, most clearly in memory usage and on long inputs. It is aimed at comparing large documents, where pure JavaScript libraries tend to slow down.

Key Features

High Performance: Native C++ implementation which is fast and efficient compared to pure JavaScript libraries
Memory Efficient: Optimized for low memory usage and high throughput
Asynchronous API: Non-blocking operations using worker threads
Unicode Support: Full UTF-8 support including emoji and international characters
Multiple Algorithms: 7+ algorithms for different similarity needs
Production Ready: Memory safety, comprehensive testing, and error handling
Easy Integration: Simple API compatible with existing workflows

Prerequisites

Before installing, ensure you have the necessary build tools installed on your system:

Windows

Visual Studio 2017 or newer (with "Desktop development with C++" workload installed).
Python 3.x (required by node-gyp).

macOS

Xcode Command Line Tools (xcode-select --install).

Linux

GCC/G++ and Python 3.x.

Installation

npm install text-similarity-node

CLI Usage

After installing globally, you can use the text-similarity command directly from your terminal:

npm install -g text-similarity-node

Similarity

Calculate a similarity score (0–1) between two strings:

# Default (Levenshtein)
text-similarity similarity "hello" "hallo"
# 0.8

# Choose a different algorithm
text-similarity similarity "hello" "hallo" -a jaro-winkler
# 0.88

# Case-insensitive comparison
text-similarity similarity "Hello" "hello" -i
# 1

# JSON output
text-similarity similarity "hello" "hallo" -a cosine -f json
# { "success": true, "value": 0.5 }

Distance

Calculate the distance between two strings:

text-similarity distance "kitten" "sitting"
# 3

text-similarity distance "hello" "hallo" -a hamming
# 1
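
The distances above can be reproduced by hand. The following is a minimal plain-JavaScript sketch of Levenshtein and Hamming distance, not the library's C++ implementation. The function names are hypothetical, the error thrown for unequal lengths is an assumption, and the `1 - distance / longer length` normalization is also an assumption, chosen because it agrees with the `0.8` shown for "hello" and "hallo".

```javascript
// Reference sketch only; names and normalization are illustrative assumptions.

// Levenshtein distance with two rolling rows of the dynamic-programming table.
function levenshteinDistance(a, b) {
  const s = Array.from(a); // iterate by code point, not UTF-16 code unit
  const t = Array.from(b);
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution or match
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Hamming distance: number of differing positions, equal lengths only.
function hammingDistance(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  if (s.length !== t.length) {
    throw new RangeError("Hamming distance requires equal-length strings");
  }
  return s.reduce((count, ch, i) => count + (ch === t[i] ? 0 : 1), 0);
}

// Assumed normalization to a 0-1 score: 1 - distance / longer length.
function levenshteinSimilarity(a, b) {
  const longest = Math.max(Array.from(a).length, Array.from(b).length);
  return longest === 0 ? 1 : 1 - levenshteinDistance(a, b) / longest;
}

console.log(levenshteinDistance("kitten", "sitting")); // 3
console.log(hammingDistance("hello", "hallo")); // 1
console.log(levenshteinSimilarity("hello", "hallo")); // 0.8
```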

Batch Processing

Process multiple string pairs from a JSON file:

# pairs.json: [["hello","hallo"],["world","word"],["test","best"]]
text-similarity batch pairs.json -a levenshtein
# "hello" <-> "hallo"  =>  0.8
# "world" <-> "word"   =>  0.8
# "test"  <-> "best"   =>  0.75

text-similarity batch pairs.json -a jaccard -f json

List Algorithms

text-similarity algorithms

All CLI Options

Option	Description
`-a, --algorithm <name>`	Algorithm to use (default: `levenshtein`)
`-p, --preprocessing <mode>`	Preprocessing: `none`, `character`, `word`, `ngram`
`-i, --ignore-case`	Case-insensitive comparison
`-n, --ngram-size <size>`	N-gram size (default: `2`)
`--threshold <value>`	Early termination threshold
`--alpha <value>`	Alpha weight for Tversky index
`--beta <value>`	Beta weight for Tversky index
`--prefix-weight <value>`	Prefix weight for Jaro-Winkler (0.0–0.25)
`-f, --format <type>`	Output format: `plain` (default), `json`
`-v, --version`	Show version
`-h, --help`	Show help

Quick Start

const textSimilarity = require("text-similarity-node");

// Levenshtein Similarity (edit distance)
textSimilarity.similarity.levenshtein("hello", "hallo"); // 0.8

// Jaccard Similarity (set intersection)
textSimilarity.similarity.jaccard("hello world", "hello universe", true); // 0.33

// Cosine Similarity with different options
textSimilarity.similarity.cosine("hello", "hallo"); // 0.5 (character n-grams)
textSimilarity.similarity.cosine("hello world", "hello universe", true); // 0.5 (word-based)

// Additional algorithms
textSimilarity.similarity.jaro("hello", "hallo"); // 0.86
textSimilarity.similarity.jaroWinkler("hello", "hallo"); // 0.88
textSimilarity.similarity.dice("hello", "hallo"); // 0.5

// Distance measurements
textSimilarity.distance.levenshtein("hello", "hallo"); // 1
textSimilarity.distance.hamming("hello", "hallo"); // 1

// Unicode Support
textSimilarity.similarity.levenshtein("café", "cafe"); // 0.75
textSimilarity.similarity.jaccard("Hello 👋 World 🌍", "Hello 👋 World 🌎"); // 0.86 (different globe emoji)

// Case-insensitive comparison
textSimilarity.similarity.levenshtein("Hello", "hello", false); // 1.0

Algorithm Overview

text-similarity-node is based on the algorithm implementations in the TextDistance Python library, and its results match the reference Python version in 95% of cases. The remaining differences come from the different tokenization method used for cosine similarity.

Edit-Based Algorithms

Levenshtein Distance: Classic edit distance for spell checking and typo detection
Hamming Distance: Fixed-length string comparison for error detection
Jaro Similarity: Optimized for short strings and proper names
Jaro-Winkler: Enhanced Jaro with prefix matching bonus
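
To make the Jaro and Jaro-Winkler entries concrete, here is a plain-JavaScript sketch of the textbook definitions. It is an illustration, not the library's code. The function names are hypothetical; the `prefixWeight` and `prefixLength` defaults mirror the values shown in this document's configuration examples.

```javascript
// Textbook Jaro similarity; illustrative sketch, not the native implementation.
function jaro(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  if (s.length === 0 && t.length === 0) return 1;
  if (s.length === 0 || t.length === 0) return 0;

  // Characters match only if they are equal and close enough in position.
  const matchWindow = Math.max(0, Math.floor(Math.max(s.length, t.length) / 2) - 1);
  const sMatched = new Array(s.length).fill(false);
  const tMatched = new Array(t.length).fill(false);
  let matches = 0;
  for (let i = 0; i < s.length; i++) {
    const lo = Math.max(0, i - matchWindow);
    const hi = Math.min(t.length - 1, i + matchWindow);
    for (let j = lo; j <= hi; j++) {
      if (!tMatched[j] && s[i] === t[j]) {
        sMatched[i] = true;
        tMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Matched characters that appear in a different order count as half a transposition each.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < s.length; i++) {
    if (!sMatched[i]) continue;
    while (!tMatched[k]) k++;
    if (s[i] !== t[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;
  return (matches / s.length + matches / t.length + (matches - transpositions) / matches) / 3;
}

// Jaro-Winkler: boost the Jaro score for a shared prefix of up to prefixLength characters.
function jaroWinkler(a, b, prefixWeight = 0.1, prefixLength = 4) {
  const base = jaro(a, b);
  const s = Array.from(a);
  const t = Array.from(b);
  const limit = Math.min(prefixLength, s.length, t.length);
  let prefix = 0;
  while (prefix < limit && s[prefix] === t[prefix]) prefix++;
  return base + prefix * prefixWeight * (1 - base);
}

console.log(jaro("hello", "hallo").toFixed(2)); // "0.87"
console.log(jaroWinkler("hello", "hallo").toFixed(2)); // "0.88"
console.log(jaroWinkler("martha", "marhta").toFixed(4)); // "0.9611"
```

Keeping `prefixWeight` at or below 0.25 with a prefix length of 4 guarantees the boosted score never exceeds 1.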

Token-Based Algorithms

Jaccard Similarity: Set intersection for document similarity
Sorensen-Dice: Harmonic mean of precision and recall
Overlap Coefficient: Measures subset relationships
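
The three token-based measures differ only in how they normalize the size of the shared token set. The sketch below shows the standard set-based formulas in plain JavaScript. The helper names are hypothetical, and treating tokens as sets rather than multisets is an assumption about tokenization, not a statement about the library's internals.

```javascript
// Set-based token measures; illustrative sketch with hypothetical helper names.

// Character n-grams by code point, collected into a set.
function charNgrams(text, n = 2) {
  const cps = Array.from(text);
  const grams = new Set();
  for (let i = 0; i + n <= cps.length; i++) {
    grams.add(cps.slice(i, i + n).join(""));
  }
  return grams;
}

// Whitespace-separated words, collected into a set.
function wordSet(text) {
  return new Set(text.split(/\s+/).filter(Boolean));
}

function sharedCount(x, y) {
  let n = 0;
  for (const v of x) if (y.has(v)) n++;
  return n;
}

// Jaccard: |X ∩ Y| / |X ∪ Y|
function jaccard(x, y) {
  const both = sharedCount(x, y);
  const union = x.size + y.size - both;
  return union === 0 ? 1 : both / union;
}

// Sorensen-Dice: 2|X ∩ Y| / (|X| + |Y|)
function dice(x, y) {
  const total = x.size + y.size;
  return total === 0 ? 1 : (2 * sharedCount(x, y)) / total;
}

// Overlap coefficient: |X ∩ Y| / min(|X|, |Y|); returns 0 for an empty set by convention here.
function overlap(x, y) {
  const smaller = Math.min(x.size, y.size);
  return smaller === 0 ? 0 : sharedCount(x, y) / smaller;
}

console.log(jaccard(wordSet("hello world"), wordSet("hello universe")).toFixed(2)); // "0.33"
console.log(dice(charNgrams("hello"), charNgrams("hallo"))); // 0.5
console.log(overlap(wordSet("hello"), wordSet("hello world"))); // 1
```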

Vector-Based Algorithms

Cosine Similarity: Cosine of the angle between token-frequency vectors
Character Vectorization: Optimized frequency-based comparison
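
As a rough illustration of the vector-based approach, the sketch below counts character bigrams or words into frequency vectors and takes the cosine of the angle between them. It is plain JavaScript with hypothetical helper names; the library's own tokenization may differ, as noted in the compatibility remark above.

```javascript
// Cosine similarity over token-frequency vectors; illustrative sketch only.

// Count how often each token occurs.
function countTokens(tokens) {
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

// Character bigrams by code point (assumed default tokenization).
function bigrams(text) {
  const cps = Array.from(text);
  const grams = [];
  for (let i = 0; i + 2 <= cps.length; i++) grams.push(cps[i] + cps[i + 1]);
  return grams;
}

// cos(theta) = dot(a, b) / (|a| * |b|)
function cosine(a, b) {
  let dot = 0;
  for (const [token, n] of a) dot += n * (b.get(token) ?? 0);
  const norm = (v) => Math.sqrt([...v.values()].reduce((sum, n) => sum + n * n, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

const byBigram = cosine(countTokens(bigrams("hello")), countTokens(bigrams("hallo")));
console.log(byBigram); // 0.5

// Word order does not matter once words are counted, so this is approximately 1.
const byWord = cosine(
  countTokens("hello world".split(/\s+/)),
  countTokens("world hello".split(/\s+/)),
);
console.log(byWord.toFixed(2)); // "1.00"
```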

API Reference

Modern API

The Modern API provides comprehensive configuration options and consistent return formats:

const textSimilarity = require("text-similarity-node");

// Basic similarity calculation
const result = textSimilarity.calculateSimilarity("hello", "hallo");
console.log(result); // { success: true, value: 0.8 }

// Specify algorithm type
const result2 = textSimilarity.calculateSimilarity(
  "hello world",
  "hello universe",
  textSimilarity.AlgorithmType.JACCARD,
);
console.log(result2); // { success: true, value: 0.39 }

// Full configuration example
const result3 = textSimilarity.calculateSimilarity(
  "hello world",
  "world hello",
  textSimilarity.AlgorithmType.COSINE,
  {
    preprocessing: textSimilarity.PreprocessingMode.WORD,
    caseSensitivity: textSimilarity.CaseSensitivity.INSENSITIVE,
    ngramSize: 2,
  },
);
console.log(result3); // { success: true, value: 1.0 }

// Advanced algorithm-specific configuration
const jaroWinklerResult = textSimilarity.calculateSimilarity(
  "martha",
  "marhta",
  textSimilarity.AlgorithmType.JARO_WINKLER,
  {
    prefixWeight: 0.1,
    prefixLength: 4,
    caseSensitivity: textSimilarity.CaseSensitivity.INSENSITIVE,
  },
);

// Tversky similarity with custom weights
const tverskyResult = textSimilarity.calculateSimilarity(
  "information retrieval",
  "information extraction",
  textSimilarity.AlgorithmType.TVERSKY,
  {
    preprocessing: textSimilarity.PreprocessingMode.WORD,
    alpha: 0.8, // Weight for first string
    beta: 0.2, // Weight for second string
    caseSensitivity: textSimilarity.CaseSensitivity.INSENSITIVE,
  },
);

// Distance calculations
const distance = textSimilarity.calculateDistance(
  "kitten",
  "sitting",
  textSimilarity.AlgorithmType.LEVENSHTEIN,
);
console.log(distance); // { success: true, value: 3 }

// Batch processing
const pairs = [
  ["hello", "hallo"],
  ["world", "word"],
  ["test", "best"],
];
const batchResults = textSimilarity.calculateSimilarityBatch(
  pairs,
  textSimilarity.AlgorithmType.LEVENSHTEIN,
  { caseSensitivity: textSimilarity.CaseSensitivity.INSENSITIVE },
);
console.log(batchResults);
// [{ success: true, value: 0.8 }, { success: true, value: 0.8 }, { success: true, value: 0.75 }]

// Asynchronous API
async function example() {
  const similarity = await textSimilarity.calculateSimilarityAsync(
    "hello world",
    "hello universe",
    textSimilarity.AlgorithmType.COSINE,
    {
      preprocessing: textSimilarity.PreprocessingMode.WORD,
      caseSensitivity: textSimilarity.CaseSensitivity.INSENSITIVE,
    },
  );
  console.log(similarity); // 0.5

  const batchAsync = await textSimilarity.calculateSimilarityBatchAsync(
    pairs,
    textSimilarity.AlgorithmType.JACCARD,
  );
  console.log(batchAsync); // [0.67, 0.8, 0.6]
}

// Global configuration
textSimilarity.setGlobalConfiguration({
  preprocessing: textSimilarity.PreprocessingMode.WORD,
  caseSensitivity: textSimilarity.CaseSensitivity.INSENSITIVE,
  ngramSize: 3,
});

// All subsequent calls will use global config unless overridden
const withGlobalConfig = textSimilarity.calculateSimilarity(
  "Hello World",
  "hello world",
);
console.log(withGlobalConfig); // { success: true, value: 1.0 }

// Override global config for specific call
const overrideGlobal = textSimilarity.calculateSimilarity(
  "Hello World",
  "hello world",
  textSimilarity.AlgorithmType.LEVENSHTEIN,
  { caseSensitivity: textSimilarity.CaseSensitivity.SENSITIVE },
);
console.log(overrideGlobal); // { success: true, value: 0.82 }

// Increase max string length for large document comparison (default: 100KB)
textSimilarity.setGlobalConfiguration({
  maxStringLength: 5 * 1024 * 1024, // Allow up to 5MB strings
});

// largeDocument1 and largeDocument2 are placeholders for your own strings
const docResult = textSimilarity.calculateSimilarity(
  largeDocument1,
  largeDocument2,
  textSimilarity.AlgorithmType.EUCLIDEAN,
  { preprocessing: textSimilarity.PreprocessingMode.WORD },
);
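
The `alpha` and `beta` options used in the Tversky example weight the tokens unique to the first and second string. The sketch below shows the standard Tversky index formula in plain JavaScript over word sets. The function name is hypothetical and the word-set tokenization is an assumption; it is not the library's implementation.

```javascript
// Tversky index: |X ∩ Y| / (|X ∩ Y| + alpha * |X - Y| + beta * |Y - X|)
// Illustrative sketch; tokens are treated as a set of lowercase words.
function tverskyIndex(first, second, alpha, beta) {
  const x = new Set(first.toLowerCase().split(/\s+/).filter(Boolean));
  const y = new Set(second.toLowerCase().split(/\s+/).filter(Boolean));
  let shared = 0;
  for (const token of x) if (y.has(token)) shared++;
  const onlyFirst = x.size - shared; // tokens unique to the first string
  const onlySecond = y.size - shared; // tokens unique to the second string
  const denom = shared + alpha * onlyFirst + beta * onlySecond;
  return denom === 0 ? 1 : shared / denom;
}

// One shared word, one unique word on each side: 1 / (1 + 0.8 + 0.2)
const weighted = tverskyIndex("information retrieval", "information extraction", 0.8, 0.2);
console.log(weighted.toFixed(2)); // "0.50"

// alpha = beta = 1 reduces to Jaccard; alpha = beta = 0.5 reduces to Sorensen-Dice.
const asJaccard = tverskyIndex("hello world", "hello universe", 1, 1);
console.log(asJaccard.toFixed(2)); // "0.33"
```

Because the weights are asymmetric, swapping the two inputs can change the score whenever the strings have different numbers of unique tokens.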

Configuration Options

// Available algorithm types
textSimilarity.AlgorithmType = {
  LEVENSHTEIN: 0, // Edit distance
  DAMERAU_LEVENSHTEIN: 1, // Edit distance with transpositions
  HAMMING: 2, // Equal-length string distance
  JARO: 3, // Fuzzy string matching
  JARO_WINKLER: 4, // Jaro with prefix weighting
  JACCARD: 5, // Set similarity coefficient
  SORENSEN_DICE: 6, // Dice coefficient
  OVERLAP: 7, // Overlap coefficient
  TVERSKY: 8, // Asymmetric similarity with weights
  COSINE: 9, // Vector space cosine similarity
  EUCLIDEAN: 10, // Euclidean distance
  MANHATTAN: 11, // Manhattan distance
  CHEBYSHEV: 12, // Chebyshev distance
};

// Preprocessing modes
textSimilarity.PreprocessingMode = {
  NONE: 0, // No preprocessing
  CHARACTER: 1, // Character-level comparison
  WORD: 2, // Word-level tokenization
  NGRAM: 3, // N-gram based tokenization
};

// Case sensitivity options
textSimilarity.CaseSensitivity = {
  SENSITIVE: 0, // Case-sensitive comparison
  INSENSITIVE: 1, // Case-insensitive with Unicode support
};

// Full configuration object structure
const fullConfig = {
  algorithm: textSimilarity.AlgorithmType.COSINE, // Algorithm to use
  preprocessing: textSimilarity.PreprocessingMode.WORD, // Text processing mode
  caseSensitivity: textSimilarity.CaseSensitivity.INSENSITIVE, // Case handling
  ngramSize: 2, // N-gram size (default: 2)
  threshold: 0.5, // Early termination threshold
  alpha: 0.5, // Tversky alpha parameter
  beta: 0.5, // Tversky beta parameter
  prefixWeight: 0.1, // Jaro-Winkler prefix weight (0.0-0.25)
  prefixLength: 4, // Jaro-Winkler max prefix length
  maxStringLength: 100000, // Max input string length in bytes (default: 100000 ≈ 100KB)
};
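
To show what the preprocessing mode, case sensitivity, and `ngramSize` options plausibly control, here is a small tokenizer sketch in plain JavaScript. The `tokenize` function and its option names are hypothetical, and the exact tokenization rules used by the native code are an assumption.

```javascript
// Hypothetical tokenizer illustrating character, word, and n-gram preprocessing.
function tokenize(text, { mode = "ngram", ngramSize = 2, ignoreCase = false } = {}) {
  const input = ignoreCase ? text.toLowerCase() : text;
  if (mode === "word") {
    return input.split(/\s+/).filter(Boolean); // split on runs of whitespace
  }
  const codePoints = Array.from(input); // keeps emoji and other astral characters whole
  if (mode === "character") {
    return codePoints;
  }
  // n-gram mode: sliding window of ngramSize code points
  const grams = [];
  for (let i = 0; i + ngramSize <= codePoints.length; i++) {
    grams.push(codePoints.slice(i, i + ngramSize).join(""));
  }
  return grams;
}

console.log(tokenize("hello")); // [ 'he', 'el', 'll', 'lo' ]
console.log(tokenize("hello", { ngramSize: 3 })); // [ 'hel', 'ell', 'llo' ]
console.log(tokenize("Hello World", { mode: "word", ignoreCase: true })); // [ 'hello', 'world' ]
```

Larger n-gram sizes make the comparison stricter, because fewer windows survive a single-character change.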

Utility Functions

// Get supported algorithms
const algorithms = textSimilarity.getSupportedAlgorithms();
console.log(algorithms);
// [{ type: 0, name: 'LEVENSHTEIN' }, { type: 5, name: 'JACCARD' }, ...]

// Memory management
const memoryUsage = textSimilarity.getMemoryUsage();
console.log(`Memory usage: ${memoryUsage} bytes`);

textSimilarity.clearCaches(); // Clear internal caches

// Get current global configuration
const currentConfig = textSimilarity.getGlobalConfiguration();
console.log(currentConfig);

Convenience API

Similarity Functions

// Signatures, shown with their default values

// Edit-based algorithms
textSimilarity.similarity.levenshtein(s1, s2, caseSensitive = true)
textSimilarity.similarity.damerauLevenshtein(s1, s2, caseSensitive = true)
textSimilarity.similarity.hamming(s1, s2, caseSensitive = true)

// Jaro family (edit-based, suited to short strings and names)
textSimilarity.similarity.jaro(s1, s2, caseSensitive = true)
textSimilarity.similarity.jaroWinkler(s1, s2, caseSensitive = true, prefixWeight = 0.1)

// Token-based algorithms
textSimilarity.similarity.jaccard(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
textSimilarity.similarity.dice(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
textSimilarity.similarity.cosine(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
textSimilarity.similarity.tversky(s1, s2, alpha, beta, useWords = false, caseSensitive = true, ngramSize = 2)

Distance Functions

// Signatures, shown with their default values
textSimilarity.distance.levenshtein(s1, s2, caseSensitive = true)
textSimilarity.distance.damerauLevenshtein(s1, s2, caseSensitive = true)
textSimilarity.distance.hamming(s1, s2, caseSensitive = true)
textSimilarity.distance.euclidean(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
textSimilarity.distance.manhattan(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
textSimilarity.distance.chebyshev(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
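
The Euclidean, Manhattan, and Chebyshev distances all operate on the per-token differences between two frequency vectors. The sketch below illustrates the three norms in plain JavaScript over character-bigram counts. The helper names are hypothetical, and the bigram vectorization is an assumption based on the default `ngramSize` of 2.

```javascript
// Vector distances over bigram counts; illustrative sketch, not the native code.

// Count character bigrams by code point.
function bigramCounts(text) {
  const cps = Array.from(text);
  const counts = new Map();
  for (let i = 0; i + 2 <= cps.length; i++) {
    const gram = cps[i] + cps[i + 1];
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Absolute difference per token across the union of both vocabularies.
function differences(a, b) {
  const keys = new Set([...a.keys(), ...b.keys()]);
  return [...keys].map((k) => Math.abs((a.get(k) ?? 0) - (b.get(k) ?? 0)));
}

const manhattan = (d) => d.reduce((sum, x) => sum + x, 0); // L1 norm
const euclidean = (d) => Math.sqrt(d.reduce((sum, x) => sum + x * x, 0)); // L2 norm
const chebyshev = (d) => d.reduce((max, x) => Math.max(max, x), 0); // L-infinity norm

// "hello" and "hallo" differ in four bigrams: he, el, ha, al
const diffs = differences(bigramCounts("hello"), bigramCounts("hallo"));
console.log(manhattan(diffs)); // 4
console.log(euclidean(diffs)); // 2
console.log(chebyshev(diffs)); // 1
```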

Asynchronous API

All algorithms support async execution with worker threads:

// All similarity algorithms available in async form
await textSimilarity.async.levenshtein(s1, s2, caseSensitive);
await textSimilarity.async.jaccard(s1, s2, useWords, caseSensitive, ngramSize);
await textSimilarity.async.cosine(s1, s2, useWords, caseSensitive, ngramSize);
await textSimilarity.async.jaro(s1, s2, caseSensitive);
await textSimilarity.async.jaroWinkler(s1, s2, caseSensitive, prefixWeight);
// ... and more

Library Comparison

Algorithm Category	text-similarity-node	string-comparison	similarity
Edit-Based Algorithms
Levenshtein Distance	✅	✅	❌
Levenshtein Similarity	✅	✅	✅
Damerau-Levenshtein	✅	❌	❌
Hamming Distance	✅	❌	❌
Jaro Similarity	✅	✅	❌
Jaro-Winkler	✅	✅	❌
Token-Based Algorithms
Jaccard Similarity	✅	✅	❌
Sorensen-Dice	✅	❌	❌
Tversky Index	✅	❌	❌
Overlap Coefficient	✅	❌	❌
Cosine Similarity	✅	✅	❌
Vector-Based Algorithms
Euclidean Distance	✅	❌	❌
Manhattan Distance	✅	❌	❌
Chebyshev Distance	✅	❌	❌
Sequence-Based Algorithms
LCS (Longest Common Subsequence)	❌	✅	❌
Ratcliff-Obershelp	❌	❌	❌
Configuration & Features
Case-insensitive comparison	✅	✅	✅
Configurable n-gram sizes	✅	❌	❌
Word vs character tokenization	✅	❌	❌
Unicode normalization	✅	Partial	❌
Emoji support	✅	✅	✅
Performance & API
Native implementation (C++)	✅	❌	❌
Asynchronous API	✅	❌	❌
Worker thread support	✅	❌	❌
TypeScript definitions	✅	✅	✅
Memory optimization	✅	❌	❌

Performance Comparison

Based on the project's benchmarks, text-similarity-node shows its largest advantages in memory usage and in throughput on long strings.

Unmatched Memory Efficiency

Built with a native C++ core, text-similarity-node delivers a minimal memory footprint—ideal for memory-sensitive applications and large-scale data processing.

Jaccard Similarity: Uses just 392 bytes of heap memory, compared to over 35 KB for competitors like string-comparison (nearly 90× more).
Dice Coefficient: Allocates only 392 bytes, while alternatives require over 3 KB.

Exceptional Performance on Long Texts

text-similarity-node is optimized for long strings, outperforming JavaScript-based libraries:

For strings 70+ characters, it's nearly 6× faster than the popular similarity library.
For very long strings (1000+ characters), it's over 1000× faster, processing hundreds of thousands of operations per second while alternatives slow dramatically.

Dominant Speed in Key Algorithms

The library leads in performance for modern similarity use cases:

Jaccard Similarity: Over 5× faster than string-comparison — ideal for tag or keyword analysis.
Flexible Analysis Modes: Built-in character and word modes for Jaccard, Cosine, and Dice algorithms provide greater control over results.

Unicode Support

Comprehensive Unicode support with proper handling of:

International Characters: Latin, Cyrillic, Greek, Chinese, Japanese, Arabic
Diacritics: Proper case-insensitive matching (café ↔ CAFÉ)
Emoji: Full emoji support including complex emoji sequences
Mixed Scripts: Seamless handling of multilingual text
Normalization: Automatic Unicode normalization for accurate comparisons

// International text examples
textSimilarity.similarity.levenshtein("Москва", "москва", false); // 1.0
textSimilarity.similarity.jaccard("你好世界", "你好世间"); // 0.5
// Emoji support
textSimilarity.similarity.cosine("Hello 👋🌍", "Hello 👋🌎"); // 0.86
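
Two details of JavaScript strings explain why Unicode handling matters for these comparisons: `length` counts UTF-16 code units rather than characters, and the same visible text can have more than one encoding. The sketch below uses only standard built-ins; the `prepare` helper is hypothetical and is not part of the library's API.

```javascript
// Why code points and normalization matter; standard JavaScript built-ins only.

const emoji = "👋🌍";
console.log(emoji.length); // 4 (UTF-16 code units)
console.log(Array.from(emoji).length); // 2 (code points)

// "é" as one precomposed code point vs "e" plus a combining acute accent.
const composed = "caf\u00e9";
const decomposed = "cafe\u0301";
console.log(composed === decomposed); // false
console.log(composed === decomposed.normalize("NFC")); // true

// Hypothetical helper: normalize, optionally fold case, then split by code point.
function prepare(text, ignoreCase = false) {
  const normalized = text.normalize("NFC");
  return Array.from(ignoreCase ? normalized.toLowerCase() : normalized);
}

console.log(prepare("CAF\u00c9", true).join("") === composed); // true
console.log(prepare("Москва", true).join("")); // "москва"
```

Note that splitting by code point still separates multi-code-point emoji sequences, such as those joined with a zero-width joiner, into their parts.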

Development

Building from Source

# Install dependencies
npm install

# Build native addon
npm run build

# Run tests
npm test

Requirements

Runtime: Node.js 16.0.0+
Build Tools:
- Windows: Visual Studio Build Tools or Visual Studio
- macOS: Xcode Command Line Tools (xcode-select --install)
- Linux: build-essential package (sudo apt-get install build-essential)
Architectures: x64, ARM64
Platforms: Windows, macOS, Linux

Quick Start for Contributors

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes with tests
Run the test suite: npm test
Submit a pull request

Remember to exclude the prebuilds directory from your pull request.

License

MIT License - see LICENSE file for details.

Acknowledgments

This library was built using the TextDistance Python library as a reference implementation, which provided a solid foundation for the algorithms and features included here.

Package last updated on 15 Feb 2026
