sentence2simvecjs

Vector-based sentence similarity (0.0–1.0) + embedding export. JavaScript implementation inspired by PINTO0309/sentence2simvec.

https://github.com/user-attachments/assets/4738b015-ef68-4503-aa51-a467754d7081

Features

  • Dice's Coefficient: Fast surface-level text similarity using n-gram analysis
  • Transformer Embeddings: Semantic similarity using sentence-transformers/all-MiniLM-L6-v2
  • Embedding Cache: Pre-compute and cache embeddings for fast similarity search
  • Corpus Management: Load and search through large text collections efficiently
  • Batch Similarity: Calculate similarities against entire corpus at once
  • Benchmarking: Compare performance and accuracy between methods
  • Electron App: Built-in GUI for interactive benchmarking
  • Cross-platform: Works in Node.js and Electron (main & renderer processes)

Installation

npm install sentence2simvecjs

Usage

As a Library

const {
  diceCoefficient,
  embeddingSimilarity,
  runBenchmark,
  initializeEmbeddingModel
} = require('sentence2simvecjs');

// Simple Dice's Coefficient
const diceScore = diceCoefficient("Hello world", "Hello there");
console.log(diceScore); // 0.5

// Embedding similarity (async)
async function example() {
  // Initialize model once (optional, will auto-init on first use)
  await initializeEmbeddingModel();

  const result = await embeddingSimilarity("Hello world", "Hello there");
  console.log(result.score); // 0.7234
  console.log(result.executionTime); // 123.45 ms
}

// Run benchmark comparison
async function benchmark() {
  const result = await runBenchmark("Hello world", "Hello there", {
    ngramSize: 3,
    preloadModel: true
  });

  console.log('Dice Score:', result.diceResult.score);
  console.log('Embedding Score:', result.embeddingResult.score);
  console.log('Speed ratio:', result.embeddingResult.executionTime / result.diceResult.executionTime);
}

With Embedding Cache

const { EmbeddingCache, CorpusManager } = require('sentence2simvecjs');

// Create embedding cache
const cache = new EmbeddingCache({
  persistToDisk: true,
  cacheDir: './embeddings'
});

// Top-level await is not available in CommonJS, so wrap the calls
async function populateAndSearch() {
  // Add texts to cache
  await cache.addText('Machine learning is awesome');
  await cache.addTextsFromFile('corpus.txt');
  await cache.addTextsFromJSON('data.json', 'content');

  // Find similar texts
  const similar = await cache.findSimilar('Deep learning', 5);

  // Batch similarity calculation
  const scores = await cache.batchSimilarity('Neural networks');
}

With Corpus Manager

const corpus = new CorpusManager({
  enableDiceCache: true,
  enableEmbeddingCache: true
});

// Wrapped in an async function (top-level await is not available in CommonJS)
async function searchCorpus() {
  // Load corpus
  await corpus.loadFromFile('documents.txt');
  await corpus.addItems([
    { text: 'First document', id: 'doc1' },
    { text: 'Second document', id: 'doc2' }
  ]);

  // Search using both methods
  const results = await corpus.search('query text', 'both', 10);

  // Batch similarity for entire corpus
  const allScores = await corpus.batchSimilarity('query text', 'embedding');
}

As an Electron App

# Clone the repository
git clone https://github.com/your-username/sentence2simvecjs
cd sentence2simvecjs

# Install dependencies
npm install

# Build and run
npm start

API

diceCoefficient(text1: string, text2: string, ngramSize?: number): number

Calculate Dice's coefficient between two texts using n-grams.

  • text1, text2: Input texts to compare
  • ngramSize: Size of n-grams (default: 3)
  • Returns: Similarity score between 0.0 and 1.0
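
For reference, Dice's coefficient over two n-gram sets A and B is 2·|A∩B| / (|A| + |B|). A minimal sketch of the idea follows; it is illustrative only, and the package's own diceCoefficient may tokenize or normalize differently, so exact scores may not match.

// Minimal sketch of Dice's coefficient over character n-grams.
// Illustrative only -- not the package's internal implementation.
function diceSketch(text1, text2, ngramSize = 3) {
  const ngrams = (s) => {
    const set = new Set();
    for (let i = 0; i + ngramSize <= s.length; i++) {
      set.add(s.slice(i, i + ngramSize));
    }
    return set;
  };
  const a = ngrams(text1);
  const b = ngrams(text2);
  let shared = 0;
  for (const g of a) if (b.has(g)) shared++;
  return a.size + b.size === 0 ? 0 : (2 * shared) / (a.size + b.size);
}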

embeddingSimilarity(text1: string, text2: string): Promise<Result>

Calculate semantic similarity using transformer embeddings.

  • Returns: Object with score, embedding1, embedding2, and executionTime
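
The returned embeddings can be reused without re-embedding the same text. A small sketch, assuming the vectors are L2-normalized (as stated in the Cache Data section below), in which case cosine similarity reduces to a plain dot product:

// Sketch: reuse embeddings returned by embeddingSimilarity.
// Assumes L2-normalized vectors, so cosine similarity is a dot product.
const { embeddingSimilarity } = require('sentence2simvecjs');

const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

async function reuseEmbeddings() {
  const r1 = await embeddingSimilarity('Hello world', 'Hello there');
  const r2 = await embeddingSimilarity('Hello world', 'Good morning');
  // Compare "Hello there" vs "Good morning" without re-embedding
  console.log(dot(r1.embedding2, r2.embedding2));
}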

runBenchmark(text1: string, text2: string, options?: Options): Promise<ComparisonResult>

Run both similarity methods and compare performance.

  • options.ngramSize: N-gram size for Dice's coefficient
  • options.preloadModel: Whether to preload the transformer model

EmbeddingCache

Pre-compute and cache embeddings for fast retrieval.

  • addText(text, id?, metadata?): Add single text to cache
  • addTexts(texts): Add multiple texts
  • addTextsFromFile(filePath): Load texts from file
  • findSimilar(query, topK, threshold?): Find similar cached texts
  • batchSimilarity(query): Get all similarity scores

CorpusManager

Manage large text collections with both Dice and embedding methods.

  • addItem(text, id?, metadata?): Add text to corpus
  • loadFromFile(filePath, format): Load corpus from file
  • search(query, method, topK): Search corpus
  • batchSimilarity(query, method): Calculate all similarities

Performance

  • Dice's Coefficient: ~0.1ms per comparison
  • Transformer Embeddings: ~50-200ms per comparison (after model initialization)
  • Cached Embeddings: <1ms per comparison (after initial computation)

Initial model loading takes 1-3 seconds depending on hardware.
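
These figures vary by hardware. A rough sketch for checking the Dice path locally, using Node's built-in perf_hooks timer:

// Rough micro-benchmark sketch for the Dice path.
const { performance } = require('perf_hooks');
const { diceCoefficient } = require('sentence2simvecjs');

const t0 = performance.now();
for (let i = 0; i < 10000; i++) {
  diceCoefficient('Hello world', 'Hello there');
}
const elapsed = performance.now() - t0;
console.log(`~${(elapsed / 10000).toFixed(4)} ms per comparison`);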

Cache Storage

Storage Options

The new EmbeddingCacheV2 supports multiple storage backends:

  • File System (Node.js)
  • LocalStorage (Browser)
  • Memory (Both environments)

// File storage (Node.js)
const fileCache = new EmbeddingCacheV2({
  storageType: 'file',
  cacheDir: './embeddings'
});

// LocalStorage (Browser)
const browserCache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  storagePrefix: 'myapp_embeddings_',
  maxItems: 1000  // Limit items to prevent quota issues
});

// Memory storage (default)
const memoryCache = new EmbeddingCacheV2({
  storageType: 'memory'
});

// Custom storage adapter
const customCache = new EmbeddingCacheV2({
  storageAdapter: myCustomAdapter  // Implement StorageAdapter interface
});
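
The StorageAdapter interface itself is not documented here. Purely as a hypothetical sketch, an adapter presumably needs the basic operations the cache performs; the method names below are assumptions, not the package's actual contract, so check the source before implementing one:

// HYPOTHETICAL adapter sketch -- method names are assumptions, not the
// package's documented StorageAdapter contract.
const myCustomAdapter = {
  store: new Map(),
  async get(key) { return this.store.get(key); },
  async set(key, value) { this.store.set(key, value); },
  async remove(key) { this.store.delete(key); },
  async clear() { this.store.clear(); }
};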

Browser LocalStorage Example

<script type="module">
// Note: a bundler or import map is needed to resolve this bare specifier
import { EmbeddingCacheV2, initializeEmbeddingModel } from 'sentence2simvecjs';

async function setupBrowserCache() {
  await initializeEmbeddingModel();

  const cache = new EmbeddingCacheV2({
    storageType: 'localStorage',
    storagePrefix: 'embeddings_',
    maxItems: 500  // Prevent localStorage quota exceeded
  });

  // Add texts
  await cache.addText('Example text');

  // Find similar
  const results = await cache.findSimilar('Query text', 5);

  // Check storage usage
  const info = await cache.getStorageInfo();
  console.log(`Using ${info.estimatedSize / 1024}KB of localStorage`);
}
</script>

Legacy Cache (File-only)

The original EmbeddingCache still works for backward compatibility:

// Original file-based cache
const cache = new EmbeddingCache({
  persistToDisk: true,
  cacheDir: '/path/to/my/cache'
});

Cache File Format

The cache is stored as JSON with the following structure:

[
  {
    "id": "unique_id",
    "text": "Original text",
    "embedding": [0.123, -0.456, ...],  // 384-dimensional array
    "metadata": { /* optional metadata */ }
  }
]

Cache Management

// Clear all cache (works with all storage types)
await cache.clear();  // Removes all cached embeddings

// Remove specific item
await cache.remove('specific_id');

// Export/Import (works with all storage types)
const jsonData = await cache.exportToJSON();
await cache.importFromJSON(jsonData);

// Check storage info
const info = await cache.getStorageInfo();
console.log(`Storage type: ${info.type}`);
console.log(`Items: ${info.itemCount}`);
console.log(`Size: ${info.estimatedSize} bytes`);

Clearing Cache Safely

The clear() method removes all cached embeddings:

  • LocalStorage: Only removes items with the specified prefix
  • File System: Deletes the cache directory contents
  • Memory: Clears the in-memory Map

// LocalStorage example - only clears items with 'myapp_' prefix
const cache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  storagePrefix: 'myapp_'  // Only 'myapp_*' keys will be cleared
});

await cache.clear();  // Other localStorage data remains untouched

// Confirm deletion
const remaining = await cache.size();
console.log(`Items after clear: ${remaining}`);  // Should be 0

Storage Limitations

  • LocalStorage: ~5-10MB limit in most browsers
  • File System: Limited by disk space
  • Memory: Limited by available RAM

Use maxItems option to prevent storage overflow:

const cache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  maxItems: 500  // Automatically removes oldest items
});

Model Storage in Browser

When using @xenova/transformers in the browser, the model files are stored separately from your embedding cache:

Where Models are Stored

  • Location: Browser's Cache Storage API (not localStorage)
  • Path: Accessible via DevTools → Application → Cache Storage → transformers-cache
  • Size: ~25MB for all-MiniLM-L6-v2 model
  • Persistence: Survives page reloads, cleared with browser cache

Viewing Cached Models

  1. Open DevTools (F12)
  2. Go to the Application (Chrome) or Storage (Firefox) tab
  3. Expand "Cache Storage"
  4. Look for transformers-cache or similar

Model Cache vs Embedding Cache

  • Model Cache: Stores the AI model files (Cache Storage API)
  • Embedding Cache: Stores computed embeddings (localStorage/file/memory)

Clearing Model Cache

// Clear transformer model cache
caches.keys().then(names => {
  names.forEach(name => {
    if (name.includes('transformers')) {
      caches.delete(name);
    }
  });
});

// Clear embedding cache (your computed results)
await cache.clear();

Browser Usage

To use in a browser environment:

  1. Build the browser bundle:
npm run build:browser
  2. Serve the files using a local server (to avoid CORS issues):
npm run serve
# Or use any static file server
  3. Access the test pages:
    • Dice coefficient only: http://localhost:8000/src/browser/test-dice-only.html
    • Full test with embeddings: http://localhost:8000/src/browser/test-localstorage.html

Test Page Usage Guide

Note: The embedding model initialization may take 10-30 seconds on first load as it downloads the model files (~25MB) from Hugging Face. The Dice-only test page works immediately without any model download.

The test page provides an interactive interface to test the LocalStorage cache functionality:

1. Add Text to Cache

  • Text input: Enter any sentence or paragraph you want to cache
    • Example: "Machine learning is a subset of artificial intelligence"
  • Optional ID: Provide a custom ID, or leave blank for auto-generated ID
    • Example: "ml_definition"

2. Bulk Add

  • Add multiple texts at once (one per line):
Natural language processing enables computers to understand text
Deep learning models can learn complex patterns
Neural networks are inspired by the human brain
JavaScript is a programming language for web development
React is a library for building user interfaces

3. Find Similar

  • Enter a query to find similar cached texts:
    • Example: "AI and machine learning"
    • Example: "Web development frameworks"
  • Shows top 5 most similar texts with similarity scores (0.0-1.0)

4. Cache Data

  • Export Format: JSON file containing all cached embeddings
  • File Structure:
    [
      {
        "id": "text_1077264583",        // Unique identifier (auto-generated or custom)
        "text": "こんにちは",              // Original text
        "embedding": [                   // 384-dimensional vector from all-MiniLM-L6-v2
          -0.10119643807411194,
          // ... (382 more values)
          -0.008699539117515087
        ],
        "timestamp": 1753166234369       // Unix timestamp when cached
      },
      {
        "id": "text_1712359701",
        "text": "はじめまして",
        "embedding": [
          -0.031796280294656754,
          // ... (382 more values)
          -0.005393804516643286
        ],
        "timestamp": 1753166261449
      },
      {
        "id": "text_6942345",
        "text": "今日はいい天気ですね。",
        "embedding": [
          0.03111492656171322,
          // ... (382 more values)
          -0.012813657522201538
        ],
        "timestamp": 1753166295569
      },
      {
        "id": "text_2137068100",
        "text": "Hello.",
        "embedding": [
          -0.09045851230621338,
          // ... (382 more values)
          0.015684669837355614
        ],
        "timestamp": 1753167371990
      },
      {
        "id": "text_1654144361",
        "text": "Hello. Good morning.",
        "embedding": [
          -0.025240488350391388,
          // ... (382 more values)
          0.00397441117092967
        ],
        "timestamp": 1753167383761
      }
    ]
    
  • Field Descriptions:
    • id: Unique identifier for each cached text
      • Auto-generated format: text_[hash] (e.g., "text_1077264583")
      • Custom format: User-provided ID (e.g., "ml_definition")
    • text: The original text string that was embedded
    • embedding: 384-dimensional Float32Array from all-MiniLM-L6-v2 model
      • Normalized vector (L2 norm = 1.0)
      • Used for cosine similarity calculations
    • timestamp: Unix timestamp (milliseconds since epoch)
      • Used for cache management (oldest items removed when maxItems is reached)
  • File Size: Approximately 3-4KB per cached text (including JSON overhead)
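
Because the export is plain JSON with normalized vectors, it can also be processed outside the library. A sketch for Node.js, where cache-export.json is a hypothetical filename for the export described above:

// Sketch: offline similarity over an exported cache file.
// 'cache-export.json' is a hypothetical filename for the export above.
const fs = require('fs');

const items = JSON.parse(fs.readFileSync('cache-export.json', 'utf8'));
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

// Vectors are L2-normalized, so the dot product is the cosine similarity
console.log(`${items[0].text} vs ${items[1].text}:`,
  dot(items[0].embedding, items[1].embedding).toFixed(4));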

5. Storage Management

  • Storage Info: Shows current storage usage and item count
  • Show/Hide Cached Texts: Toggle button to display all cached texts with their IDs
    • Displays in a scrollable list (max height: 200px)
    • Shows ID and full text for each cached item
    • Automatically updates when texts are added/removed
  • Clear Cache: Removes all cached embeddings (with confirmation)
  • Export/Import: Save cache to JSON file or load from file

6. Performance Metrics

  • Search Time Display: Shows processing time in milliseconds for each search
    • Format: "Similar Texts (Found in XX.XXms):"
    • Measures the complete findSimilar execution time
    • Helps understand the performance benefit of cached embeddings

Example Workflow:

  1. Add several texts using "Bulk Add" (copy the example above)
  2. Click "Show Cached Texts" to view all stored items
  3. Search for "artificial intelligence" to find AI-related texts
     • Note the search time (e.g., "Found in 23.45ms")
  4. Search for "programming" to find coding-related texts
  5. Export your cache to save the embeddings
  6. Clear cache and import to restore

Including in Your Web Page

<script src="path/to/sentence2simvecjs/dist/browser.js"></script>
<script>
  const { EmbeddingCacheV2, initializeEmbeddingModel } = window.sentence2simvecjs;

  async function init() {
    await initializeEmbeddingModel();
    const cache = new EmbeddingCacheV2({
      storageType: 'localStorage'
    });
    // Use the cache...
  }
</script>

Note: Direct file:// access will cause CORS errors. Always serve through HTTP/HTTPS.
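
Any static file server works for this, not just npm run serve. Two common alternatives, assuming Node or Python is installed:

# Alternatives to `npm run serve` -- any static file server will do
npx http-server -p 8000
# or
python3 -m http.server 8000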

OffscreenCanvas Visualization

This library includes high-performance visualization components using OffscreenCanvas and Web Workers for non-blocking rendering.

Features

  • OffscreenCanvas Rendering: Moves canvas operations to Web Worker threads
  • Non-blocking UI: Heavy visualizations don't freeze the main thread
  • Multiple Chart Types:
    • Heatmap: Similarity matrix visualization
    • Bar Chart: Performance comparison (Dice vs Embedding times)
    • Scatter Plot: Score correlation analysis

Usage

import { SimilarityVisualization } from 'sentence2simvecjs/renderer';

// In your React component
<SimilarityVisualization
  data={benchmarkResults}
  type="heatmap"  // or "barchart" or "scatter"
  width={600}
  height={400}
  title="Similarity Matrix"
/>

Browser Support

OffscreenCanvas is supported in:

  • Chrome 69+
  • Firefox 105+
  • Edge 79+
  • Safari 16.4+ (partial support)

The visualization component automatically falls back to main thread rendering for unsupported browsers.
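
The exact check the component makes is not shown here, but OffscreenCanvas feature detection generally looks like the sketch below; worker is a hypothetical Worker instance you would create for rendering.

// Sketch: detect OffscreenCanvas support before handing a canvas to a Worker.
const supportsOffscreen =
  typeof OffscreenCanvas !== 'undefined' &&
  typeof HTMLCanvasElement.prototype.transferControlToOffscreen === 'function';

if (supportsOffscreen) {
  const canvas = document.querySelector('canvas');
  const offscreen = canvas.transferControlToOffscreen();
  // 'worker' is a hypothetical Worker instance that does the rendering
  worker.postMessage({ canvas: offscreen }, [offscreen]);
} else {
  // Fall back to rendering on the main thread
}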

Test Page

To test OffscreenCanvas visualization:

npm run serve
# Navigate to http://localhost:8000/src/browser/test-offscreencanvas.html

The test page includes:

  • Browser compatibility check
  • Interactive visualization demos
  • Performance benchmarking
  • Stress testing with 1000+ data points

Performance Benefits

Using OffscreenCanvas provides:

  • 60fps UI: Main thread remains responsive during heavy rendering
  • Parallel Processing: Multiple visualizations can render simultaneously
  • Better UX: No freezing when processing large datasets
  • Scalability: Handle thousands of data points smoothly

License

Apache-2.0

Credits

Inspired by PINTO0309/sentence2simvec

