sentence2simvecjs

Vector-based sentence similarity (0.0–1.0) + embedding export. JavaScript implementation inspired by PINTO0309/sentence2simvec.

https://github.com/user-attachments/assets/4738b015-ef68-4503-aa51-a467754d7081

Features

  • Dice's Coefficient: Fast surface-level text similarity using n-gram analysis
  • Transformer Embeddings: Semantic similarity using sentence-transformers/all-MiniLM-L6-v2
  • Embedding Cache: Pre-compute and cache embeddings for fast similarity search
  • Corpus Management: Load and search through large text collections efficiently
  • Batch Similarity: Calculate similarities against entire corpus at once
  • Benchmarking: Compare performance and accuracy between methods
  • Electron App: Built-in GUI for interactive benchmarking
  • Cross-platform: Works in Node.js and Electron (main & renderer processes)

Installation

npm install sentence2simvecjs

Usage

As a Library

const {
  diceCoefficient,
  embeddingSimilarity,
  runBenchmark,
  initializeEmbeddingModel
} = require('sentence2simvecjs');

// Simple Dice's Coefficient
const diceScore = diceCoefficient("Hello world", "Hello there");
console.log(diceScore); // 0.5

// Embedding similarity (async)
async function example() {
  // Initialize model once (optional, will auto-init on first use)
  await initializeEmbeddingModel();

  const result = await embeddingSimilarity("Hello world", "Hello there");
  console.log(result.score); // 0.7234
  console.log(result.executionTime); // 123.45 ms
}

// Run benchmark comparison
async function benchmark() {
  const result = await runBenchmark("Hello world", "Hello there", {
    ngramSize: 3,
    preloadModel: true
  });

  console.log('Dice Score:', result.diceResult.score);
  console.log('Embedding Score:', result.embeddingResult.score);
  console.log('Speed ratio:', result.embeddingResult.executionTime / result.diceResult.executionTime);
}

With Embedding Cache

const { EmbeddingCache, CorpusManager } = require('sentence2simvecjs');

// Create embedding cache
const cache = new EmbeddingCache({
  persistToDisk: true,
  cacheDir: './embeddings'
});

// Top-level await is not available in CommonJS, so wrap the calls
async function populateAndSearch() {
  // Add texts to cache
  await cache.addText('Machine learning is awesome');
  await cache.addTextsFromFile('corpus.txt');
  await cache.addTextsFromJSON('data.json', 'content');

  // Find similar texts
  const similar = await cache.findSimilar('Deep learning', 5);

  // Batch similarity calculation
  const scores = await cache.batchSimilarity('Neural networks');
}

With Corpus Manager

const corpus = new CorpusManager({
  enableDiceCache: true,
  enableEmbeddingCache: true
});

// Wrapped in an async function (top-level await is not available in CommonJS)
async function searchCorpus() {
  // Load corpus
  await corpus.loadFromFile('documents.txt');
  await corpus.addItems([
    { text: 'First document', id: 'doc1' },
    { text: 'Second document', id: 'doc2' }
  ]);

  // Search using both methods
  const results = await corpus.search('query text', 'both', 10);

  // Batch similarity for entire corpus
  const allScores = await corpus.batchSimilarity('query text', 'embedding');
}

As an Electron App

# Clone the repository
git clone https://github.com/your-username/sentence2simvecjs
cd sentence2simvecjs

# Install dependencies
npm install

# Build and run
npm start

API

diceCoefficient(text1: string, text2: string, ngramSize?: number): number

Calculate Dice's coefficient between two texts using n-grams.

  • text1, text2: Input texts to compare
  • ngramSize: Size of n-grams (default: 3)
  • Returns: Similarity score between 0.0 and 1.0
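
For reference, Dice's coefficient over two n-gram sets A and B is 2·|A∩B| / (|A| + |B|). A minimal sketch of the idea follows; it is illustrative only, and the package's own diceCoefficient may tokenize or normalize differently, so exact scores may not match.

// Minimal sketch of Dice's coefficient over character n-grams.
// Illustrative only -- not the package's internal implementation.
function diceSketch(text1, text2, ngramSize = 3) {
  const ngrams = (s) => {
    const set = new Set();
    for (let i = 0; i + ngramSize <= s.length; i++) {
      set.add(s.slice(i, i + ngramSize));
    }
    return set;
  };
  const a = ngrams(text1);
  const b = ngrams(text2);
  let shared = 0;
  for (const g of a) if (b.has(g)) shared++;
  return a.size + b.size === 0 ? 0 : (2 * shared) / (a.size + b.size);
}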

embeddingSimilarity(text1: string, text2: string): Promise<Result>

Calculate semantic similarity using transformer embeddings.

  • Returns: Object with score, embedding1, embedding2, and executionTime
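
The returned embeddings can be reused without re-embedding the same text. A small sketch, assuming the vectors are L2-normalized (as stated in the Cache Data section below), in which case cosine similarity reduces to a plain dot product:

// Sketch: reuse embeddings returned by embeddingSimilarity.
// Assumes L2-normalized vectors, so cosine similarity is a dot product.
const { embeddingSimilarity } = require('sentence2simvecjs');

const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

async function reuseEmbeddings() {
  const r1 = await embeddingSimilarity('Hello world', 'Hello there');
  const r2 = await embeddingSimilarity('Hello world', 'Good morning');
  // Compare "Hello there" vs "Good morning" without re-embedding
  console.log(dot(r1.embedding2, r2.embedding2));
}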

runBenchmark(text1: string, text2: string, options?: Options): Promise<ComparisonResult>

Run both similarity methods and compare performance.

  • options.ngramSize: N-gram size for Dice's coefficient
  • options.preloadModel: Whether to preload the transformer model

EmbeddingCache

Pre-compute and cache embeddings for fast retrieval.

  • addText(text, id?, metadata?): Add single text to cache
  • addTexts(texts): Add multiple texts
  • addTextsFromFile(filePath): Load texts from file
  • findSimilar(query, topK, threshold?): Find similar cached texts
  • batchSimilarity(query): Get all similarity scores

CorpusManager

Manage large text collections with both Dice and embedding methods.

  • addItem(text, id?, metadata?): Add text to corpus
  • loadFromFile(filePath, format): Load corpus from file
  • search(query, method, topK): Search corpus
  • batchSimilarity(query, method): Calculate all similarities

Performance

  • Dice's Coefficient: ~0.1ms per comparison
  • Transformer Embeddings: ~50-200ms per comparison (after model initialization)
  • Cached Embeddings: <1ms per comparison (after initial computation)

Initial model loading takes 1-3 seconds depending on hardware.
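
These figures vary by hardware. A rough sketch for checking the Dice path locally, using Node's built-in perf_hooks timer:

// Rough micro-benchmark sketch for the Dice path.
const { performance } = require('perf_hooks');
const { diceCoefficient } = require('sentence2simvecjs');

const t0 = performance.now();
for (let i = 0; i < 10000; i++) {
  diceCoefficient('Hello world', 'Hello there');
}
const elapsed = performance.now() - t0;
console.log(`~${(elapsed / 10000).toFixed(4)} ms per comparison`);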

Cache Storage

Storage Options

The new EmbeddingCacheV2 supports multiple storage backends:

  • File System (Node.js)
  • LocalStorage (Browser)
  • Memory (Both environments)

// File storage (Node.js)
const fileCache = new EmbeddingCacheV2({
  storageType: 'file',
  cacheDir: './embeddings'
});

// LocalStorage (Browser)
const browserCache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  storagePrefix: 'myapp_embeddings_',
  maxItems: 1000  // Limit items to prevent quota issues
});

// Memory storage (default)
const memoryCache = new EmbeddingCacheV2({
  storageType: 'memory'
});

// Custom storage adapter
const customCache = new EmbeddingCacheV2({
  storageAdapter: myCustomAdapter  // Implement StorageAdapter interface
});
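
The StorageAdapter interface itself is not documented here. Purely as a hypothetical sketch, an adapter presumably needs the basic operations the cache performs; the method names below are assumptions, not the package's actual contract, so check the source before implementing one:

// HYPOTHETICAL adapter sketch -- method names are assumptions, not the
// package's documented StorageAdapter contract.
const myCustomAdapter = {
  store: new Map(),
  async get(key) { return this.store.get(key); },
  async set(key, value) { this.store.set(key, value); },
  async remove(key) { this.store.delete(key); },
  async clear() { this.store.clear(); }
};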

Browser LocalStorage Example

<script type="module">
// Note: a bundler or import map is needed to resolve this bare specifier
import { EmbeddingCacheV2, initializeEmbeddingModel } from 'sentence2simvecjs';

async function setupBrowserCache() {
  await initializeEmbeddingModel();

  const cache = new EmbeddingCacheV2({
    storageType: 'localStorage',
    storagePrefix: 'embeddings_',
    maxItems: 500  // Prevent localStorage quota exceeded
  });

  // Add texts
  await cache.addText('Example text');

  // Find similar
  const results = await cache.findSimilar('Query text', 5);

  // Check storage usage
  const info = await cache.getStorageInfo();
  console.log(`Using ${info.estimatedSize / 1024}KB of localStorage`);
}
</script>

Legacy Cache (File-only)

The original EmbeddingCache still works for backward compatibility:

// Original file-based cache
const cache = new EmbeddingCache({
  persistToDisk: true,
  cacheDir: '/path/to/my/cache'
});

Cache File Format

The cache is stored as JSON with the following structure:

[
  {
    "id": "unique_id",
    "text": "Original text",
    "embedding": [0.123, -0.456, ...],  // 384-dimensional array
    "metadata": { /* optional metadata */ }
  }
]

Cache Management

// Clear all cache (works with all storage types)
await cache.clear();  // Removes all cached embeddings

// Remove specific item
await cache.remove('specific_id');

// Export/Import (works with all storage types)
const jsonData = await cache.exportToJSON();
await cache.importFromJSON(jsonData);

// Check storage info
const info = await cache.getStorageInfo();
console.log(`Storage type: ${info.type}`);
console.log(`Items: ${info.itemCount}`);
console.log(`Size: ${info.estimatedSize} bytes`);

Clearing Cache Safely

The clear() method removes all cached embeddings:

  • LocalStorage: Only removes items with the specified prefix
  • File System: Deletes the cache directory contents
  • Memory: Clears the in-memory Map

// LocalStorage example - only clears items with 'myapp_' prefix
const cache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  storagePrefix: 'myapp_'  // Only 'myapp_*' keys will be cleared
});

await cache.clear();  // Other localStorage data remains untouched

// Confirm deletion
const remaining = await cache.size();
console.log(`Items after clear: ${remaining}`);  // Should be 0

Storage Limitations

  • LocalStorage: ~5-10MB limit in most browsers
  • File System: Limited by disk space
  • Memory: Limited by available RAM

Use maxItems option to prevent storage overflow:

const cache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  maxItems: 500  // Automatically removes oldest items
});

Model Storage in Browser

When using @xenova/transformers in the browser, the model files are stored separately from your embedding cache:

Where Models are Stored

  • Location: Browser's Cache Storage API (not localStorage)
  • Path: Accessible via DevTools → Application → Cache Storage → transformers-cache
  • Size: ~25MB for all-MiniLM-L6-v2 model
  • Persistence: Survives page reloads, cleared with browser cache

Viewing Cached Models

  1. Open DevTools (F12)
  2. Go to the Application (Chrome) or Storage (Firefox) tab
  3. Expand "Cache Storage"
  4. Look for transformers-cache or similar

Model Cache vs Embedding Cache

  • Model Cache: Stores the AI model files (Cache Storage API)
  • Embedding Cache: Stores computed embeddings (localStorage/file/memory)

Clearing Model Cache

// Clear transformer model cache
caches.keys().then(names => {
  names.forEach(name => {
    if (name.includes('transformers')) {
      caches.delete(name);
    }
  });
});

// Clear embedding cache (your computed results)
await cache.clear();

Browser Usage

To use in a browser environment:

  1. Build the browser bundle:
npm run build:browser
  2. Serve the files using a local server (to avoid CORS issues):
npm run serve
# Or use any static file server
  3. Access the test pages:
    • Dice coefficient only: http://localhost:8000/src/browser/test-dice-only.html
    • Full test with embeddings: http://localhost:8000/src/browser/test-localstorage.html

Test Page Usage Guide

Note: The embedding model initialization may take 10-30 seconds on first load as it downloads the model files (~25MB) from Hugging Face. The Dice-only test page works immediately without any model download.

The test page provides an interactive interface to test the LocalStorage cache functionality:

1. Add Text to Cache

  • Text input: Enter any sentence or paragraph you want to cache
    • Example: "Machine learning is a subset of artificial intelligence"
  • Optional ID: Provide a custom ID, or leave blank for auto-generated ID
    • Example: "ml_definition"

2. Bulk Add

  • Add multiple texts at once (one per line):
Natural language processing enables computers to understand text
Deep learning models can learn complex patterns
Neural networks are inspired by the human brain
JavaScript is a programming language for web development
React is a library for building user interfaces

3. Find Similar

  • Enter a query to find similar cached texts:
    • Example: "AI and machine learning"
    • Example: "Web development frameworks"
  • Shows top 5 most similar texts with similarity scores (0.0-1.0)

4. Cache Data

  • Export Format: JSON file containing all cached embeddings
  • File Structure:
    [
      {
        "id": "text_1077264583",        // Unique identifier (auto-generated or custom)
        "text": "こんにちは",              // Original text
        "embedding": [                   // 384-dimensional vector from all-MiniLM-L6-v2
          -0.10119643807411194,
          // ... (382 more values)
          -0.008699539117515087
        ],
        "timestamp": 1753166234369       // Unix timestamp when cached
      },
      {
        "id": "text_1712359701",
        "text": "はじめまして",
        "embedding": [
          -0.031796280294656754,
          // ... (382 more values)
          -0.005393804516643286
        ],
        "timestamp": 1753166261449
      },
      {
        "id": "text_6942345",
        "text": "今日はいい天気ですね。",
        "embedding": [
          0.03111492656171322,
          // ... (382 more values)
          -0.012813657522201538
        ],
        "timestamp": 1753166295569
      },
      {
        "id": "text_2137068100",
        "text": "Hello.",
        "embedding": [
          -0.09045851230621338,
          // ... (382 more values)
          0.015684669837355614
        ],
        "timestamp": 1753167371990
      },
      {
        "id": "text_1654144361",
        "text": "Hello. Good morning.",
        "embedding": [
          -0.025240488350391388,
          // ... (382 more values)
          0.00397441117092967
        ],
        "timestamp": 1753167383761
      }
    ]
    
  • Field Descriptions:
    • id: Unique identifier for each cached text
      • Auto-generated format: text_[hash] (e.g., "text_1077264583")
      • Custom format: User-provided ID (e.g., "ml_definition")
    • text: The original text string that was embedded
    • embedding: 384-dimensional Float32Array from all-MiniLM-L6-v2 model
      • Normalized vector (L2 norm = 1.0)
      • Used for cosine similarity calculations
    • timestamp: Unix timestamp (milliseconds since epoch)
      • Used for cache management (oldest items removed when maxItems is reached)
  • File Size: Approximately 3-4KB per cached text (including JSON overhead)
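
Because the export is plain JSON with normalized vectors, it can also be processed outside the library. A sketch for Node.js, where cache-export.json is a hypothetical filename for the export described above:

// Sketch: offline similarity over an exported cache file.
// 'cache-export.json' is a hypothetical filename for the export above.
const fs = require('fs');

const items = JSON.parse(fs.readFileSync('cache-export.json', 'utf8'));
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

// Vectors are L2-normalized, so the dot product is the cosine similarity
console.log(`${items[0].text} vs ${items[1].text}:`,
  dot(items[0].embedding, items[1].embedding).toFixed(4));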

5. Storage Management

  • Storage Info: Shows current storage usage and item count
  • Show/Hide Cached Texts: Toggle button to display all cached texts with their IDs
    • Displays in a scrollable list (max height: 200px)
    • Shows ID and full text for each cached item
    • Automatically updates when texts are added/removed
  • Clear Cache: Removes all cached embeddings (with confirmation)
  • Export/Import: Save cache to JSON file or load from file

6. Performance Metrics

  • Search Time Display: Shows processing time in milliseconds for each search
    • Format: "Similar Texts (Found in XX.XXms):"
    • Measures the complete findSimilar execution time
    • Helps understand the performance benefit of cached embeddings

Example Workflow:

  1. Add several texts using "Bulk Add" (copy the example above)
  2. Click "Show Cached Texts" to view all stored items
  3. Search for "artificial intelligence" to find AI-related texts
     • Note the search time (e.g., "Found in 23.45ms")
  4. Search for "programming" to find coding-related texts
  5. Export your cache to save the embeddings
  6. Clear cache and import to restore

Including in Your Web Page

<script src="path/to/sentence2simvecjs/dist/browser.js"></script>
<script>
  const { EmbeddingCacheV2, initializeEmbeddingModel } = window.sentence2simvecjs;

  async function init() {
    await initializeEmbeddingModel();
    const cache = new EmbeddingCacheV2({
      storageType: 'localStorage'
    });
    // Use the cache...
  }
</script>

Note: Direct file:// access will cause CORS errors. Always serve through HTTP/HTTPS.
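
Any static file server works for this, not just npm run serve. Two common alternatives, assuming Node or Python is installed:

# Alternatives to `npm run serve` -- any static file server will do
npx http-server -p 8000
# or
python3 -m http.server 8000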

OffscreenCanvas Visualization

This library includes high-performance visualization components using OffscreenCanvas and Web Workers for non-blocking rendering.

Features

  • OffscreenCanvas Rendering: Moves canvas operations to Web Worker threads
  • Non-blocking UI: Heavy visualizations don't freeze the main thread
  • Multiple Chart Types:
    • Heatmap: Similarity matrix visualization
    • Bar Chart: Performance comparison (Dice vs Embedding times)
    • Scatter Plot: Score correlation analysis

Usage

import { SimilarityVisualization } from 'sentence2simvecjs/renderer';

// In your React component
<SimilarityVisualization
  data={benchmarkResults}
  type="heatmap"  // or "barchart" or "scatter"
  width={600}
  height={400}
  title="Similarity Matrix"
/>

Browser Support

OffscreenCanvas is supported in:

  • Chrome 69+
  • Firefox 105+
  • Edge 79+
  • Safari 16.4+ (partial support)

The visualization component automatically falls back to main thread rendering for unsupported browsers.
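
The exact check the component makes is not shown here, but OffscreenCanvas feature detection generally looks like the sketch below; worker is a hypothetical Worker instance you would create for rendering.

// Sketch: detect OffscreenCanvas support before handing a canvas to a Worker.
const supportsOffscreen =
  typeof OffscreenCanvas !== 'undefined' &&
  typeof HTMLCanvasElement.prototype.transferControlToOffscreen === 'function';

if (supportsOffscreen) {
  const canvas = document.querySelector('canvas');
  const offscreen = canvas.transferControlToOffscreen();
  // 'worker' is a hypothetical Worker instance that does the rendering
  worker.postMessage({ canvas: offscreen }, [offscreen]);
} else {
  // Fall back to rendering on the main thread
}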

Test Page

To test OffscreenCanvas visualization:

npm run serve
# Navigate to http://localhost:8000/src/browser/test-offscreencanvas.html

The test page includes:

  • Browser compatibility check
  • Interactive visualization demos
  • Performance benchmarking
  • Stress testing with 1000+ data points

Performance Benefits

Using OffscreenCanvas provides:

  • 60fps UI: Main thread remains responsive during heavy rendering
  • Parallel Processing: Multiple visualizations can render simultaneously
  • Better UX: No freezing when processing large datasets
  • Scalability: Handle thousands of data points smoothly

License

Apache-2.0

Credits

Inspired by PINTO0309/sentence2simvec

