Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

semantic-chunking

Package Overview
Dependencies
Maintainers
1
Versions
30
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

semantic-chunking

semantically create chunks from large text (useful for passing to LLM workflows)

  • 1.0.0
  • Source
  • npm
  • Socket score

Version published
Weekly downloads
110
decreased by-81.54%
Maintainers
1
Weekly downloads
 
Created
Source

🍱 semantic-chunking

semantically create chunks from large text (useful for passing to LLM workflows)


Install:

  • npm install semantic-chunking

Usage

import { chunkit } from 'semantic-chunking';

let text = "some long text...";
let myChunks = chunkit(text);

All Parameters

chunkit(
    text,
    { // options object
        logging,
        maxTokenSize,
        similarityThreshold,
        onnxEmbeddingModel,
        onnxEmbeddingModelQuantized,
        combineSimilarityChunks
    }
)
  • text
    full string to split into chunks

  • options object [optional]

    • logging [optional | boolean | default false ]
    • maxTokenSize [optional | int | default 500]
      max possible token size of each chunk
    • similarityThreshold [optional | float | default .567]
      threshold value used to determin if paired sentences are semantically close enough to be included in the same chunk
    • onnxEmbeddingModel [optional | string | default Xenova/paraphrase-multilingual-MiniLM-L12-v2]
      ONNX model to use for creating embeddings for similarity comparison (model name on huggingface)
    • onnxEmbeddingModelQuantized [optional | boolean | default true]
      if the quantized version of the model should be used
    • combineSimilarityChunks [optional | boolean | default true]
      if true, the initial round of smaller semantic similar chunks are combined to make larger chunks up to the defined max token limit

Workflow

  • text is split into an array of sentences
  • a vector is created for each sentence
  • a cosine similarity score is created for each sentence pair
  • each sentence is added to a chunk until the similarity threshold or max token size for the chunk is exceeded
  • after all similary chunks are created combine similary chunks into large chunks up to the max token size unless the combineSimilaryityChunks was set to false

Example Calls

import { chunkit } from 'semantic-chunking';

const text = await fs.promises.readFile('./example.txt', 'utf8');
let myChunks = await chunkit(text, { logging: true, similarityThreshold: .9 });

myChunks.forEach((chunk, index) => {
    console.log("--------------------");
    console.log("Chunk " + (index + 1));
    console.log("--------------------");
    console.log(chunk);
    console.log("\n\n");
});

import { chunkit } from 'semantic-chunking';

let frogText = "A frog hops into a deli and croaks to the cashier, \"I'll have a sandwich, please.\" The cashier, surprised, quickly makes the sandwich and hands it over. The frog takes a big bite, looks around, and then asks, \"Do you have any flies to go with this?\" The cashier, taken aback, replies, \"Sorry, we're all out of flies today.\" The frog shrugs and continues munching on its sandwich, clearly unfazed by the lack of fly toppings. Just another day in the life of a sandwich-loving amphibian! 🐸🥪";

let myFrogChunks = await chunkit(frogText, { maxTokenSize: 65 });
console.log("myFrogChunks", myFrogChunks);

Keywords

FAQs

Package last updated on 29 Feb 2024

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc