semantic-chunking
An npm package for semantically creating chunks from large texts. Useful for workflows involving large language models (LLMs).
How it works
npm install semantic-chunking
Basic usage:
import { chunkit } from 'semantic-chunking';
const documents = [
{ document_name: "document1", document_text: "contents of document 1..." },
{ document_name: "document2", document_text: "contents of document 2..." },
...
];
const chunkitOptions = {};
const myChunks = await chunkit(documents, chunkitOptions);
NOTE 🚨 The embedding model (onnxEmbeddingModel) will be downloaded to this package's cache directory the first time it is run (file size will depend on the specified model; see the models table below).
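If you want the downloaded model files kept in a specific location, here is a minimal sketch (illustrative path only) using the modelCacheDir option described below:

import { chunkit } from 'semantic-chunking';

// Illustrative only: cache downloaded ONNX model files in ./models instead of the default cache directory
const documents = [
    { document_name: "document1", document_text: "contents of document 1..." }
];
const myChunks = await chunkit(documents, { modelCacheDir: './models' });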
chunkit accepts an array of document objects and an optional configuration object. Here are the details for each parameter:

documents: an array of documents. Each document is an object containing document_name and document_text.

documents = [
    { document_name: "document1", document_text: "..." },
    { document_name: "document2", document_text: "..." },
    ...
]
Chunkit Options Object:

- logging: Boolean (optional, default false) - Enables logging of detailed processing steps.
- maxTokenSize: Integer (optional, default 500) - Maximum token size for each chunk.
- similarityThreshold: Float (optional, default 0.5) - Threshold to determine if sentences are similar enough to be in the same chunk. A higher value demands higher similarity.
- dynamicThresholdLowerBound: Float (optional, default 0.4) - Minimum possible dynamic similarity threshold.
- dynamicThresholdUpperBound: Float (optional, default 0.8) - Maximum possible dynamic similarity threshold.
- numSimilaritySentencesLookahead: Integer (optional, default 3) - Number of sentences to look ahead for calculating similarity.
- combineChunks: Boolean (optional, default true) - Determines whether to rebalance and combine chunks into larger ones up to the max token limit.
- combineChunksSimilarityThreshold: Float (optional, default 0.5) - Threshold for combining chunks based on similarity during the rebalance and combining phase.
- onnxEmbeddingModel: String (optional, default Xenova/all-MiniLM-L6-v2) - ONNX model used for creating embeddings.
- dtype: String (optional, default fp32) - Precision of the embedding model (options: fp32, fp16, q8, q4).
- localModelPath: String (optional, default null) - Local path to save and load models (example: ./models).
- modelCacheDir: String (optional, default null) - Directory to cache downloaded models (example: ./models).
- returnEmbedding: Boolean (optional, default false) - If set to true, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in onnxEmbeddingModel.
- returnTokenLength: Boolean (optional, default false) - If set to true, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in onnxEmbeddingModel.
- chunkPrefix: String (optional, default null) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
- excludeChunkPrefixInResults: Boolean (optional, default false) - If set to true, the chunk prefix will be removed from the results. This is useful when you want to remove the prefix from the results while still maintaining the prefix for embedding calculations.

The output is an array of chunks, each containing the following properties:
- document_id: Integer - A unique identifier for the document (current timestamp in milliseconds).
- document_name: String - The name of the document being chunked (if provided).
- number_of_chunks: Integer - The total number of final chunks returned from the input text.
- chunk_number: Integer - The number of the current chunk.
- model_name: String - The name of the embedding model used.
- dtype: String - The precision of the embedding model used (options: fp32, fp16, q8, q4).
- text: String - The chunked text.
- embedding: Array - The embedding vector (if returnEmbedding is true).
- token_length: Integer - The token length (if returnTokenLength is true).

It is important to understand how the model you choose behaves when chunking your text. It is highly recommended to tweak all the parameters using the Web UI to get the best results for your use case. Web UI README
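For example, a minimal sketch (option values are illustrative) that requests embeddings and token lengths and then reads the output properties listed above:

import { chunkit } from 'semantic-chunking';

const documents = [
    { document_name: "report", document_text: "First sentence. Second sentence. Third sentence." }
];

// Illustrative option values; every option used here is documented above
const myChunks = await chunkit(documents, {
    maxTokenSize: 300,
    returnEmbedding: true,
    returnTokenLength: true
});

for (const chunk of myChunks) {
    console.log(chunk.chunk_number, chunk.token_length, chunk.text, chunk.embedding.length);
}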
Example 1: Basic usage with custom similarity threshold:
import { chunkit } from 'semantic-chunking';
import fs from 'fs';
async function main() {
    const documents = [
        {
            document_name: "test document",
            document_text: await fs.promises.readFile('./test.txt', 'utf8')
        }
    ];

    let myChunks = await chunkit(documents, { similarityThreshold: 0.4 });

    myChunks.forEach((chunk, index) => {
        console.log(`\n-- Chunk ${index + 1} --`);
        console.log(chunk);
    });
}
main();
Example 2: Chunking with a small max token size:
import { chunkit } from 'semantic-chunking';
const frogText = "A frog hops into a deli and croaks to the cashier, \"I'll have a sandwich, please.\" The cashier, surprised, quickly makes the sandwich and hands it over. The frog takes a big bite, looks around, and then asks, \"Do you have any flies to go with this?\" The cashier, taken aback, replies, \"Sorry, we're all out of flies today.\" The frog shrugs and continues munching on its sandwich, clearly unfazed by the lack of fly toppings. Just another day in the life of a sandwich-loving amphibian! 🐸🥪";
const documents = [
    {
        document_name: "frog document",
        document_text: frogText
    }
];

async function main() {
    let myFrogChunks = await chunkit(documents, { maxTokenSize: 65 });
    console.log("myFrogChunks", myFrogChunks);
}
main();
Look at the example/example-chunkit.js file for a more complex example of using all the optional parameters.
The behavior of the chunkit function can be finely tuned using several optional parameters in the options object. Understanding how each parameter affects the function can help you optimize the chunking process for your specific requirements.
- logging: false
- maxTokenSize: 500
- similarityThreshold: 0.456
- dynamicThresholdLowerBound: 0.2
- dynamicThresholdUpperBound: 0.8
- numSimilaritySentencesLookahead: 2
- combineChunks: true - Determines whether to rebalance and combine chunks into larger ones up to maxTokenSize. This can enhance the readability of the output by grouping closely related content more effectively.
- combineChunksSimilarityThreshold: 0.4 - Similar to similarityThreshold, but specifically for rebalancing existing chunks. Adjusting this parameter can help in fine-tuning the granularity of the final chunks.
- onnxEmbeddingModel: Xenova/all-MiniLM-L6-v2
- dtype: fp32 - Precision of the embedding model (options: fp32, fp16, q8, q4). fp32 is the highest precision but also the largest size and slowest to load. q8 is a good compromise between size and speed if the model supports it. All models support fp32, but only some support fp16, q8, and q4.

Model | Precision | Link | Size |
---|---|---|---|
nomic-ai/nomic-embed-text-v1.5 | fp32, q8 | https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 | 548 MB, 138 MB |
thenlper/gte-base | fp32 | https://huggingface.co/thenlper/gte-base | 436 MB |
Xenova/all-MiniLM-L6-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-MiniLM-L6-v2 | 23 MB, 45 MB, 90 MB |
Xenova/paraphrase-multilingual-MiniLM-L12-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2 | 470 MB, 235 MB, 118 MB |
Xenova/all-distilroberta-v1 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-distilroberta-v1 | 326 MB, 163 MB, 82 MB |
BAAI/bge-base-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-base-en-v1.5 | 436 MB |
BAAI/bge-small-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-small-en-v1.5 | 133 MB |
yashvardhan7/snowflake-arctic-embed-m-onnx | fp32 | https://huggingface.co/yashvardhan7/snowflake-arctic-embed-m-onnx | 436 MB |
Each of these parameters allows you to customize the chunkit function to better fit the text size, content complexity, and performance requirements of your application.
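For instance, a minimal sketch (model name, precision, and prefix taken from the table and options above; the document text is a placeholder) that swaps in the quantized nomic-ai/nomic-embed-text-v1.5 model:

import { chunkit } from 'semantic-chunking';

const documents = [
    { document_name: "document1", document_text: "contents of document 1..." }
];

// Illustrative configuration: nomic-embed-text-v1.5 at q8 precision (see the models table above),
// with the task prefix this model family expects
const myChunks = await chunkit(documents, {
    onnxEmbeddingModel: 'nomic-ai/nomic-embed-text-v1.5',
    dtype: 'q8',
    chunkPrefix: 'search_document'
});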
The Semantic Chunking Web UI allows you to experiment with the chunking parameters and see the results in real-time. This tool provides a visual way to test and configure the semantic-chunking library's settings to get optimal results for your specific use case. Once you've found the best settings, you can generate code to implement them in your project.
cramit - 🧼 The Quick & Dirty

There is an additional function you can import to just "cram" sentences together until they meet your target token size, for when you just need quick, high-density chunks.
cramit accepts an array of document objects and an optional configuration object. Here are the details for each parameter:

documents: an array of documents. Each document is an object containing document_name and document_text.

documents = [
    { document_name: "document1", document_text: "..." },
    { document_name: "document2", document_text: "..." },
    ...
]
Cramit Options Object:

- logging: Boolean (optional, default false) - Enables logging of detailed processing steps.
- maxTokenSize: Integer (optional, default 500) - Maximum token size for each chunk.
- onnxEmbeddingModel: String (optional, default Xenova/all-MiniLM-L6-v2) - ONNX model used for creating embeddings.
- dtype: String (optional, default fp32) - Precision of the embedding model (options: fp32, fp16, q8, q4).
- localModelPath: String (optional, default null) - Local path to save and load models (example: ./models).
- modelCacheDir: String (optional, default null) - Directory to cache downloaded models (example: ./models).
- returnEmbedding: Boolean (optional, default false) - If set to true, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in onnxEmbeddingModel.
- returnTokenLength: Boolean (optional, default false) - If set to true, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in onnxEmbeddingModel.
- chunkPrefix: String (optional, default null) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
- excludeChunkPrefixInResults: Boolean (optional, default false) - If set to true, the chunk prefix will be removed from the results. This is useful when you want to remove the prefix from the results while still maintaining the prefix for embedding calculations.

Basic usage:
import { cramit } from 'semantic-chunking';
let frogText = "A frog hops into a deli and croaks to the cashier, \"I'll have a sandwich, please.\" The cashier, surprised, quickly makes the sandwich and hands it over. The frog takes a big bite, looks around, and then asks, \"Do you have any flies to go with this?\" The cashier, taken aback, replies, \"Sorry, we're all out of flies today.\" The frog shrugs and continues munching on its sandwich, clearly unfazed by the lack of fly toppings. Just another day in the life of a sandwich-loving amphibian! 🐸🥪";
// initialize documents array and add the frog text to it
let documents = [];
documents.push({
    document_name: "frog document",
    document_text: frogText
});

// call the cramit function passing in the documents array and the options object
async function main() {
    let myFrogChunks = await cramit(documents, { maxTokenSize: 65 });
    console.log("myFrogChunks", myFrogChunks);
}
main();
Look at the example/example-cramit.js file in the root of this project for a more complex example of using all the optional parameters.
sentenceit - ✂️ When You Just Need a Clean Split

There is an additional function you can import to simply split text into sentences.
sentenceit accepts an array of document objects and an optional configuration object. Here are the details for each parameter:

documents: an array of documents. Each document is an object containing document_name and document_text.

documents = [
    { document_name: "document1", document_text: "..." },
    { document_name: "document2", document_text: "..." },
    ...
]
Sentenceit Options Object:

- logging: Boolean (optional, default false) - Enables logging of detailed processing steps.
- onnxEmbeddingModel: String (optional, default Xenova/all-MiniLM-L6-v2) - ONNX model used for creating embeddings.
- dtype: String (optional, default fp32) - Precision of the embedding model (options: fp32, fp16, q8, q4).
- localModelPath: String (optional, default null) - Local path to save and load models (example: ./models).
- modelCacheDir: String (optional, default null) - Directory to cache downloaded models (example: ./models).
- returnEmbedding: Boolean (optional, default false) - If set to true, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in onnxEmbeddingModel.
- returnTokenLength: Boolean (optional, default false) - If set to true, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in onnxEmbeddingModel.
- chunkPrefix: String (optional, default null) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
- excludeChunkPrefixInResults: Boolean (optional, default false) - If set to true, the chunk prefix will be removed from the results. This is useful when you want to remove the prefix from the results while still maintaining the prefix for embedding calculations.

Basic usage:
import { sentenceit } from 'semantic-chunking';
let duckText = "A duck waddles into a bakery and quacks to the baker, \"I'll have a loaf of bread, please.\" The baker, amused, quickly wraps the loaf and hands it over. The duck takes a nibble, looks around, and then asks, \"Do you have any seeds to go with this?\" The baker, chuckling, replies, \"Sorry, we're all out of seeds today.\" The duck nods and continues nibbling on its bread, clearly unfazed by the lack of seed toppings. Just another day in the life of a bread-loving waterfowl! 🦆🍞";
// initialize documents array and add the duck text to it
let documents = [];
documents.push({
    document_name: "duck document",
    document_text: duckText
});

// call the sentenceit function passing in the documents array and the options object
async function main() {
    let myDuckChunks = await sentenceit(documents, { returnEmbedding: true });
    console.log("myDuckChunks", myDuckChunks);
}
main();
Look at the example/example-sentenceit.js file in the root of this project for a more complex example of using all the optional parameters.
Fill out the tools/download-models.list.json file with a list of models you want pre-downloaded, and their precisions (see the Curated ONNX Embedding Models section above for a list of models to try). It is pre-populated with the list above; remove any models you don't want to download.

Run the npm run download-models command to download the models to the models directory.
If you are using this library for a RAG application, consider using the chunkPrefix option to add a prefix to each chunk. This can help improve the quality of the embeddings and reduce the amount of context that needs to be passed to the LLM for embedding models that support task prefixes.
Chunk your large document like this:
const largeDocumentText = await fs.promises.readFile('./large-document.txt', 'utf8');
const documents = [
    {
        document_name: "large document",
        document_text: largeDocumentText
    }
];
const myDocumentChunks = await chunkit(documents, { chunkPrefix: "search_document", returnEmbedding: true });
Get your search queries ready like this (use cramit for a quick large chunk):
const documents = [
{ document_text: "What is the capital of France?" }
];
const mySearchQueryChunk = await cramit(documents, { chunkPrefix: "search_query", returnEmbedding: true });
Now you can use the myDocumentChunks and mySearchQueryChunk results in your RAG application, feed them to a vector database, or find the closest match using cosine similarity in memory. The possibilities are many!
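For instance, a minimal sketch (plain JavaScript, not part of the library) of the in-memory cosine similarity approach, using the embedding arrays returned when returnEmbedding is true:

// Cosine similarity between two embedding vectors (arrays of numbers)
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document chunk against the search query chunk and pick the best match
const queryEmbedding = mySearchQueryChunk[0].embedding;
const bestMatch = myDocumentChunks
    .map(chunk => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)[0];

console.log(bestMatch.score, bestMatch.chunk.text);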
Happy Chunking!
If you enjoy this library, please consider sending me a tip to support my work 😀