
semantic-text-splitter
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.
This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.
from semantic_text_splitter import TextSplitter
# Maximum number of characters in a chunk
max_characters = 1000
# Optionally, you can have the splitter not trim whitespace for you
splitter = TextSplitter(max_characters)
# splitter = TextSplitter(max_characters, trim=False)
chunks = splitter.chunks("your document text")
You also have the option of specifying your chunk capacity as a range.
Once a chunk has reached a length that falls within the range it will be returned.
It is always possible that a chunk may be returned that is less than the start
value, as adding the next piece of text may have made it larger than the end
capacity.
from semantic_text_splitter import TextSplitter
# Maximum number of characters in a chunk. Will fill up the
# chunk until it is somewhere in this range.
splitter = TextSplitter((200, 1000))
chunks = splitter.chunks("your document text")
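As a concrete sketch of the note above (the text and lengths are made up for illustration), a chunk can come back below the start of the range when adding the next section would overshoot the end:
from semantic_text_splitter import TextSplitter

# Hypothetical text: a ~140-character paragraph followed by a ~900-character one.
short_paragraph = ("Short sentence. " * 9).strip()
long_paragraph = ("A much longer sentence that keeps going. " * 22).strip()
text = short_paragraph + "\n\n" + long_paragraph

splitter = TextSplitter((200, 1000))
chunks = splitter.chunks(text)
# Merging both paragraphs would exceed 1000 characters, so the first chunk is
# expected to be just the short paragraph, even though it is under 200 characters.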
You can also split by the number of tokens counted by a Hugging Face tokenizer:
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer
# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, max_tokens)
chunks = splitter.chunks("your document text")
Or count tokens with OpenAI's tiktoken, given a model name:
from semantic_text_splitter import TextSplitter
# Maximum number of tokens in a chunk
max_tokens = 1000
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo", max_tokens)
chunks = splitter.chunks("your document text")
You can also supply a custom callback that computes the length of each chunk:
from semantic_text_splitter import TextSplitter
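# Chunk capacity of 1000, as measured by the custom callback (here, characters via len)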
splitter = TextSplitter.from_callback(lambda text: len(text), 1000)
chunks = splitter.chunks("your document text")
All of the above examples also work with Markdown text. You can use the MarkdownSplitter in the same ways as the TextSplitter.
from semantic_text_splitter import MarkdownSplitter
# Maximum number of characters in a chunk
max_characters = 1000
# Optionally, you can have the splitter not trim whitespace for you
splitter = MarkdownSplitter(max_characters)
# splitter = MarkdownSplitter(max_characters, trim=False)
chunks = splitter.chunks("# Header\n\nyour document text")
To preserve as much semantic meaning within a chunk as possible, each chunk is composed of the largest semantic units that can fit in the next given chunk. For each splitter type, there is a defined set of semantic levels: the splitter selects the highest level whose first section still fits within the chunk capacity, then merges as many neighboring sections at that level or above as will fit. The boundaries used to split the text when using the chunks method, in ascending order, are:
TextSplitter Semantic Levels
1. Characters
2. Unicode Grapheme Cluster Boundaries
3. Unicode Word Boundaries
4. Unicode Sentence Boundaries
5. Ascending sequence length of newlines (a newline is \r\n, \n, or \r). Each unique length of consecutive newline sequences is treated as its own semantic level, so a sequence of 2 newlines is a higher level than a sequence of 1 newline, and so on (a short sketch below illustrates this).
Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
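As a rough sketch of the newline levels (the text and capacity here are made up for illustration), a break of two newlines is preferred over the single newline inside the first paragraph once the text no longer fits in one chunk:
from semantic_text_splitter import TextSplitter

# Hypothetical input: two paragraphs, with a single newline inside the first one.
text = "First sentence.\nStill the first paragraph.\n\nSecond paragraph."

splitter = TextSplitter(60)
chunks = splitter.chunks(text)
# The whole text is just over 60 characters, so a split is needed. The break is
# expected to fall at the two-newline boundary, keeping the first paragraph intact:
# ['First sentence.\nStill the first paragraph.', 'Second paragraph.']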
MarkdownSplitter Semantic Levels
Markdown is parsed according to the CommonMark spec, along with some optional features such as GitHub Flavored Markdown.
Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
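As a hedged sketch of what this means in practice (the document and capacity are made up), Markdown structure such as a heading sits at a higher semantic level than the sentences beneath it, so a split should land at a heading boundary rather than in the middle of a section:
from semantic_text_splitter import MarkdownSplitter

# Hypothetical document: two sections, each introduced by a heading.
doc = "# First\n\nSome text under the first heading.\n\n# Second\n\nMore text."

splitter = MarkdownSplitter(60)
chunks = splitter.chunks(doc)
# The document is just over 60 characters, so it cannot stay in one chunk.
# The break is expected to fall before "# Second", keeping each section whole:
# ['# First\n\nSome text under the first heading.', '# Second\n\nMore text.']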
There are lots of methods for determining sentence breaks, all with varying degrees of accuracy, and many requiring ML models to do so. Rather than trying to find the perfect sentence breaks, we rely on the Unicode method of sentence boundaries, which in most cases is good enough for finding a decent semantic breaking point if a paragraph is too large, and avoids the performance penalties of many other methods.
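For instance (a minimal sketch with a made-up paragraph and capacity), a paragraph with no newlines that exceeds the chunk capacity should be split at sentence boundaries rather than mid-word:
from semantic_text_splitter import TextSplitter

# Hypothetical paragraph: three sentences, no newlines, ~84 characters in total.
paragraph = (
    "This is the first sentence. "
    "This is the second sentence. "
    "This is the third sentence."
)

splitter = TextSplitter(60)
chunks = splitter.chunks(paragraph)
# The paragraph cannot fit in one 60-character chunk, so the splits are
# expected to land at sentence boundaries, e.g.:
# ['This is the first sentence. This is the second sentence.',
#  'This is the third sentence.']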
This crate was inspired by LangChain's TextSplitter. But, looking into the implementation, there was potential for better performance as well as better semantic chunking.
A big thank you to the Unicode team for their icu_segmenter crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.