# code-splitter Python Bindings

The `code-splitter` Python package provides bindings for the code-splitter Rust crate. It uses the tree-sitter parsing library and tokenizers to split code into semantically meaningful chunks. This is particularly useful for Retrieval-Augmented Generation (RAG), a technique that enhances the generation capabilities of Large Language Models (LLMs) by drawing on external knowledge sources.
## Installation

You can install the package from PyPI:

```shell
pip install code-splitter
```
## Usage

Here's an example of how to use the package:

```python
from code_splitter import Language, CharSplitter

# Read the source file as bytes; the splitter expects bytes input.
with open("example.py", "rb") as f:
    code = f.read()

splitter = CharSplitter(Language.Python, max_size=200)
chunks = splitter.split(code)

for chunk in chunks:
    print(f"Start: {chunk.start}, End: {chunk.end}, Size: {chunk.size}")
    print(chunk.text)
    print()
```
This example uses the `CharSplitter` to split Python code into chunks of at most 200 characters. The `Chunk` objects contain information about the start and end lines, the size, and the actual text of each chunk.
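To illustrate the general idea behind character-based splitting, here is a simplified pure-Python sketch. This is *not* the library's actual implementation: the real `CharSplitter` walks the tree-sitter parse tree so that chunks align with syntactic boundaries, whereas this sketch just accumulates whole lines greedily.

```python
def naive_char_split(code: str, max_size: int) -> list[str]:
    """Greedy line-based chunking: accumulate whole lines until adding
    the next line would exceed max_size characters, then start a new
    chunk. Illustration only; the real CharSplitter splits along the
    tree-sitter parse tree instead of raw lines."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for line in code.splitlines(keepends=True):
        if current and current_len + len(line) > max_size:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Joining the resulting chunks reproduces the original text, and each chunk stays within `max_size` characters (except a single line longer than `max_size`, which becomes its own chunk).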
## Available Splitters

The package provides the following splitters:

- `CharSplitter`: Splits code based on character count.
- `WordSplitter`: Splits code based on word count.
- `TiktokenSplitter`: Splits code based on the Tiktoken tokenizer.
- `HuggingfaceSplitter`: Splits code based on HuggingFace tokenizers.
## Supported Languages

The following programming languages are currently supported (via the `Language` enum):

- `Language.Golang`
- `Language.Markdown`
- `Language.Python`
- `Language.Rust`
## Examples

Here are some examples of splitting code using different splitters and languages:
### Split Python Code by Characters

```python
from code_splitter import Language, CharSplitter

# `code` is the source read as bytes, as in the Usage example above.
splitter = CharSplitter(Language.Python, max_size=200)
chunks = splitter.split(code)
```
### Split Markdown by Words

```python
from code_splitter import Language, WordSplitter

splitter = WordSplitter(Language.Markdown, max_size=50)
chunks = splitter.split(code)
```
### Split Rust Code by Tiktoken Tokenizer

```python
from code_splitter import Language, TiktokenSplitter

splitter = TiktokenSplitter(Language.Rust, max_size=100)
chunks = splitter.split(code)
```
### Split Go Code by HuggingFace Tokenizer

```python
from code_splitter import Language, HuggingfaceSplitter

splitter = HuggingfaceSplitter(
    Language.Golang,
    max_size=100,
    pretrained_model_name_or_path="bert-base-cased",
)
chunks = splitter.split(code)
```
For more examples, please refer to the `tests` directory in the repository.
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.