
A comprehensive document chunking library with context generation for RAG applications
A comprehensive Python library for document chunking with intelligent context generation, designed specifically for RAG (Retrieval-Augmented Generation) applications.
pip install contextual-chunker
from document_chunker import DocumentChunker, create_chunking_config
# Create configuration
config = create_chunking_config(
    openai_api_key="your-openai-api-key",
    chunk_size=1500,
    chunk_overlap=100,
    chunking_strategy="recursive",
    save_contexts=True
)
# Initialize chunker
chunker = DocumentChunker(config)
# Process PDF files
results = chunker.process_pdf_files(["document.pdf"])
# Or process a directory
results = chunker.process_directory("./documents")
# Save results
output_file = chunker.save_results(results)
print(f"Results saved to: {output_file}")
# Process a single PDF file
document-chunker document.pdf --chunk-size 1000 --output-dir ./output
# Process a directory with custom settings
document-chunker ./documents --strategy semantic --chunk-size 1500 --save-txt
# Process without context generation
document-chunker ./documents --no-context --chunk-size 800
openai_api_key: Your OpenAI API key (required for context generation)
chunk_size: Maximum size of each chunk in characters (default: 1000)
chunk_overlap: Overlap between chunks in characters (default: 200)
chunking_strategy: "recursive" or "semantic" (default: "recursive")
save_contexts: Enable AI context generation (default: True)
context_model: OpenAI model for context generation (default: "gpt-4o-mini")
parallel_threads: Number of threads for parallel processing (default: 5)
output_dir: Directory for output files (default: "./chunked_documents")

Recursive chunking splits text using a hierarchy of separators (paragraphs → sentences → words → characters) while respecting chunk size limits.
Semantic chunking splits on paragraph and sentence boundaries first, so each chunk stays coherent and keeps its meaning.
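The separator hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not the library's actual implementation: the function names, the separator list, and the decision to strip whitespace from chunks are all assumptions, and chunk overlap is left out to keep the example short.

```python
from typing import List, Sequence

# Assumed separator hierarchy, coarsest first: paragraphs, lines, sentences, words.
SEPARATORS = ("\n\n", "\n", ". ", " ")


def _split(text: str, chunk_size: int, separators: Sequence[str]) -> List[str]:
    """Split text into pieces of at most chunk_size characters, losing nothing."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator to every part except the last one.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece  # still fits: keep packing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Return stripped, non-empty chunks of at most chunk_size characters."""
    return [c.strip() for c in _split(text, chunk_size, separators) if c.strip()]


if __name__ == "__main__":
    sample = ("First paragraph here.\n\n"
              "Second paragraph is a bit longer. It has two sentences.\n\n"
              "Third.")
    for chunk in recursive_split(sample, chunk_size=40):
        print(repr(chunk))
```

The sketch packs pieces greedily at the coarsest level that fits and only falls back to a finer separator for pieces that are still too large, which is why short paragraphs stay whole while long ones are broken at sentence or word boundaries.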
The library can automatically generate contextual information for each chunk using OpenAI's models. This context helps improve retrieval accuracy in RAG applications by providing additional information about where each chunk fits within the larger document.
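To make the idea concrete, here is a rough sketch of how per-chunk context might be produced and attached. The prompt wording, the function names, and the record fields are hypothetical and are not taken from the library; the model call is passed in as a plain callable so the example runs without an API key.

```python
from typing import Callable, Dict, List

# Hypothetical prompt; the library's real prompt is not shown in this README.
PROMPT_TEMPLATE = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from the document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Write one or two sentences situating this chunk within the document."
)


def build_context_prompt(document: str, chunk: str) -> str:
    """Fill the prompt template with the full document and one chunk."""
    return PROMPT_TEMPLATE.format(document=document, chunk=chunk)


def contextualize(document: str, chunks: List[str],
                  generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Attach a generated context string to every chunk.

    `generate` stands in for a call to a language model: it takes a prompt
    and returns the model's text response.
    """
    records = []
    for chunk in chunks:
        context = generate(build_context_prompt(document, chunk)).strip()
        records.append({
            "chunk": chunk,
            "context": context,
            # Text a retriever could embed: context first, then the chunk.
            "text_for_embedding": f"{context}\n\n{chunk}",
        })
    return records


def fake_model(prompt: str) -> str:
    """Placeholder used instead of a real model call."""
    return "Placeholder context."


if __name__ == "__main__":
    doc = "Intro to the report.\n\nRevenue grew in Q2."
    for record in contextualize(doc, doc.split("\n\n"), fake_model):
        print(record["text_for_embedding"])
        print("---")
```

In a real setup, `fake_model` would be replaced by a function that sends the prompt to the configured context model and returns its reply.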
Structured output containing:
Simple text file with all chunks for easy review and debugging.
document-chunker [OPTIONS] INPUT
Arguments:
  INPUT                      Input file or directory path

Options:
  -o, --output-dir TEXT      Output directory (default: ./chunked_documents)
  -s, --chunk-size INT       Chunk size in characters (default: 1000)
  -p, --chunk-overlap INT    Chunk overlap in characters (default: 200)
  -t, --strategy CHOICE      Chunking strategy: recursive|semantic (default: recursive)
  --no-context               Disable context generation
  --context-model TEXT       OpenAI model for context (default: gpt-4o-mini)
  -j, --threads INT          Parallel threads (default: 5)
  -e, --extensions LIST      File extensions to process (default: .pdf .txt .md)
  --save-txt                 Also save chunks to text file
  -r, --recursive            Process directories recursively
Set your OpenAI API key:
export OPENAI_API_KEY="your-api-key-here"
Or create a .env file:
OPENAI_API_KEY=your-api-key-here
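The README does not say whether the library reads the environment variable or the .env file on its own. If you prefer to resolve the key yourself and pass it in explicitly, a small helper along these lines would work; `load_api_key` is a hypothetical name, and the .env parsing is deliberately minimal (KEY=value lines only).

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(env_file: str = ".env",
                 name: str = "OPENAI_API_KEY") -> Optional[str]:
    """Return the key from the environment, falling back to a .env file."""
    value = os.environ.get(name)
    if value:
        return value

    path = Path(env_file)
    if not path.is_file():
        return None

    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not KEY=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes.
            return raw.strip().strip('"').strip("'")
    return None


if __name__ == "__main__":
    api_key = load_api_key()
    print("API key found" if api_key else "API key not set")
```

The returned value can then be passed as the `openai_api_key` argument of `create_chunking_config`.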
from document_chunker import DocumentChunker, create_chunking_config
config = create_chunking_config(
    chunk_size=1000,
    save_contexts=False  # Disable context generation
)
chunker = DocumentChunker(config)
results = chunker.process_pdf_files(["research_paper.pdf"])
chunker.save_results(results)
config = create_chunking_config(
    openai_api_key="sk-...",
    chunk_size=1500,
    chunk_overlap=150,
    chunking_strategy="semantic",
    context_model="gpt-4",
    parallel_threads=8,
    save_contexts=True
)
chunker = DocumentChunker(config)
results = chunker.process_directory("./research_papers", recursive=True)
output_file = chunker.save_results(results)
# Also save as text file
from document_chunker import save_chunks_to_txt
save_chunks_to_txt(output_file, "chunks.txt")
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.
We found that contextual-chunker demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.