
crabparser
🦀 Blazingly fast text parsing library with Rust backend - 10x faster than pure Python with support for PDF, DOCX, CSV and 12+ programming languages
High-performance text parsing library written in Rust with Python bindings
CrabParser is a blazingly fast text parsing library that splits documents and code files into semantic chunks. Built with Rust for maximum performance and memory efficiency, it provides Python bindings for easy integration into your projects.
pip install crabparser
from crabparser import TextParser, ChunkedText

# Create a parser instance
parser = TextParser(
    chunk_size=1000,          # Maximum characters per chunk
    respect_paragraphs=True,  # Keep paragraphs together
    respect_sentences=True    # Split at sentence boundaries
)

# Parse text
text = "Your long document text here..."
chunks = parser.parse(text)
print(f"Split into {len(chunks)} chunks")

# Parse with memory-efficient ChunkedText
chunked = parser.parse_chunked(text)
print(f"First chunk: {chunked[0]}")
print(f"Total size: {chunked.total_size} bytes")

# Parse files directly (auto-detects format)
chunks = parser.parse_file("document.pdf")  # Works with PDF, DOCX, CSV, TXT, and code files

# Save chunks to files
parser.save_chunks(chunks, "output_dir", "document")
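Per the API reference below, save_chunks returns an int. A minimal follow-on sketch, assuming that value is the number of chunk files written (our assumption, not documented behavior):

# Continuing the quick start above: capture the return value.
written = parser.save_chunks(chunks, "output_dir", "document")
print(f"Wrote {written} chunk files under output_dir/")  # assumes int = files written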
Semantic code parsing respects function and class boundaries, keeping each function or class whole within a chunk where possible (see the boundary-check sketch after the code-parsing example below).

TextParser is the main parser class for processing text and files.
parser = TextParser(
    chunk_size=1000,          # Maximum size of each chunk
    respect_paragraphs=True,  # Maintain paragraph boundaries
    respect_sentences=True    # Maintain sentence boundaries
)
Methods:
parse(text: str) -> List[str] - Parse text into chunks
parse_chunked(text: str) -> ChunkedText - Memory-efficient parsing
parse_file(path: str) -> List[str] - Parse any supported file
parse_file_chunked(path: str) -> ChunkedText - Memory-efficient file parsing
save_chunks(chunks, output_dir, base_name) -> int - Save chunks to files

ChunkedText is a memory-efficient container that keeps chunks in Rust memory.
# Access chunks without loading all into Python memory
chunked[0] # First chunk
chunked[-1] # Last chunk
len(chunked) # Number of chunks
chunked.total_size # Total size in bytes
chunked.source_file # Source file path (if applicable)
# Iteration
for chunk in chunked:
    process(chunk)
# Get slice of chunks
batch = chunked.get_slice(0, 10) # Get first 10 chunks
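get_slice also supports batch-style processing. A sketch, assuming the second argument is an exclusive end index (consistent with "first 10 chunks" above) and reusing the process() placeholder from the iteration example:

# Walk a large ChunkedText in fixed-size batches so only one batch
# of chunks is materialized in Python at a time. Batch size is arbitrary.
batch_size = 10
for start in range(0, len(chunked), batch_size):
    end = min(start + batch_size, len(chunked))
    batch = chunked.get_slice(start, end)
    for chunk in batch:
        process(chunk)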
from crabparser import TextParser
parser = TextParser(chunk_size=2000)
# Parse a large PDF file
chunks = parser.parse_file("research_paper.pdf")
# Process chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")
# Parse Python code while respecting function boundaries
parser = TextParser(
    chunk_size=1500,
    respect_paragraphs=True  # Keeps functions/classes together
)
chunks = parser.parse_file("main.py")
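To see the boundary handling in action, a quick check continuing from chunks above (the expectation that each chunk opens at a top-level def/class line is our assumption about the semantic splitter, not documented output):

# Print where each chunk begins; with function/class boundaries
# respected, these lines should not fall mid-definition.
for i, chunk in enumerate(chunks):
    first_line = chunk.splitlines()[0] if chunk else ""
    print(f"Chunk {i + 1} starts at: {first_line!r}")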
from pathlib import Path
from crabparser import TextParser
parser = TextParser(chunk_size=1000)
output_base = Path("output")
for file_path in Path("documents").glob("*.pdf"):
    # Use memory-efficient parsing
    chunked = parser.parse_file_chunked(str(file_path))

    # Process without loading all chunks into memory
    for i in range(len(chunked)):
        chunk = chunked[i]  # Only loads this chunk
        # Process chunk...

    # Save results
    parser.save_chunks(chunked, str(output_base), file_path.stem)
CrabParser is designed for speed and memory efficiency.
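The "10x faster than pure Python" figure in the tagline is the project's own claim. A rough, machine-dependent way to sanity-check it is to time parse() against a simple pure-Python paragraph packer (naive_parse below is a hypothetical baseline written for this sketch, not part of crabparser):

import time

from crabparser import TextParser

def naive_parse(text, chunk_size=1000):
    # Pure-Python baseline: pack whole paragraphs into chunks.
    chunks, buf = [], ""
    for para in text.split("\n\n"):
        if buf and len(buf) + len(para) + 2 > chunk_size:
            chunks.append(buf)
            buf = para
        else:
            buf = f"{buf}\n\n{para}" if buf else para
    if buf:
        chunks.append(buf)
    return chunks

# Synthetic input: ~5000 paragraphs of filler text.
text = ("Lorem ipsum dolor sit amet. " * 40 + "\n\n") * 5000

parser = TextParser(chunk_size=1000)
t0 = time.perf_counter()
rust_chunks = parser.parse(text)
rust_s = time.perf_counter() - t0

t0 = time.perf_counter()
py_chunks = naive_parse(text, chunk_size=1000)
py_s = time.perf_counter() - t0

print(f"crabparser:  {len(rust_chunks)} chunks in {rust_s:.3f}s")
print(f"pure Python: {len(py_chunks)} chunks in {py_s:.3f}s")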
MIT License - see the LICENSE file for details.
Made with 🦀 and ❤️ by the open-source community