Kreuzberg

A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.
📖 Complete Documentation
Framework Overview
Document Intelligence Capabilities
- Text Extraction: High-fidelity text extraction preserving document structure and formatting
- Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
- Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
- OCR Integration: Multiple OCR engines (Tesseract, EasyOCR, PaddleOCR) with automatic fallback
- Table Detection: Structured table extraction with cell-level precision via GMFT integration
- Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)
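The automatic fallback mentioned in the OCR Integration bullet can be pictured as trying engines in order until one returns usable text. The sketch below illustrates that idea only; `run_with_fallback`, `OcrEngine`, and the engine callables are hypothetical names, not Kreuzberg's API, and treating empty output as a failure is an assumption.

```python
from typing import Callable, Iterable

# Hypothetical type: an OCR engine is any callable taking image bytes and returning text.
OcrEngine = Callable[[bytes], str]


def run_with_fallback(image: bytes, engines: Iterable[tuple[str, OcrEngine]]) -> tuple[str, str]:
    """Try each (name, engine) pair in order; return (name, text) from the first that succeeds."""
    errors: list[str] = []
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception as exc:  # an engine may be missing or may fail on this input
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():  # assumption: empty output counts as a failure
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all OCR engines failed: " + "; ".join(errors))
```

Ordering the list from fastest to most robust engine keeps the common case cheap while still recovering from a failing engine.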
Technical Architecture
- Performance: 30+ files/second on tiny files in the project's benchmarks, reported as the highest throughput among the Python document processing frameworks tested
- Resource Efficiency: 71MB installation, ~360MB runtime memory footprint
- Extensibility: Plugin architecture for custom extractors via the Extractor base class
- API Design: Synchronous and asynchronous APIs with consistent interfaces
- Type Safety: Complete type annotations throughout the codebase
Open Source Foundation
Kreuzberg leverages established open source technologies:
- Pandoc: Universal document converter for robust format support
- PDFium: Google's PDF rendering engine for accurate PDF processing
- Tesseract: Open source OCR engine (originally developed at HP, later sponsored by Google) for text recognition
- Python-docx/pptx: Native Microsoft Office format support
Quick Start
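# Extract text from a PDF and write it to output.md
uvx kreuzberg extract document.pdf > output.md
# Install all optional extras, enable OCR, and emit Markdown
uvx --from "kreuzberg[all]" kreuzberg extract invoice.pdf --ocr --format markdown
# Include document metadata and emit JSON
uvx kreuzberg extract report.pdf --show-metadata --format json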
Python Usage
Async (recommended for web apps):
import asyncio
from kreuzberg import extract_file

async def main() -> None:
    # await is only valid inside a coroutine, so the call lives in main()
    result = await extract_file("presentation.pptx")
    print(result.content)
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")

asyncio.run(main())
Sync (for scripts and CLI tools):
from kreuzberg import extract_file_sync
result = extract_file_sync("report.docx")
print(result.content)
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
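For batches of files, the async API lends itself to bounded concurrency. The helper below is a sketch rather than part of Kreuzberg: `extract_many` and its `limit` parameter are hypothetical, and the extractor is passed in as an argument so that any coroutine function with the shape of `extract_file` can be used.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[T]],
    limit: int = 4,
) -> list[T]:
    """Run `extract` over `paths` with at most `limit` calls in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def one(path: str) -> T:
        async with semaphore:  # cap the number of concurrent extractions
            return await extract(path)

    return await asyncio.gather(*(one(p) for p in paths))
```

Assuming Kreuzberg is installed, `asyncio.run(extract_many(["a.pdf", "b.docx"], extract_file))` would return one result per path, in the same order as the input.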
Docker
# Start the API server and publish it on port 8000
docker run -p 8000:8000 goldziher/kreuzberg
# Upload a document to the /extract endpoint
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
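The same upload can be made from Python using only the standard library. This is a sketch under stated assumptions: `build_multipart` and `post_file` are hypothetical helpers, the form field name `file` and the URL are taken from the curl command, and the response is returned as raw bytes because its format is not described here.

```python
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(url: str, path: str) -> bytes:
    """POST the file at `path` to `url` under the form field "file"; return the raw response body."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the container running, `post_file("http://localhost:8000/extract", "document.pdf")` mirrors the curl call.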
📖 Installation Guide • CLI Documentation • API Reference
Deployment Options
🤖 MCP Server (AI Integration)
Add to Claude Desktop with one command:
claude mcp add kreuzberg uvx -- --from "kreuzberg[all]" kreuzberg-mcp
Or configure manually in claude_desktop_config.json:
{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"]
    }
  }
}
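If the config file already lists other servers, the entry has to be merged in rather than pasted over them. The script below is one way to do that; `add_kreuzberg_server` and `update_config_file` are hypothetical helper names, and the location of `claude_desktop_config.json` is left to the caller because it varies by platform.

```python
import json
from pathlib import Path


def add_kreuzberg_server(config: dict) -> dict:
    """Return a copy of `config` with the kreuzberg entry added under "mcpServers"."""
    servers = dict(config.get("mcpServers", {}))
    servers["kreuzberg"] = {
        "command": "uvx",
        "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"],
    }
    return {**config, "mcpServers": servers}


def update_config_file(path: Path) -> None:
    """Read the JSON config at `path` (or start empty), add the entry, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = add_kreuzberg_server(config)
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
```

Existing servers and unrelated top-level keys are preserved; only the `kreuzberg` entry is added or overwritten.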
MCP capabilities:
- Extract text from PDFs, images, Office docs, and more
- Full OCR support with multiple engines
- Table extraction and metadata parsing
📖 MCP Documentation
Supported Formats
| Category | Formats |
|----------|---------|
| Documents | PDF, DOCX, DOC, RTF, TXT, EPUB |
| Images | JPG, PNG, TIFF, BMP, GIF, WEBP |
| Spreadsheets | XLSX, XLS, CSV, ODS |
| Presentations | PPTX, PPT, ODP |
| Web | HTML, XML, MHTML |
| Archives | Support via extraction |
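When routing files before extraction, it can help to map a file extension back to its category in the table. The lookup below is a small sketch: `FORMATS` and `category_for` are hypothetical names, the extensions are copied from the table, and Archives is omitted because no extensions are listed for it.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported-formats table, keyed by category.
FORMATS = {
    "Documents": {"pdf", "docx", "doc", "rtf", "txt", "epub"},
    "Images": {"jpg", "png", "tiff", "bmp", "gif", "webp"},
    "Spreadsheets": {"xlsx", "xls", "csv", "ods"},
    "Presentations": {"pptx", "ppt", "odp"},
    "Web": {"html", "xml", "mhtml"},
}


def category_for(path: str) -> Optional[str]:
    """Return the table category for `path`'s extension, or None if it is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in FORMATS.items():
        if ext in extensions:
            return category
    return None
```

The match is case-insensitive and looks only at the final suffix, so `Report.PDF` resolves to Documents while an extensionless file resolves to `None`.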
📊 Performance Characteristics
View comprehensive benchmarks • Benchmark methodology • Detailed Analysis
Technical Specifications
Throughput (tiny files) | 31.78 files/s | 23.94 files/s | Highest throughput |
Throughput (small files) | 8.91 files/s | 9.31 files/s | Highest throughput |
Memory footprint | 359.8 MB | 395.2 MB | Lowest usage |
Installation size | 71 MB | 71 MB | Smallest size |
Success rate | 100% | 100% | Perfect |
Supported formats | 18 | 18 | Comprehensive |
Architecture Advantages
- Native C extensions: Built on PDFium and Tesseract for maximum performance
- Async/await support: True asynchronous processing with intelligent task scheduling
- Memory efficiency: Streaming architecture minimizes memory allocation
- Process pooling: Automatic multiprocessing for CPU-intensive operations
- Optimized data flow: Efficient data handling with minimal transformations
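The combination of async/await and process pooling listed above usually means handing CPU-bound work to an executor so the event loop stays responsive. The helper below sketches that general pattern with the standard library; `map_in_executor` is a hypothetical name, and nothing here reflects how Kreuzberg schedules its own work internally.

```python
import asyncio
from concurrent.futures import Executor
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")
R = TypeVar("R")


async def map_in_executor(
    func: Callable[[A], R], items: Iterable[A], executor: Executor
) -> list[R]:
    """Run `func` on each item in `executor` without blocking the event loop; results keep input order."""
    loop = asyncio.get_running_loop()
    # Each call is submitted to the executor and wrapped in an awaitable future.
    futures = [loop.run_in_executor(executor, func, item) for item in items]
    return await asyncio.gather(*futures)
```

Passing a `concurrent.futures.ProcessPoolExecutor` gives process pooling, provided `func` and its arguments are picklable (a module-level function, for example); a `ThreadPoolExecutor` fits I/O-bound work better.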
Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.
License
MIT License - see LICENSE for details.