kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats

3.9.0

PyPI

Maintainers: 1

Kreuzberg

A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 Complete Documentation

Framework Overview

Document Intelligence Capabilities

Text Extraction: High-fidelity text extraction preserving document structure and formatting
Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
OCR Integration: Multiple OCR engines (Tesseract, EasyOCR, PaddleOCR) with automatic fallback
Table Detection: Structured table extraction with cell-level precision via GMFT integration
Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)
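
The "automatic fallback" in the OCR bullet can be pictured as trying engines in order until one returns usable text. The following is a minimal sketch of that pattern only: the engine functions are hypothetical stand-ins, and nothing here is taken from Kreuzberg's actual implementation.

```python
from collections.abc import Callable, Sequence

# Hypothetical engine type: takes image bytes, returns recognized text.
OcrEngine = Callable[[bytes], str]


class AllEnginesFailed(RuntimeError):
    """Raised when every engine in the chain fails or returns no text."""


def ocr_with_fallback(
    image: bytes, engines: Sequence[tuple[str, OcrEngine]]
) -> tuple[str, str]:
    """Try each (name, engine) pair in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception as exc:  # an engine may be missing or crash on this input
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise AllEnginesFailed("; ".join(errors))


# Stand-in engines for illustration only (no real OCR happens here).
def broken_engine(image: bytes) -> str:
    raise RuntimeError("engine not installed")


def empty_engine(image: bytes) -> str:
    return "   "


def working_engine(image: bytes) -> str:
    return f"recognized {len(image)} bytes"
```

Calling `ocr_with_fallback(data, [("first", broken_engine), ("second", empty_engine), ("third", working_engine)])` skips the two failing engines and returns the result of the third.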

Technical Architecture

Performance: Highest throughput among the Python document processing frameworks benchmarked (about 32 files/second on tiny files with the sync API)
Resource Efficiency: 71 MB installation size, roughly 360 MB runtime memory footprint
Extensibility: Plugin architecture for custom extractors via the Extractor base class
API Design: Synchronous and asynchronous APIs with consistent interfaces
Type Safety: Complete type annotations throughout the codebase
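
To illustrate the general shape of a plugin architecture built around a base class, here is a self-contained sketch. The names `SketchExtractor`, `PlainTextExtractor`, and `Registry` are hypothetical and chosen for this example; Kreuzberg's real Extractor interface and registration mechanism may look different, so consult the documentation before writing an actual plugin.

```python
from abc import ABC, abstractmethod


class SketchExtractor(ABC):
    """Hypothetical base class; Kreuzberg's real Extractor interface may differ."""

    # MIME types this extractor claims to handle.
    supported_mime_types: frozenset[str] = frozenset()

    @abstractmethod
    def extract(self, data: bytes) -> str:
        """Return the text content of the document."""


class PlainTextExtractor(SketchExtractor):
    supported_mime_types = frozenset({"text/plain"})

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


class Registry:
    """Maps MIME types to extractor instances; later registrations win."""

    def __init__(self) -> None:
        self._by_mime: dict[str, SketchExtractor] = {}

    def register(self, extractor: SketchExtractor) -> None:
        for mime in extractor.supported_mime_types:
            self._by_mime[mime] = extractor

    def extract(self, data: bytes, mime_type: str) -> str:
        try:
            extractor = self._by_mime[mime_type]
        except KeyError:
            raise ValueError(f"no extractor registered for {mime_type}") from None
        return extractor.extract(data)


registry = Registry()
registry.register(PlainTextExtractor())
print(registry.extract(b"hello", "text/plain"))  # prints: hello
```

Dispatching on MIME type keeps the calling code unchanged when a new format is added: only a new subclass and one `register` call are needed.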

Open Source Foundation

Kreuzberg leverages established open source technologies:

Pandoc: Universal document converter for robust format support
PDFium: Google's PDF rendering engine for accurate PDF processing
Tesseract: Open source OCR engine for text recognition (originally developed at HP, later sponsored by Google)
Python-docx/pptx: Native Microsoft Office format support

Quick Start

Extract Text with CLI

# Extract text from any file to markdown
uvx kreuzberg extract document.pdf > output.md

# With all features (OCR, table extraction, etc.)
uvx --from "kreuzberg[all]" kreuzberg extract invoice.pdf --ocr --format markdown

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --format json
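
For scripted use, the same commands can be driven from Python. This sketch only assembles the flags that appear in the examples above (`--ocr`, `--show-metadata`, `--format`) and assumes `uvx` is on the PATH; the helper names `build_extract_command` and `run_extract` are made up for this example, and the shape of the CLI output is not assumed.

```python
import subprocess
from typing import Any, Optional


def build_extract_command(
    path: str,
    *,
    ocr: bool = False,
    show_metadata: bool = False,
    output_format: Optional[str] = None,
    all_extras: bool = False,
) -> list[str]:
    """Build the uvx argv for `kreuzberg extract`, using only the flags shown above."""
    cmd = ["uvx"]
    if all_extras:
        cmd += ["--from", "kreuzberg[all]"]
    cmd += ["kreuzberg", "extract", path]
    if ocr:
        cmd.append("--ocr")
    if show_metadata:
        cmd.append("--show-metadata")
    if output_format is not None:
        cmd += ["--format", output_format]
    return cmd


def run_extract(path: str, **options: Any) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on a non-zero exit."""
    completed = subprocess.run(
        build_extract_command(path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.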

Python Usage

Async (recommended for web apps):

import asyncio

from kreuzberg import extract_file


async def main() -> None:
    # extract_file is a coroutine, so it must be awaited inside an async function
    result = await extract_file("presentation.pptx")
    print(result.content)

    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")


asyncio.run(main())

Sync (for scripts and CLI tools):

from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")

Docker

# Run the REST API
docker run -p 8000:8000 goldziher/kreuzberg

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
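
The curl call above can also be made from Python with only the standard library. This is a sketch under assumptions: the endpoint URL and the `file` form field come from the curl example, while the helper names are invented here and the response is returned as raw bytes because its schema is not described in this README.

```python
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes, boundary: str) -> bytes:
    """Encode one file as a multipart/form-data body, similar in shape to `curl -F`."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail


def extract_via_api(path: str, url: str = "http://localhost:8000/extract") -> bytes:
    """POST a file to the /extract endpoint and return the raw response body."""
    boundary = uuid.uuid4().hex
    with open(path, "rb") as fh:
        body = build_multipart("file", os.path.basename(path), fh.read(), boundary)
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()
```

With the container from the Docker example running, `extract_via_api("document.pdf")` sends the same kind of request as the curl command.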

📖 Installation Guide • CLI Documentation • API Reference

Deployment Options

🤖 MCP Server (AI Integration)

Add to Claude Code with one command:

claude mcp add kreuzberg uvx -- --from "kreuzberg[all]" kreuzberg-mcp

Or, for Claude Desktop, configure manually in claude_desktop_config.json:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"]
    }
  }
}
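
If the config file already lists other servers, the entry has to be merged in rather than pasted over the file. Below is a small sketch of such a merge; the function name `add_kreuzberg_server` is hypothetical, and the caller supplies the path because the location of claude_desktop_config.json depends on the operating system.

```python
import json
from pathlib import Path

# Server entry copied from the JSON snippet above.
KREUZBERG_SERVER = {
    "command": "uvx",
    "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"],
}


def add_kreuzberg_server(config_path: Path) -> dict:
    """Add the kreuzberg entry to mcpServers, preserving any other settings."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    config.setdefault("mcpServers", {})["kreuzberg"] = KREUZBERG_SERVER
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Back up the existing file before running something like this against a real configuration.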

MCP capabilities:

Extract text from PDFs, images, Office docs, and more
Full OCR support with multiple engines
Table extraction and metadata parsing

📖 MCP Documentation

Supported Formats

Category	Formats
Documents	PDF, DOCX, DOC, RTF, TXT, EPUB
Images	JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets	XLSX, XLS, CSV, ODS
Presentations	PPTX, PPT, ODP
Web	HTML, XML, MHTML
Archives	Support via extraction
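
When routing files in a pipeline, the table above can be turned into a lookup from file extension to category. This sketch transcribes only the extensions listed in the table (so variants such as `.jpeg` are not covered), and the helper name `category_for` is made up for the example; it says nothing about how Kreuzberg itself detects formats.

```python
from pathlib import Path
from typing import Optional

# Transcribed from the table above; "Archives" lists no specific extensions.
FORMATS_BY_CATEGORY = {
    "Documents": ["PDF", "DOCX", "DOC", "RTF", "TXT", "EPUB"],
    "Images": ["JPG", "PNG", "TIFF", "BMP", "GIF", "WEBP"],
    "Spreadsheets": ["XLSX", "XLS", "CSV", "ODS"],
    "Presentations": ["PPTX", "PPT", "ODP"],
    "Web": ["HTML", "XML", "MHTML"],
}

# Invert the table: lowercase extension -> category name.
CATEGORY_BY_EXTENSION = {
    ext.lower(): category
    for category, extensions in FORMATS_BY_CATEGORY.items()
    for ext in extensions
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if the extension is not listed."""
    extension = Path(filename).suffix.lstrip(".").lower()
    return CATEGORY_BY_EXTENSION.get(extension)
```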

📊 Performance Characteristics

View comprehensive benchmarks • Benchmark methodology • Detailed Analysis

Technical Specifications

Metric	Kreuzberg Sync	Kreuzberg Async	Benchmark Ranking
Throughput (tiny files)	31.78 files/s	23.94 files/s	Highest throughput
Throughput (small files)	8.91 files/s	9.31 files/s	Highest throughput
Memory footprint	359.8 MB	395.2 MB	Lowest usage
Installation size	71 MB	71 MB	Smallest size
Success rate	100%	100%	Perfect
Supported formats	18	18	Comprehensive
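
The throughput rows are easier to plan with when converted to per-file latency and batch time. The sketch below does that arithmetic with the figures from the table; it assumes the rate holds steady over a whole batch, which real workloads may not satisfy. The ratios also show that the sync API is about 33% faster on tiny files (31.78 / 23.94 ≈ 1.33), while the async API is about 4% faster on small files (9.31 / 8.91 ≈ 1.04).

```python
# Throughput figures (files per second) copied from the table above.
THROUGHPUT = {
    ("tiny", "sync"): 31.78,
    ("tiny", "async"): 23.94,
    ("small", "sync"): 8.91,
    ("small", "async"): 9.31,
}


def ms_per_file(files_per_second: float) -> float:
    """Average time spent on one file, in milliseconds."""
    return 1000.0 / files_per_second


def batch_minutes(n_files: int, files_per_second: float) -> float:
    """Estimated minutes to process n_files at a constant rate."""
    return n_files / files_per_second / 60.0


for (size, api), rate in THROUGHPUT.items():
    print(
        f"{size:>5} {api:>5}: {ms_per_file(rate):6.1f} ms/file, "
        f"{batch_minutes(10_000, rate):5.1f} min per 10,000 files"
    )
```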

Architecture Advantages

Native C extensions: Built on PDFium and Tesseract for maximum performance
Async/await support: True asynchronous processing with intelligent task scheduling
Memory efficiency: Streaming architecture minimizes memory allocation
Process pooling: Automatic multiprocessing for CPU-intensive operations
Optimized data flow: Efficient data handling with minimal transformations
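
One common way to benefit from an async API on a large batch is to run many extractions concurrently while capping how many are in flight. The sketch below shows that pattern with a stand-in coroutine, `fake_extract`, so it runs without any dependencies; it is an illustration of bounded concurrency in asyncio, not a description of Kreuzberg's internal scheduling.

```python
import asyncio


async def fake_extract(path: str) -> str:
    """Stand-in for an async extraction call; sleeps instead of doing real work."""
    await asyncio.sleep(0.01)
    return f"text of {path}"


async def extract_many(paths: list[str], max_concurrent: int = 4) -> list[str]:
    """Process all paths concurrently, but never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path: str) -> str:
        async with semaphore:
            return await fake_extract(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))


results = asyncio.run(extract_many([f"doc{i}.pdf" for i in range(10)]))
print(len(results), results[0])  # prints: 10 text of doc0.pdf
```

In real use, `fake_extract` would be replaced by an actual async call such as `extract_file` from the Python Usage section, with `max_concurrent` tuned to the available memory and CPU.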

Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.

License

MIT License - see LICENSE for details.

Keywords

async, document-analysis, document-classification, document-intelligence, document-processing, extensible
