kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats

Version 3.9.0 (PyPI) · 1 maintainer

Kreuzberg

A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 Complete Documentation

Framework Overview

Document Intelligence Capabilities

  • Text Extraction: High-fidelity text extraction preserving document structure and formatting
  • Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
  • Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
  • OCR Integration: Multiple OCR engines (Tesseract, EasyOCR, PaddleOCR) with automatic fallback
  • Table Detection: Structured table extraction with cell-level precision via GMFT integration
  • Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)

Technical Architecture

  • Performance: Highest throughput among the Python document processing frameworks benchmarked (30+ files/second on tiny files with the sync API)
  • Resource Efficiency: 71 MB installation size, ~360 MB runtime memory footprint (sync API; ~395 MB async)
  • Extensibility: Plugin architecture for custom extractors via the Extractor base class
  • API Design: Synchronous and asynchronous APIs with consistent interfaces
  • Type Safety: Complete type annotations throughout the codebase

Open Source Foundation

Kreuzberg leverages established open source technologies:

  • Pandoc: Universal document converter for robust format support
  • PDFium: Google's PDF rendering engine for accurate PDF processing
  • Tesseract: Open source OCR engine (originally from HP, later sponsored by Google) for text recognition
  • Python-docx/pptx: Native Microsoft Office format support

Quick Start

Extract Text with CLI

# Extract text from any file to markdown
uvx kreuzberg extract document.pdf > output.md

# With all features (OCR, table extraction, etc.)
uvx --from "kreuzberg[all]" kreuzberg extract invoice.pdf --ocr --format markdown

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --format json
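
When the CLI is driven from a script, the JSON output mode is convenient to parse. The following is a minimal sketch, assuming `uvx` is on the PATH and that `--format json` writes a single JSON document to stdout (the output schema is not described here, so the helper simply returns whatever was parsed). `build_extract_command` and `run_extract` are hypothetical helper names, not part of Kreuzberg.

```python
import json
import subprocess


def build_extract_command(path: str, show_metadata: bool = True) -> list:
    """Build the argv for `uvx kreuzberg extract <path> ... --format json`."""
    cmd = ["uvx", "kreuzberg", "extract", path]
    if show_metadata:
        cmd.append("--show-metadata")
    cmd += ["--format", "json"]
    return cmd


def run_extract(path: str) -> object:
    """Run the CLI and parse stdout as JSON (assumed output format)."""
    completed = subprocess.run(
        build_extract_command(path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Keeping the argument list separate from the subprocess call makes the command easy to log or inspect before anything is executed.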

Python Usage

Async (recommended for web apps):

import asyncio

from kreuzberg import extract_file


async def main() -> None:
    # extract_file must be awaited, so it is called inside an async function
    result = await extract_file("presentation.pptx")
    print(result.content)

    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")


asyncio.run(main())

Sync (for scripts and CLI tools):

from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")

Docker

# Run the REST API
docker run -p 8000:8000 goldziher/kreuzberg

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
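
The same request can be issued from Python with only the standard library. This is a sketch under assumptions: the endpoint and the `file` form field come from the curl command above, while the JSON response body is assumed rather than documented here. `build_multipart` and `extract_via_api` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/octet-stream") -> tuple:
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_via_api(path: str, url: str = "http://localhost:8000/extract") -> object:
    """POST a file to the running container (assumes a JSON response body)."""
    with open(path, "rb") as fh:
        body, header = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```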

📖 Installation Guide · CLI Documentation · API Reference

Deployment Options

🤖 MCP Server (AI Integration)

Register the server with one command using the claude CLI:

claude mcp add kreuzberg uvx -- --from "kreuzberg[all]" kreuzberg-mcp

Or configure manually in claude_desktop_config.json:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"]
    }
  }
}
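
If the config file already lists other MCP servers, the entry above has to be merged in rather than pasted over them. A minimal sketch of that merge, assuming the file is plain JSON with a top-level `mcpServers` object as shown; the file location differs per platform, so it is passed in as a parameter. `add_kreuzberg_server` and `update_config_file` are hypothetical helper names.

```python
import copy
import json
from pathlib import Path

# Server entry copied from the JSON snippet above.
KREUZBERG_SERVER = {
    "command": "uvx",
    "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"],
}


def add_kreuzberg_server(config: dict) -> dict:
    """Return a copy of config with the kreuzberg entry added under mcpServers."""
    merged = copy.deepcopy(config)
    servers = merged.setdefault("mcpServers", {})
    servers["kreuzberg"] = copy.deepcopy(KREUZBERG_SERVER)
    return merged


def update_config_file(path: Path) -> None:
    """Rewrite the config file, keeping any servers that are already listed."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = add_kreuzberg_server(existing)
    path.write_text(json.dumps(updated, indent=2), encoding="utf-8")
```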

MCP capabilities:

  • Extract text from PDFs, images, Office docs, and more
  • Full OCR support with multiple engines
  • Table extraction and metadata parsing

📖 MCP Documentation

Supported Formats

Category        Formats
Documents       PDF, DOCX, DOC, RTF, TXT, EPUB
Images          JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets    XLSX, XLS, CSV, ODS
Presentations   PPTX, PPT, ODP
Web             HTML, XML, MHTML
Archives        Support via extraction
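
For pre-filtering a directory before extraction, the table can be turned into a simple lookup. This is an illustrative sketch only: it classifies by file extension and does not reflect how Kreuzberg itself detects file types. `FORMATS` and `category_for` are hypothetical names.

```python
from pathlib import Path
from typing import Optional

# Categories and extensions copied from the table above (archives omitted,
# since the table lists no specific archive extensions).
FORMATS = {
    "Documents": {"pdf", "docx", "doc", "rtf", "txt", "epub"},
    "Images": {"jpg", "png", "tiff", "bmp", "gif", "webp"},
    "Spreadsheets": {"xlsx", "xls", "csv", "ods"},
    "Presentations": {"pptx", "ppt", "odp"},
    "Web": {"html", "xml", "mhtml"},
}


def category_for(path: str) -> Optional[str]:
    """Return the table category for a path, or None if the extension is not listed."""
    suffix = Path(path).suffix.lower().lstrip(".")
    for category, extensions in FORMATS.items():
        if suffix in extensions:
            return category
    return None
```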

📊 Performance Characteristics

View comprehensive benchmarks · Benchmark methodology · Detailed Analysis

Technical Specifications

Metric                     Kreuzberg Sync   Kreuzberg Async   Benchmarked
Throughput (tiny files)    31.78 files/s    23.94 files/s     Highest throughput
Throughput (small files)   8.91 files/s     9.31 files/s      Highest throughput
Memory footprint           359.8 MB         395.2 MB          Lowest usage
Installation size          71 MB            71 MB             Smallest size
Success rate               100%             100%              Perfect
Supported formats          18               18                Comprehensive
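
To check a files-per-second figure on your own documents, a rough timing harness is enough. This sketch is not the project's benchmark methodology; it only measures wall-clock time around a callable you supply. `measure_throughput` is a hypothetical helper, and `extract` stands in for any one-argument extraction function, such as `extract_file_sync`.

```python
import time
from typing import Callable, Iterable


def measure_throughput(extract: Callable[[str], object], paths: Iterable[str]) -> float:
    """Return files processed per second of wall-clock time."""
    items = list(paths)
    if not items:
        return 0.0
    start = time.perf_counter()
    for path in items:
        extract(path)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(items) / elapsed if elapsed > 0 else float("inf")
```

Results from a harness like this depend heavily on file size and hardware, which is why the table reports tiny and small files separately.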

Architecture Advantages

  • Native C extensions: Built on PDFium and Tesseract for maximum performance
  • Async/await support: True asynchronous processing with intelligent task scheduling
  • Memory efficiency: Streaming architecture minimizes memory allocation
  • Process pooling: Automatic multiprocessing for CPU-intensive operations
  • Optimized data flow: Efficient data handling with minimal transformations
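
On the caller's side, the async API can be combined with bounded concurrency so that many files are in flight without overwhelming the machine. This is a sketch of that pattern, not of Kreuzberg's internal scheduling or process pooling. `extract_many` is a hypothetical helper, and `extract` is any coroutine function that takes a path.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def extract_many(
    extract: Callable[[str], Awaitable[object]],
    paths: Sequence[str],
    max_concurrency: int = 4,
) -> list:
    """Run extract() over paths with at most max_concurrency calls in flight.

    Results are returned in the same order as the input paths.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: str) -> object:
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(bounded(p) for p in paths))
```

With Kreuzberg installed, passing `extract_file` as `extract` should fit this shape, since the Python usage section awaits it with a single path argument.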

Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.

License

MIT License - see LICENSE for details.
