PDF Oxide - The Fastest PDF Toolkit for Python, Rust, Go, JS/TS, C#, WASM, CLI & AI
More language bindings coming in May 2026. Java, Ruby, PHP, Swift, and Kotlin are on the roadmap. Want another language? Open an issue and tell us.
The fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with bindings for Python, Go, JavaScript / TypeScript, C# / .NET, and WASM, plus a CLI tool and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. Dual-licensed under MIT or Apache-2.0.

New in v0.3.24 — now available in Go, JavaScript / TypeScript, and C# / .NET, alongside the existing Python, Rust, and WASM bindings.
Same Rust core, same 0.8 ms extraction speed, same 100% pass rate.
See the language guides: Python · Go · JavaScript / TypeScript · C# / .NET · WASM
Quick Start
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)
pip install pdf_oxide
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;
[dependencies]
pdf_oxide = "0.3"
CLI
pdf-oxide text document.pdf
pdf-oxide markdown document.pdf -o output.md
pdf-oxide search document.pdf "pattern"
pdf-oxide merge a.pdf b.pdf -o combined.pdf
brew install yfedoseev/tap/pdf-oxide
MCP Server (for AI assistants)
brew install yfedoseev/tap/pdf-oxide
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}
Why pdf_oxide?
- Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
- Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
- Complete — Text extraction, image extraction, PDF creation, and editing in one library
- Multi-platform — Rust, Python, Go, JavaScript/TypeScript, C#/.NET, WASM, CLI, and MCP server for AI assistants
- Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects
Performance
Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.
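The methodology above (single thread, per-file timeout, no warm-up) can be approximated with a small harness. This is a minimal sketch, not the project's actual benchmark code: the `benchmark` function and the `extract` callable are illustrative names, and the timeout is only checked after each call returns. Enforcing a hard 60 s limit would need process isolation.

```python
import statistics
import time
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], object],
              paths: Iterable[str],
              timeout_s: float = 60.0) -> dict:
    """Time `extract` once per file, sequentially, with no warm-up run.

    `extract` is a hypothetical callable that takes a file path and raises
    on failure; wrap the library under test in it. The timeout classifies
    slow files after the fact rather than interrupting them.
    """
    timings_ms = []
    passed = 0
    total = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            continue  # a raised error counts as a failed file
        elapsed = time.perf_counter() - start
        if elapsed > timeout_s:
            continue  # too slow: counted as a timeout, not a pass
        passed += 1
        timings_ms.append(elapsed * 1000.0)
    return {
        "files": total,
        "pass_rate": passed / total if total else 0.0,
        "mean_ms": statistics.fmean(timings_ms) if timings_ms else 0.0,
    }
```

Only files that pass contribute to the mean, so a library that fails quickly on difficult inputs is not rewarded with a lower average.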
Python Libraries
| Library | Mean | p99 | Pass Rate | License |
|---------|------|-----|-----------|---------|
| PDF Oxide | 0.8ms | 9ms | 100% | MIT |
| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |
| pymupdf4llm | 55.5ms | 280ms | 99.1% | AGPL-3.0 |
| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |
| pdfminer | 16.8ms | 124ms | 98.8% | MIT |
| pdfplumber | 23.2ms | 189ms | 98.8% | MIT |
| markitdown | 108.8ms | 378ms | 98.6% | MIT |
| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |
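The speedup figures quoted in this README follow from the mean times in the table above. A quick check using only those numbers (the names `MEAN_MS` and `speedup` are illustrative):

```python
# Mean extraction time per document in milliseconds, copied from the table above.
MEAN_MS = {
    "PDF Oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdf": 12.1,
    "pdfplumber": 23.2,
}


def speedup(library: str, baseline: str = "PDF Oxide") -> float:
    """Return the ratio of `library`'s mean time to `baseline`'s mean time."""
    return MEAN_MS[library] / MEAN_MS[baseline]


for name in ("PyMuPDF", "pypdf", "pdfplumber"):
    print(f"{name}: {speedup(name):.2f}x slower than PDF Oxide")
```

The ratios come out at roughly 5.8, 15.1, and 29.0, consistent with the 5×, 15×, and 29× figures quoted in this README.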
Rust Libraries
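| Library | Mean | p99 | Pass Rate | Text Extraction |
|---------|------|-----|-----------|-----------------|
| PDF Oxide | 0.8ms | 9ms | 100% | Built-in |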
| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic |
| unpdf | 2.8ms | 10ms | 95.1% | Basic |
| pdf_extract | 4.08ms | 37ms | 91.5% | Basic |
| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction |
Text Quality
99.5% text parity vs PyMuPDF and pypdfium2 across the full corpus. Among "hard" files where only one of two libraries recovers the text, PDF Oxide is the one that succeeds 7–10× more often than the competitor, for every competitor tested.
Corpus
100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
Features
| Extract | Create | Edit |
|---------|--------|------|
| Text & Layout | Documents | Annotations |
| Images | Tables | Form Fields |
| Forms | Graphics | Bookmarks |
| Annotations | Templates | Links |
| Bookmarks | Images | Content |
Python API
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count()}")
print(f"Version: {doc.version()}")

# Extract text within a region of page 0
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# Words with bounding boxes
words = doc.extract_words(0)
for w in words:
    print(f"{w.text} at {w.bbox}")

# Override the word-gap threshold
words = doc.extract_words(0, word_gap_threshold=2.5)

# Text lines, with default and explicit thresholds
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# Layout parameters for page 0
params = doc.page_layout_params(0)
print(f"word gap: {params.word_gap_threshold:.1f}, line gap: {params.line_gap_threshold:.1f}")

# Preset extraction profiles
from pdf_oxide import ExtractionProfile
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# Tables
tables = doc.extract_tables(0)
for table in tables:
    print(f"Table with {table.row_count} rows")

# Plain text and per-character data
text = doc.extract_text(0)
chars = doc.extract_chars(0)
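The calls above all take a page index. To process a whole document, one option is a small helper that loops over `page_count()`. This is a sketch under two assumptions: that `page_count()` returns an integer and that pages are zero-indexed, as the `0` arguments above suggest. `extract_all_text` is a hypothetical helper, not part of the library.

```python
from typing import List


def extract_all_text(doc, separator: str = "\n\n") -> str:
    """Concatenate the text of every page in `doc`.

    `doc` is expected to behave like the PdfDocument shown above:
    `page_count()` gives the number of pages and `extract_text(i)`
    returns the text of zero-based page `i` (both assumed here).
    """
    pages: List[str] = []
    for index in range(doc.page_count()):
        pages.append(doc.extract_text(index))
    return separator.join(pages)


# Usage (assumes pdf_oxide is installed and report.pdf exists):
#   from pdf_oxide import PdfDocument
#   text = extract_all_text(PdfDocument("report.pdf"))
```

Because the helper only relies on two methods, it can be exercised in unit tests with a stand-in object instead of a real PDF.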
Form Fields
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.save("filled.pdf")
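When filling many fields at once, it can help to check the names against `get_form_fields()` first so that typos are reported. A minimal sketch built only on the calls shown above; `fill_form` is a hypothetical helper, and how the library itself reacts to an unknown field name is not covered here.

```python
from typing import List, Mapping


def fill_form(doc, values: Mapping[str, str], output_path: str) -> List[str]:
    """Set each named form field on `doc`, then save to `output_path`.

    Returns the names in `values` that match no field reported by
    `get_form_fields()`. Unknown names are skipped, not written.
    """
    known = {field.name for field in doc.get_form_fields()}
    missing = [name for name in values if name not in known]
    for name, value in values.items():
        if name in known:
            doc.set_form_field_value(name, value)
    doc.save(output_path)
    return missing


# Usage (assumes pdf_oxide is installed and w2.pdf has these fields):
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("w2.pdf")
#   unknown = fill_form(doc, {"employee_name": "Jane Doe"}, "filled.pdf")
```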
Rust API
use pdf_oxide::PdfDocument;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;
    let text = doc.extract_text(0)?;
    let chars = doc.extract_chars(0)?;
    let images = doc.extract_images(0)?;
    let paths = doc.extract_paths(0)?;
    Ok(())
}
Form Fields (Rust)
use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;
let mut editor = DocumentEditor::open("w2.pdf")?;
editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.save_with_options("filled.pdf", SaveOptions::incremental())?;
Installation
Python
pip install pdf_oxide
Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.
Rust
[dependencies]
pdf_oxide = "0.3"
JavaScript/WASM
npm install pdf-oxide-wasm
const { WasmPdfDocument } = require("pdf-oxide-wasm");
CLI
brew install yfedoseev/tap/pdf-oxide
cargo install pdf_oxide_cli
cargo binstall pdf_oxide_cli
MCP Server
brew install yfedoseev/tap/pdf-oxide
cargo install pdf_oxide_mcp
Other languages
- Go: go get github.com/yfedoseev/pdf_oxide/go (see go/README.md)
- JavaScript / TypeScript (Node.js): npm install pdf-oxide (see js/README.md)
- C# / .NET: dotnet add package PdfOxide (see csharp/README.md)
All three share the same Rust core as the Python and WASM bindings, so everything you read in this README applies to them as well — just with each language's native naming conventions.
CLI
22 commands for PDF processing directly from your terminal:
pdf-oxide text report.pdf
pdf-oxide markdown report.pdf -o report.md
pdf-oxide html report.pdf -o report.html
pdf-oxide info report.pdf
pdf-oxide search report.pdf "neural.?network"
pdf-oxide images report.pdf -o ./images/
pdf-oxide merge a.pdf b.pdf -o combined.pdf
pdf-oxide split report.pdf -o ./pages/
pdf-oxide watermark doc.pdf "DRAFT"
pdf-oxide forms w2.pdf --fill "name=Jane"
Run pdf-oxide with no arguments for interactive REPL mode. Use --pages 1-5 to process specific pages, --json for machine-readable output.
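For scripting, the CLI can be driven from Python. The sketch below only builds the argument list, using the subcommands and flags named above (`--pages`, `--json`, `-o`). The helper `build_command` is hypothetical, and placing the flags after the positional arguments is an assumption based on the examples in this section.

```python
import shlex
from typing import List, Optional


def build_command(subcommand: str, pdf: str, *,
                  pages: Optional[str] = None,
                  output: Optional[str] = None,
                  json_output: bool = False) -> List[str]:
    """Build an argv list for the pdf-oxide CLI without running it."""
    argv = ["pdf-oxide", subcommand, pdf]
    if pages is not None:
        argv += ["--pages", pages]
    if output is not None:
        argv += ["-o", output]
    if json_output:
        argv.append("--json")
    return argv


cmd = build_command("text", "report.pdf", pages="1-5", json_output=True)
print(shlex.join(cmd))  # prints: pdf-oxide text report.pdf --pages 1-5 --json
```

The resulting list can be passed to `subprocess.run(cmd, check=True, capture_output=True, text=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.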
MCP Server
pdf-oxide-mcp lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the Model Context Protocol.
Add to your MCP client configuration:
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}
The server exposes an extract tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.
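If the MCP client configuration already lists other servers, the pdf-oxide entry has to be merged in rather than pasted over the file. A minimal sketch of that merge on an in-memory dictionary; `add_pdf_oxide_server` and the `other` server are hypothetical, and the location of the configuration file depends on the client.

```python
import json


def add_pdf_oxide_server(config: dict) -> dict:
    """Return a copy of `config` with the pdf-oxide MCP server registered.

    Existing servers are preserved; an existing "pdf-oxide" entry is replaced.
    """
    servers = dict(config.get("mcpServers", {}))
    servers["pdf-oxide"] = {"command": "crgx", "args": ["pdf_oxide_mcp@latest"]}
    merged = dict(config)
    merged["mcpServers"] = servers
    return merged


# "other" stands in for whatever servers the client already has configured.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
print(json.dumps(add_pdf_oxide_server(existing), indent=2))
```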
Building from Source
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release
cargo test
maturin develop
cargo build --release --lib
Documentation
Use Cases
- RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
- AI assistants — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
- Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
- Data extraction — Pull structured data from forms, tables, and layouts
- Academic research — Parse papers, extract citations, and process large corpora
- PDF generation — Create invoices, reports, certificates, and templated documents programmatically
- PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions
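For the RAG use case above, converted Markdown usually has to be split into chunks before indexing. One simple approach is to split at headings. This is a sketch that assumes the Markdown uses ATX headings (lines starting with `#`); `split_by_heading` is a hypothetical helper, and it does not skip `#` lines inside code blocks.

```python
import re
from typing import List

# An ATX heading: one to six '#' characters at the start of a line, then a space.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_by_heading(markdown: str) -> List[str]:
    """Split Markdown into chunks that each start at a heading.

    Text before the first heading becomes its own chunk; empty chunks
    are dropped.
    """
    starts = [m.start() for m in HEADING.finditer(markdown)]
    bounds = [0] + [s for s in starts if s != 0] + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


sample = "intro\n# A\ntext a\n## B\ntext b"
for chunk in split_by_heading(sample):
    print(repr(chunk))
```

Heading-based chunks keep each section's title next to its body, which tends to give a retriever more context than fixed-size windows do.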
Why I built this
I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes — fast, MIT, multi-language — so I wrote it. The Rust core is what does the real work; the bindings for Python, Go, JS/TS, C#, and WASM are thin shells around the same code, so a bug fix in one lands in all of them. It now passes 100% of the veraPDF + Mozilla pdf.js + DARPA SafeDocs test corpora (3,830 PDFs) on every platform I've tested.
If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, open an issue — I read all of them.
— Yury
License
Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings
Citation
@software{pdf_oxide,
title = {PDF Oxide: Fast PDF Toolkit for Rust, Python, Go, JavaScript, and C#},
author = {Yury Fedoseev},
year = {2025},
url = {https://github.com/yfedoseev/pdf_oxide}
}
Rust + Python + Go + JS/TS + C# + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than PyMuPDF