
docstrange
Extract and Convert PDF, Word, PowerPoint, Excel, images, URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.

DocStrange converts documents to Markdown, JSON, CSV, and HTML quickly and accurately.

Free cloud processing for up to 10,000 docs per month!
Extract document data instantly with cloud processing; no complex setup needed.
Local processing!
Use gpu mode for 100% local processing: no data is sent anywhere, and everything stays on your machine.
August 2025
Convert and extract data from PDF, DOCX, images, and more into clean Markdown and structured JSON. Plus: Advanced table extraction, 100% local processing, and a built-in web UI.
DocStrange is a Python library for converting a wide range of document formats (including PDF, DOCX, PPTX, XLSX, and images) into clean, usable data. It produces LLM-optimized Markdown, structured JSON (with schema support), HTML, and CSV outputs, making it an ideal tool for preparing content for RAG pipelines and other AI applications.
The library offers both a powerful cloud API and a 100% private, offline mode that runs locally on your GPU. Developed by Nanonets, DocStrange is built on a powerful pipeline of OCR and layout detection models and currently requires Python >=3.8.
To report a bug or request a feature, please file an issue. To ask a question or request assistance, please use the discussions forum.
DocStrange focuses on end-to-end document understanding (OCR → layout → tables → clean Markdown or structured JSON) that you can run 100% locally. It is designed to deliver high-quality results from scans and photos without requiring the integration of multiple services.
DocStrange offers a completely private, local processing option and gives you full control over the conversion pipeline.
DocStrange is a ready-to-use parsing pipeline, not just a framework. It handles the complex OCR and layout analysis so you don't have to build it yourself.
DocStrange is specifically built for robust OCR on scans and phone photos, not just digitally-native PDFs.
Try the live demo: Test DocStrange instantly in your browser, with no installation required, at docstrange.nanonets.com

Install the library using pip:
pip install docstrange
New to DocStrange? Try the online demo first; no installation needed!
1. Convert any Document to LLM-Ready Markdown
This is the most common use case. Turn a complex PDF or DOCX file into clean, structured Markdown, perfect for RAG pipelines and other LLM applications.
from docstrange import DocumentExtractor
# Initialize extractor (cloud mode by default)
extractor = DocumentExtractor()
# Convert any document to clean markdown
result = extractor.extract("document.pdf")
markdown = result.extract_markdown()
print(markdown)
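For RAG pipelines, the Markdown string usually needs to be split into chunks before indexing. The following is a minimal sketch that splits on ATX headings using only the standard library. The function name `split_markdown_by_heading` and the sample text are hypothetical and not part of DocStrange; the sketch only assumes that `extract_markdown()` returns a plain `str`, as the example above shows.

```python
import re
from typing import Dict, List


def split_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at each '#' heading."""
    chunks: List[Dict[str, str]] = []
    title = ""
    body: List[str] = []

    def flush() -> None:
        # Store the current chunk unless it is completely empty.
        if title or any(part.strip() for part in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical sample standing in for result.extract_markdown()
sample = "# Intro\nHello.\n\n## Details\nLine one.\nLine two.\n"
chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(chunk["title"], "->", len(chunk["text"]), "chars")
```

This simple splitter does not special-case `#` lines inside fenced code blocks, so a production chunker would need to track fences as well.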
2. Extract Structured Data as JSON
Go beyond plain text and extract all detected entities and content from your document into a structured JSON format.
from docstrange import DocumentExtractor
# Extract document as structured JSON
extractor = DocumentExtractor()
result = extractor.extract("document.pdf")
# Get all important data as flat JSON
json_data = result.extract_data()
print(json_data)
3. Extract Specific Fields from a PDF or Invoice
Target only the key-value data you need, such as extracting the invoice_number or total_amount directly from a document.
from docstrange import DocumentExtractor
# Extract only the fields you need
extractor = DocumentExtractor()
result = extractor.extract("invoice.pdf")
# Specify exactly which fields to extract
fields = result.extract_data(specified_fields=[
    "invoice_number", "total_amount", "vendor_name", "due_date"
])
print(fields)
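After a field extraction it can be useful to check which requested fields came back empty. The sketch below assumes the response envelope that this README shows in its cloud-mode output comment (`{"extracted_fields": {...}, "format": "specified_fields"}`) and falls back to treating the response as a flat dict. The helper `missing_fields` and the sample response are hypothetical illustrations, not library API or real output.

```python
from typing import Any, Dict, List


def missing_fields(response: Dict[str, Any], wanted: List[str]) -> List[str]:
    """Return the requested field names that are absent, None, or empty strings."""
    # Unwrap the "extracted_fields" envelope when present (assumed shape).
    fields = response.get("extracted_fields", response)
    return [name for name in wanted if fields.get(name) in (None, "")]


wanted = ["invoice_number", "total_amount", "vendor_name", "due_date"]

# Made-up response used only to exercise the helper
sample_response = {
    "extracted_fields": {
        "invoice_number": "INV-001",
        "total_amount": None,
        "vendor_name": "Acme",
    },
    "format": "specified_fields",
}

gaps = missing_fields(sample_response, wanted)
print("Fields needing review:", gaps)
```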
4. Extract with Custom JSON Schema
Ensure the structure of your output by providing a custom JSON schema. This is ideal for getting reliable, nested data structures for applications that process contracts or complex forms.
from docstrange import DocumentExtractor
# Extract data conforming to your schema
extractor = DocumentExtractor()
result = extractor.extract("contract.pdf")
# Define your required structure
schema = {
    "contract_number": "string",
    "parties": ["string"],
    "total_value": "number",
    "start_date": "string",
    "terms": ["string"]
}
structured_data = result.extract_data(json_schema=schema)
print(structured_data)
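The schema style above uses type names (`"string"`, `"number"`), one-element lists, and nested dicts. As a sketch of how a caller might double-check the returned data against that shape, the hypothetical helper below walks both structures and reports mismatches. It is not part of DocStrange and makes no claim about how the library enforces schemas; if the response is wrapped in an envelope such as `{"structured_data": {...}}`, pass the inner dict.

```python
from typing import Any, List


def check_against_schema(data: Any, schema: Any, path: str = "$") -> List[str]:
    """Return a list of mismatches between data and a type-name schema."""
    if schema == "string":
        return [] if isinstance(data, str) else [f"{path}: expected string"]
    if schema == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        ok = isinstance(data, (int, float)) and not isinstance(data, bool)
        return [] if ok else [f"{path}: expected number"]
    if isinstance(schema, list):
        if not isinstance(data, list):
            return [f"{path}: expected list"]
        if not schema:
            return []  # an empty list schema places no constraint on items
        problems: List[str] = []
        for index, item in enumerate(data):
            problems += check_against_schema(item, schema[0], f"{path}[{index}]")
        return problems
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            return [f"{path}: expected object"]
        problems = []
        for key, sub_schema in schema.items():
            if key not in data:
                problems.append(f"{path}.{key}: missing")
            else:
                problems += check_against_schema(data[key], sub_schema, f"{path}.{key}")
        return problems
    return [f"{path}: unsupported schema entry {schema!r}"]


schema = {"contract_number": "string", "parties": ["string"], "total_value": "number"}
# Made-up data with two deliberate mismatches
data = {"contract_number": "C-42", "parties": ["A", 7], "total_value": "1000"}
problems = check_against_schema(data, schema)
print(problems)
```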
Local Processing
For complete privacy and offline capability, run DocStrange entirely on your own machine using GPU processing.
from docstrange import DocumentExtractor
# Force local GPU processing (requires CUDA)
extractor = DocumentExtractor(gpu=True)
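Since GPU mode requires CUDA, a script may want to fall back to the default cloud mode on machines without an NVIDIA GPU. The sketch below uses a rough heuristic (whether the `nvidia-smi` tool is on the PATH) to choose constructor arguments. The helper `extractor_kwargs` is hypothetical, and finding `nvidia-smi` does not guarantee a working CUDA setup.

```python
import shutil
from typing import Dict


def extractor_kwargs(prefer_local: bool = True) -> Dict[str, bool]:
    """Choose DocumentExtractor keyword arguments based on a simple GPU heuristic."""
    has_nvidia_tool = shutil.which("nvidia-smi") is not None
    if prefer_local and has_nvidia_tool:
        return {"gpu": True}
    # An empty dict means the constructor defaults (cloud mode) apply.
    return {}


kwargs = extractor_kwargs()
print("DocumentExtractor kwargs:", kwargs)
# With docstrange installed: extractor = DocumentExtractor(**kwargs)
```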
Want a GUI? Run the simple, drag-and-drop local web interface for private, offline document conversion.
For users who prefer a graphical interface, DocStrange includes a powerful, self-hosted web UI. This allows for easy drag-and-drop conversion of PDF, DOCX, and other files directly in your browser, with 100% private, offline processing on your own GPU. The interface automatically downloads required models on its first run.
pip install "docstrange[web]"
# Method 1: Using the CLI command
docstrange web
# Method 2: Using Python module
python -m docstrange.web_app
# Method 3: Direct Python import
python -c "from docstrange.web_app import run_web_app; run_web_app()"
Open http://localhost:8000 (or the port shown in the terminal) in your browser.
# Run on a different port
docstrange web --port 8080
python -c "from docstrange.web_app import run_web_app; run_web_app(port=8080)"
# Run with debug mode for development
python -c "from docstrange.web_app import run_web_app; run_web_app(debug=True)"
# Make accessible from other devices on the network
python -c "from docstrange.web_app import run_web_app; run_web_app(host='0.0.0.0')"
# Use a different port
docstrange web --port 8001
# Install with all dependencies
pip install -e ".[web]"
# Or install Flask separately
pip install Flask
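If the default port is already taken, you can probe for a free one before launching the web UI. This is a small standard-library sketch; the helpers `port_is_free` and `first_free_port` are hypothetical, and the host the web UI actually binds to is an assumption (`127.0.0.1` here).

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 8000, attempts: int = 20) -> int:
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")


if __name__ == "__main__":
    # Print the command to run; the --port flag is the one documented above.
    print(f"docstrange web --port {first_free_port()}")
```

The check is inherently racy: another process can claim the port between the probe and the launch, so treat the result as a hint.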
Cloud Alternative
Need cloud processing? Use the official DocStrange Cloud service: docstrange.nanonets.com
You can use DocStrange in three main ways: as a simple Web Interface, as a flexible Python Library, or as a powerful Command Line Interface (CLI). This section provides a summary of the library's key capabilities, followed by detailed guides and examples for each method.
a. Convert Multiple File Types
DocStrange natively handles a wide variety of formats, returning the most appropriate output for each.
from docstrange import DocumentExtractor
extractor = DocumentExtractor()
# PDF document
pdf_result = extractor.extract("report.pdf")
print(pdf_result.extract_markdown())
# Word document
docx_result = extractor.extract("document.docx")
print(docx_result.extract_data())
# Excel spreadsheet
excel_result = extractor.extract("data.xlsx")
print(excel_result.extract_csv())
# PowerPoint presentation
pptx_result = extractor.extract("slides.pptx")
print(pptx_result.extract_html())
# Image with text
image_result = extractor.extract("screenshot.png")
print(image_result.extract_text())
# Web page
url_result = extractor.extract("https://example.com")
print(url_result.extract_markdown())
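When processing a mixed batch, the per-format choices above can be folded into one dispatch function. The sketch below picks an extraction method name from the file suffix, mirroring the pairings in the example. The mapping is an illustrative choice rather than a library rule, and `pick_method` and `convert` are hypothetical helpers that accept any object with an `extract()` method.

```python
from pathlib import PurePosixPath
from typing import Any
from urllib.parse import urlparse

# Mirrors the pairings used in the example above; adjust to taste.
METHOD_BY_SUFFIX = {
    ".pdf": "extract_markdown",
    ".docx": "extract_data",
    ".xlsx": "extract_csv",
    ".pptx": "extract_html",
    ".png": "extract_text",
}


def pick_method(source: str) -> str:
    """Choose an extraction method name for a file path or URL."""
    if urlparse(source).scheme in ("http", "https"):
        return "extract_markdown"
    suffix = PurePosixPath(source).suffix.lower()
    # Fall back to Markdown for suffixes not listed above.
    return METHOD_BY_SUFFIX.get(suffix, "extract_markdown")


def convert(extractor: Any, source: str) -> Any:
    """Run extractor.extract(source) and call the method chosen for that source."""
    result = extractor.extract(source)
    return getattr(result, pick_method(source))()


for name in ["report.pdf", "data.XLSX", "https://example.com", "notes.txt"]:
    print(name, "->", pick_method(name))
```

With docstrange installed, `convert(DocumentExtractor(), "report.pdf")` would follow the same path as the first example above.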
b. Extract Tables to CSV
Easily extract all tables from a document into a clean CSV format.
# Extract all tables from a document
result = extractor.extract("financial_report.pdf")
csv_data = result.extract_csv()
print(csv_data)
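Because `extract_csv()` returns a string, it can be parsed with Python's built-in `csv` module. The sketch below assumes the string holds a single table with a header row; this README does not say how multiple tables are separated, so that case is not handled. The helper `csv_to_rows` and the sample data are hypothetical.

```python
import csv
import io
from typing import Dict, List


def csv_to_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Made-up sample standing in for result.extract_csv()
sample_csv = "quarter,revenue\nQ1,100\nQ2,120\n"
rows = csv_to_rows(sample_csv)

# Values arrive as strings, so convert before doing arithmetic.
total = sum(float(row["revenue"]) for row in rows)
print(rows, total)
```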
c. Extract Specific Fields & Structured Data
You can go beyond simple conversion and extract data in the exact structure you require. There are two ways to do this. You can either target and pull only the key-value data you need or ensure the structure of your output by providing a custom JSON schema.
# Extract specific fields from any document
result = extractor.extract("invoice.pdf")
# Method 1: Extract specific fields
extracted = result.extract_data(specified_fields=[
    "invoice_number",
    "total_amount",
    "vendor_name",
    "due_date"
])
# Method 2: Extract using JSON schema
schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "line_items": [{
        "description": "string",
        "amount": "number"
    }]
}
structured = result.extract_data(json_schema=schema)
d. Cloud Mode Usage Examples:
Use DocStrange's cloud mode to extract precise, structured data from various documents, either by specifying a list of fields to find or by enforcing a custom JSON schema for the output. Authenticate with docstrange login or a free API key to get 10,000 documents/month.
from docstrange import DocumentExtractor
# Default cloud mode (rate-limited without API key)
extractor = DocumentExtractor()
# Authenticated mode (10k docs/month) - run 'docstrange login' first
extractor = DocumentExtractor() # Auto-uses cached credentials
# With API key for 10k docs/month (alternative to login)
extractor = DocumentExtractor(api_key="your_api_key_here")
# Extract specific fields from invoice
result = extractor.extract("invoice.pdf")
# Extract key invoice information
invoice_fields = result.extract_data(specified_fields=[
    "invoice_number",
    "total_amount",
    "vendor_name",
    "due_date",
    "items_count"
])
print("Extracted Invoice Fields:")
print(invoice_fields)
# Output: {"extracted_fields": {"invoice_number": "INV-001", ...}, "format": "specified_fields"}
# Extract structured data using schema
invoice_schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "billing_address": {
        "street": "string",
        "city": "string",
        "zip_code": "string"
    },
    "line_items": [{
        "description": "string",
        "quantity": "number",
        "unit_price": "number",
        "total": "number"
    }],
    "taxes": {
        "tax_rate": "number",
        "tax_amount": "number"
    }
}
structured_invoice = result.extract_data(json_schema=invoice_schema)
print("Structured Invoice Data:")
print(structured_invoice)
# Output: {"structured_data": {...}, "schema": {...}, "format": "structured_json"}
# Extract from different document types
receipt = extractor.extract("receipt.jpg")
receipt_data = receipt.extract_data(specified_fields=[
    "merchant_name", "total_amount", "date", "payment_method"
])
contract = extractor.extract("contract.pdf")
contract_schema = {
    "parties": [{
        "name": "string",
        "role": "string"
    }],
    "contract_value": "number",
    "start_date": "string",
    "end_date": "string",
    "key_terms": ["string"]
}
contract_data = contract.extract_data(json_schema=contract_schema)
e. Chain with LLM
The clean Markdown output is perfect for use in Retrieval-Augmented Generation (RAG) and other LLM workflows.
# Perfect for LLM workflows
document_text = extractor.extract("research_paper.pdf").extract_markdown()
# Use with any LLM
response = your_llm_client.chat(
    messages=[{
        "role": "user",
        "content": f"Summarize this research paper:\n\n{document_text}"
    }]
)
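Long documents can exceed an LLM's context window, so it may help to cap the text before building the prompt. The sketch below trims at the last paragraph break inside a character budget and returns the same message structure used above. The helper `build_summary_messages` and the default budget are hypothetical; a real pipeline would more likely count tokens than characters.

```python
from typing import Dict, List


def build_summary_messages(document_text: str, max_chars: int = 12000) -> List[Dict[str, str]]:
    """Build a single-message chat payload, trimming the document to max_chars."""
    if len(document_text) > max_chars:
        # Prefer cutting at the last paragraph break inside the budget.
        cut = document_text.rfind("\n\n", 0, max_chars)
        document_text = document_text[: cut if cut > 0 else max_chars]
    return [{
        "role": "user",
        "content": f"Summarize this research paper:\n\n{document_text}",
    }]


# Made-up text standing in for extract_markdown() output
messages = build_summary_messages("First paragraph.\n\nSecond paragraph.", max_chars=20)
print(messages[0]["content"])
```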
DocStrange uses a multi-stage process to create structured output from documents.
DocStrange offers free cloud processing with different tiers to ensure fair usage.
To move from the anonymous free tier to authenticated access, run docstrange login.
# Free tier usage (limited calls daily)
extractor = DocumentExtractor()
# Authenticated access (10k docs/month) - run 'docstrange login' first
extractor = DocumentExtractor() # Auto-uses cached credentials
# API key access (10k docs/month)
extractor = DocumentExtractor(api_key="your_api_key_here")
Tip: Start with the anonymous free tier to test functionality, then authenticate with docstrange login for the full 10,000 documents/month limit.
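To keep an API key out of source code, one option is to read it from an environment variable and pass it to the constructor only when it is set. In the sketch below, the variable name `DOCSTRANGE_API_KEY` and the helper `auth_kwargs` are hypothetical choices for illustration; nothing here implies that the library reads this variable on its own.

```python
import os
from typing import Dict, Mapping, Optional

ENV_VAR = "DOCSTRANGE_API_KEY"  # hypothetical name chosen for this sketch


def auth_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return {'api_key': ...} when the variable is set, else an empty dict."""
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    # An empty dict leaves the constructor on its default (free tier or cached login).
    return {"api_key": key} if key else {}


kwargs = auth_kwargs()
print("Using API key:", bool(kwargs))
# With docstrange installed: extractor = DocumentExtractor(**kwargs)
```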
Prefer a GUI? Try the web interface for drag-and-drop document conversion!
For automation, scripting, and batch processing, you can use DocStrange directly from your terminal.
Authentication Commands
# One-time login for free 10k docs/month (alternative to api key)
docstrange login
# Alternatively
docstrange --login
# Re-authenticate if needed
docstrange login --reauth
# Logout and clear cached credentials
docstrange --logout
Document Processing
# Basic conversion (cloud mode default - limited calls free!)
docstrange document.pdf
# Authenticated processing (10k docs/month for free after login)
docstrange document.pdf
# With API key for 10k docs/month access (alternative to login)
docstrange document.pdf --api-key YOUR_API_KEY
# Local processing modes
docstrange document.pdf --gpu-mode
# Different output formats
docstrange document.pdf --output json
docstrange document.pdf --output html
docstrange document.pdf --output csv
# Extract specific fields
docstrange invoice.pdf --output json --extract-fields invoice_number total_amount
# Extract with JSON schema
docstrange document.pdf --output json --json-schema schema.json
# Multiple files
docstrange *.pdf --output markdown
# Save to file
docstrange document.pdf --output-file result.md
# Comprehensive field extraction examples
docstrange invoice.pdf --output json --extract-fields invoice_number vendor_name total_amount due_date line_items
# Extract from different document types with specific fields
docstrange receipt.jpg --output json --extract-fields merchant_name total_amount date payment_method
docstrange contract.pdf --output json --extract-fields parties contract_value start_date end_date
# Using JSON schema files for structured extraction
docstrange invoice.pdf --output json --json-schema invoice_schema.json
docstrange contract.pdf --output json --json-schema contract_schema.json
# Combine with authentication for 10k docs/month access (after 'docstrange login')
docstrange document.pdf --output json --extract-fields title author date summary
# Or use API key for 10k docs/month access (alternative to login)
docstrange document.pdf --api-key YOUR_API_KEY --output json --extract-fields title author date summary
Example schema.json file:
{
  "invoice_number": "string",
  "total_amount": "number",
  "vendor_name": "string",
  "billing_address": {
    "street": "string",
    "city": "string",
    "zip_code": "string"
  },
  "line_items": [{
    "description": "string",
    "quantity": "number",
    "unit_price": "number"
  }]
}
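For batch scripts, the schema file and the command line can be generated from Python. The sketch below writes a schema dict to `schema.json` in a temporary directory and assembles an argument list using only the flags shown above (`--output`, `--extract-fields`, `--json-schema`). The helpers `build_command` and `write_schema` are hypothetical, and the command is built but not executed.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence


def build_command(
    document: str,
    output: str = "json",
    fields: Optional[Sequence[str]] = None,
    schema_path: Optional[str] = None,
) -> List[str]:
    """Assemble a docstrange CLI invocation as an argument list."""
    cmd = ["docstrange", document, "--output", output]
    if fields:
        cmd += ["--extract-fields", *fields]
    if schema_path:
        cmd += ["--json-schema", schema_path]
    return cmd


def write_schema(schema: Dict[str, Any], directory: str) -> str:
    """Write the schema as schema.json inside directory and return its path."""
    path = Path(directory) / "schema.json"
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return str(path)


with tempfile.TemporaryDirectory() as tmp:
    schema_file = write_schema({"invoice_number": "string", "total_amount": "number"}, tmp)
    cmd = build_command("invoice.pdf", schema_path=schema_file)
    saved = json.loads(Path(schema_file).read_text(encoding="utf-8"))
    print(" ".join(cmd))
    # To run it: subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.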
This section details the main classes and methods for programmatic use.
a. DocumentExtractor
DocumentExtractor(
    api_key: str = None,  # API key for 10k docs/month (or use 'docstrange login' for same limits)
    model: str = None,    # Model for cloud processing ("gemini", "openapi", "nanonets")
    cpu: bool = False,    # Force local CPU processing
    gpu: bool = False     # Force local GPU processing
)
b. ConversionResult Methods
result.extract_markdown() -> str # Clean markdown output
result.extract_data(                     # Structured JSON
    specified_fields: List[str] = None,  # Extract specific fields
    json_schema: Dict = None             # Extract with schema
) -> Dict
result.extract_html() -> str # Formatted HTML
result.extract_csv() -> str # CSV format for tables
result.extract_text() -> str # Plain text
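As a sketch of how these methods might be combined, the hypothetical helper below writes every output format for one result to disk. It relies only on the method names and return types listed above. `FakeResult` is a stand-in used to exercise the code without the library, and whether each method succeeds for every document type (for example, CSV for a document without tables) is not something this sketch assumes.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def export_all(result: Any, out_dir: str, stem: str = "document") -> Dict[str, str]:
    """Write Markdown, HTML, CSV, text, and JSON outputs; return extension -> path."""
    producers = {
        "md": result.extract_markdown,
        "html": result.extract_html,
        "csv": result.extract_csv,
        "txt": result.extract_text,
        "json": lambda: json.dumps(result.extract_data(), indent=2),
    }
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: Dict[str, str] = {}
    for extension, produce in producers.items():
        path = target / f"{stem}.{extension}"
        path.write_text(produce(), encoding="utf-8")
        written[extension] = str(path)
    return written


class FakeResult:
    """Stand-in with the same method names, used only to exercise the sketch."""

    def extract_markdown(self) -> str:
        return "# Title"

    def extract_html(self) -> str:
        return "<h1>Title</h1>"

    def extract_csv(self) -> str:
        return "a,b\n1,2\n"

    def extract_text(self) -> str:
        return "Title"

    def extract_data(self) -> Dict[str, Any]:
        return {"title": "Title"}


with tempfile.TemporaryDirectory() as tmp:
    paths = export_all(FakeResult(), tmp, stem="report")
    markdown_on_disk = Path(paths["md"]).read_text(encoding="utf-8")
    extensions = sorted(paths)
    print(extensions)
```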
The DocStrange repository includes an optional MCP (Model Context Protocol) server for local development that enables intelligent document processing in Claude Desktop with token-aware navigation.
Note: The MCP server is designed for local development and is not included in the PyPI package. Clone the repository to use it locally.
Features
Local Setup
git clone https://github.com/nanonets/docstrange.git
cd docstrange
pip install -e ".[dev]"
Add the server to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):
{
  "mcpServers": {
    "docstrange": {
      "command": "python3",
      "args": ["/path/to/docstrange/mcp_server_module/server.py"]
    }
  }
}
For detailed setup and usage, see mcp_server_module/README.md
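If the Claude Desktop config already lists other MCP servers, the docstrange entry should be merged in rather than overwriting the file. The sketch below does that merge on an in-memory dict. The helper `add_docstrange_server` and the pre-existing `other` server entry are hypothetical, and the server path is the placeholder from the config above.

```python
import copy
import json
from typing import Any, Dict


def add_docstrange_server(config: Dict[str, Any], server_path: str) -> Dict[str, Any]:
    """Return a copy of config with a 'docstrange' entry under 'mcpServers'."""
    updated = copy.deepcopy(config)
    servers = updated.setdefault("mcpServers", {})
    servers["docstrange"] = {"command": "python3", "args": [server_path]}
    return updated


# Made-up existing config with one unrelated server
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
merged = add_docstrange_server(
    existing, "/path/to/docstrange/mcp_server_module/server.py"
)
print(json.dumps(merged, indent=2))
```

To apply it for real, load the JSON file with `json.load`, pass the dict through the helper, and write the result back with `json.dump`.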
DocStrange is a powerful open-source library developed and maintained by the team at Nanonets. The full Nanonets platform is an AI-driven solution for automating end-to-end document processing for businesses. The platform allows technical and non-technical teams to build complete automated document workflows.
This is an actively developed open-source project, and we welcome your feedback and contributions.
Star this repo if you find it helpful! Your support helps us improve the library.
License: This project is licensed under the MIT License.
We found that docstrange demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.