doc2mark

PyPI v0.4.0 | Python | License: MIT

Turn any document into clean Markdown -- in one line.

Features

  • Converts PDFs, DOCX/XLSX/PPTX, images, HTML, CSV/JSON, and more
  • AI-powered OCR via OpenAI, Google Gemini (Vertex AI), or Tesseract
  • Preserves complex tables (merged cells, rowspan/colspan)
  • One unified API + CLI for single files or entire directories
  • Batch processing with parallel execution
  • Per-call token usage tracking for OpenAI and Vertex AI providers

Install

# Core (no OCR)
pip install doc2mark

# With OpenAI OCR
pip install doc2mark[ocr]

# With Google Gemini / Vertex AI OCR
pip install doc2mark[vertex_ai]

# Everything
pip install doc2mark[all]
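Before constructing a loader, you can check that the optional dependencies behind an extra actually installed; a minimal sketch (the module names mapped to each extra are my assumptions, not documented by doc2mark):

```python
import importlib.util

def module_available(name: str) -> bool:
    """True if `name` can be imported, without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:  # raised for dotted names whose parent is missing
        return False

# Assumed backing module for each install extra (assumption, not from doc2mark docs)
EXTRA_MODULES = {
    "ocr": "openai",
    "vertex_ai": "google.cloud.aiplatform",
    "tesseract": "pytesseract",
}

missing = [extra for extra, mod in EXTRA_MODULES.items() if not module_available(mod)]
```

`missing` then lists the extras you would need to `pip install` before enabling the corresponding OCR provider.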

Quick start

from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader()
result = loader.load("document.pdf")
print(result.content)

OCR providers

doc2mark supports three OCR providers. Pass ocr_provider to UnifiedDocumentLoader to choose one.

OpenAI (default)

Uses GPT-4.1 vision. Requires an API key.

export OPENAI_API_KEY=sk-...

loader = UnifiedDocumentLoader(ocr_provider="openai")

result = loader.load(
    "scanned_doc.pdf",
    extract_images=True,
    ocr_images=True,
)

Customize the model or use an OpenAI-compatible endpoint:

loader = UnifiedDocumentLoader(
    ocr_provider="openai",
    model="gpt-4o-mini",                     # cheaper model
    base_url="http://localhost:11434/v1",     # self-hosted / Ollama
    api_key="any-string",
)

Google Gemini / Vertex AI

Uses Gemini models via Google Cloud. Authenticates with Application Default Credentials.

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
loader = UnifiedDocumentLoader(
    ocr_provider="vertex_ai",
    project="my-gcp-project",          # or set GOOGLE_CLOUD_PROJECT
)

result = loader.load("scan.pdf", extract_images=True, ocr_images=True)

Override model and region:

loader = UnifiedDocumentLoader(
    ocr_provider="vertex_ai",
    project="my-gcp-project",
    model="gemini-2.0-flash",          # default: gemini-3.1-flash-lite-preview
    location="us-central1",            # default: global
)

Tesseract (offline)

Local OCR, no API key needed. Requires Tesseract installed on your system.

from doc2mark.ocr.base import OCRConfig

loader = UnifiedDocumentLoader(
    ocr_provider="tesseract",
    ocr_config=OCRConfig(language="chinese"),   # optional language hint
)

result = loader.load("scan.png", extract_images=True, ocr_images=True)

Provider comparison

| Provider | Requires | Best for | Install extra |
| --- | --- | --- | --- |
| openai | OPENAI_API_KEY | Highest accuracy, complex layouts | pip install doc2mark[ocr] |
| vertex_ai | GCP service account | Google Cloud workflows, Gemini models | pip install doc2mark[vertex_ai] |
| tesseract | Tesseract binary | Offline / air-gapped environments | pip install doc2mark[ocr] |
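The comparison suggests a natural fallback order when choosing a provider at runtime; a hypothetical helper sketching that logic (the selection order is my assumption, not doc2mark behavior):

```python
import os
import shutil

def pick_ocr_provider(env=os.environ) -> str:
    """Prefer hosted OCR when credentials exist, else fall back to local Tesseract."""
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "vertex_ai"
    if shutil.which("tesseract"):  # Tesseract binary on PATH
        return "tesseract"
    raise RuntimeError("no OCR provider available")
```

The result can be passed straight to `UnifiedDocumentLoader(ocr_provider=...)`.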

Supported formats

| Category | Formats |
| --- | --- |
| Office | DOCX, XLSX, PPTX |
| PDF | PDF (text + scanned) |
| Images | PNG, JPG, WEBP, TIFF, BMP, GIF, HEIC, HEIF, AVIF |
| Text / Data | TXT, CSV, TSV, JSON, JSONL |
| Markup | HTML, XML, Markdown |
| Legacy | DOC, XLS, PPT, RTF, PPS (requires LibreOffice) |
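The table above can be sketched as a simple extension lookup, e.g. to pre-filter a directory before batch processing (a hypothetical helper, not part of doc2mark's API):

```python
from pathlib import Path

# Extension -> category, mirroring the supported-formats table
FORMAT_CATEGORIES = {
    **dict.fromkeys([".docx", ".xlsx", ".pptx"], "office"),
    **dict.fromkeys([".pdf"], "pdf"),
    **dict.fromkeys(
        [".png", ".jpg", ".webp", ".tiff", ".bmp", ".gif", ".heic", ".heif", ".avif"],
        "image",
    ),
    **dict.fromkeys([".txt", ".csv", ".tsv", ".json", ".jsonl"], "data"),
    **dict.fromkeys([".html", ".xml", ".md"], "markup"),
    **dict.fromkeys([".doc", ".xls", ".ppt", ".rtf", ".pps"], "legacy"),
}

def format_category(path: str) -> str:
    """Classify a file by its extension, case-insensitively."""
    return FORMAT_CATEGORIES.get(Path(path).suffix.lower(), "unsupported")
```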

Common recipes

Single file

from doc2mark import load

# Text-only extraction (no OCR)
md = load("report.pdf").content

# With OCR for embedded images
md = load("report.pdf", extract_images=True, ocr_images=True).content

Batch processing

from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(ocr_provider="openai")

loader.batch_process(
    input_dir="documents/",
    output_dir="converted/",
    extract_images=True,
    ocr_images=True,
    save_files=True,
    show_progress=True,
)

Process specific files

from doc2mark import batch_process_files

results = batch_process_files(
    ["invoice.pdf", "contract.docx", "receipt.png"],
    output_dir="output/",
    extract_images=True,
    ocr_images=True,
)

OCR prompt templates

doc2mark includes specialized prompts for different content types:

loader = UnifiedDocumentLoader(
    ocr_provider="openai",
    prompt_template="table_focused",    # optimized for tables
)

Available templates: default, table_focused, document_focused, multilingual, form_focused, receipt_focused, handwriting_focused, code_focused.
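One way to use these is a small routing table from document kind to template; a sketch (the template names come from the list above, but the routing itself is invented for illustration):

```python
# Hypothetical mapping from document kind to doc2mark prompt template
TEMPLATE_BY_KIND = {
    "invoice": "table_focused",
    "receipt": "receipt_focused",
    "form": "form_focused",
    "handwritten_notes": "handwriting_focused",
    "source_listing": "code_focused",
}

def choose_template(kind: str) -> str:
    """Fall back to the default template for unrecognized kinds."""
    return TEMPLATE_BY_KIND.get(kind, "default")
```

The chosen name can then be passed as `prompt_template=` when constructing the loader.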

Table output styles

Control how complex tables (with merged cells) are rendered:

loader = UnifiedDocumentLoader(
    table_style="minimal_html",     # clean HTML with rowspan/colspan (default)
    # table_style="markdown_grid",  # markdown with merge annotations
    # table_style="styled_html",    # full HTML with inline styles
)

Token usage tracking

When using OpenAI or Vertex AI, each OCR result includes token usage in its metadata:

from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(ocr_provider="openai")
result = loader.load("scan.pdf", extract_images=True, ocr_images=True)

# Token usage per OCR call is in result metadata
usage = result.metadata.get("token_usage", {})
print(usage)
# {"input_tokens": 1234, "output_tokens": 567, "total_tokens": 1801}
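These per-call dicts can be summed across a batch and turned into a rough cost estimate; a sketch with placeholder prices (the per-million-token rates below are made up; substitute your provider's actual pricing):

```python
def total_usage(usages):
    """Sum a list of token_usage dicts into one combined total."""
    total = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}
    for u in usages:
        for key in total:
            total[key] += u.get(key, 0)
    return total

def estimate_cost(usage, input_per_m=0.40, output_per_m=1.60):
    """USD cost for one usage dict; rates are placeholders per million tokens."""
    return (
        usage["input_tokens"] * input_per_m
        + usage["output_tokens"] * output_per_m
    ) / 1_000_000
```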

CLI

# Single file to stdout
doc2mark report.pdf

# Save to file
doc2mark report.pdf -o report.md

# Batch convert a directory
doc2mark documents/ -o converted/ -r

# With OpenAI OCR
doc2mark scan.pdf --ocr openai --ocr-images

# With Tesseract OCR
doc2mark scan.pdf --ocr tesseract --ocr-images

# Disable OCR entirely
doc2mark report.pdf --ocr none --no-ocr-images

# JSON output
doc2mark report.pdf --format json

License

MIT -- see LICENSE.

Keywords

document-processing
