
markitdown-pro
MarkItDown-Pro is a Python library that converts 50+ document formats into Markdown, built to power RAG (Retrieval-Augmented Generation) pipelines for semantic search. It extends Microsoft MarkItDown with Azure AI services, per-page OCR, and customizable converter pipelines.
- async, designed for concurrent document processing
- prebuilt-layout model
- Component | filename.ext | page N | method | message format for Log Analytics

| Category | Formats |
|---|---|
| PDF | .pdf (text, scanned, mixed, with per-page routing) |
| Office | .docx, .pptx (via Gotenberg + DocIntelligence + MarkItDown) |
| Spreadsheet | .csv, .tsv, .xls, .xlsx |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .svg, .tiff, .webp, .heic, .heif |
| Audio | .mp3, .wav |
| Email | .eml, .msg, .p7s |
| Archives | .pst (Outlook) |
| E-books | .epub |
| Notebooks | .ipynb |
| Markup | .html, .htm, .xml, .json, .ndjson, .yaml, .yml |
| Text | .txt, .md, .py, .go |
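
The "detect extension" step can be pictured as a lookup from file suffix to a category in the table above. The sketch below is illustrative only: `EXTENSION_CATEGORIES`, `CATEGORY_BY_EXTENSION`, and `category_for` are hypothetical names, the `PDF` and `Email` labels are assumed category names, and the library's real routing table may be organised differently.

```python
from pathlib import Path

# Hypothetical map built from the format table; not the library's own structure.
EXTENSION_CATEGORIES = {
    "PDF": [".pdf"],
    "Office": [".docx", ".pptx"],
    "Spreadsheet": [".csv", ".tsv", ".xls", ".xlsx"],
    "Images": [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".svg",
               ".tiff", ".webp", ".heic", ".heif"],
    "Audio": [".mp3", ".wav"],
    "Email": [".eml", ".msg", ".p7s"],
    "Archives": [".pst"],
    "E-books": [".epub"],
    "Notebooks": [".ipynb"],
    "Markup": [".html", ".htm", ".xml", ".json", ".ndjson", ".yaml", ".yml"],
    "Text": [".txt", ".md", ".py", ".go"],
}

# Invert to extension -> category so each lookup is a single dict access.
CATEGORY_BY_EXTENSION = {
    ext: category
    for category, extensions in EXTENSION_CATEGORIES.items()
    for ext in extensions
}


def category_for(path):
    """Return the table category for a path, or None if the suffix is not listed."""
    return CATEGORY_BY_EXTENSION.get(Path(path).suffix.lower())
```

Lower-casing the suffix means `REPORT.PDF` and `report.pdf` resolve to the same category.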
ConversionPipeline (async)
|-- detect extension
|-- route to Handler
| |-- try Converter 1 (primary)
| |-- try Converter 2 (fallback)
| |-- try Converter N
|-- validate content
|-- clean markdown
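
The flow above (try converters in order, validate the result, clean the Markdown) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: `run_with_fallback`, `clean_markdown`, the `min_chars` validation rule, and the three stub converters are all hypothetical.

```python
import asyncio
import re


def clean_markdown(text):
    """Strip trailing spaces and collapse runs of 3+ newlines (assumed cleanup rules)."""
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


async def run_with_fallback(converters, path, min_chars=1):
    """Try each (converter, name) pair in order; return the first valid result."""
    for convert, name in converters:
        try:
            markdown = await convert(path)
        except Exception:
            continue  # this converter failed, so move on to the next one
        # "validate content": here simply "not empty after stripping"
        if markdown and len(markdown.strip()) >= min_chars:
            return clean_markdown(markdown), name
    return None, None


# Stub converters standing in for real ones (hypothetical).
async def failing(path):
    raise RuntimeError("service unavailable")


async def empty(path):
    return "   "


async def local(path):
    return "# Title  \n\n\n\nBody\n"


result = asyncio.run(
    run_with_fallback(
        [(failing, "primary"), (empty, "fallback"), (local, "last")],
        "doc.pdf",
    )
)
```

Returning the converter name alongside the Markdown makes it easy to log which step of the chain produced the output.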
| Handler | Pipeline (in order) | What each captures |
|---|---|---|
| PDFHandler | MarkItDown (all-text only) → PagePDFConverter (per-page: PyMuPDF → GPT Vision → DocIntelligence) | Text + images + scanned content |
| OfficeHandler | Gotenberg → DocIntelligence → MarkItDown | Text + images (via Gotenberg PDF conversion) |
| ImageHandler | GPT Vision (primary model) → GPT Vision (fallback model) | OCR on images |
| AudioHandler | Azure Speech | Transcription |
| TabularHandler | openpyxl/pandas | Tables to markdown |
| MarkupHandler | BeautifulSoup/yaml/json | Structured markup |
| TextHandler | chardet encoding detection | Raw text |
| EmailHandler | Python email parser | Email text + attachments |
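
The per-page routing in the PDFHandler row can be pictured as running a small fallback chain on every page and recording which method succeeded. The sketch below is synchronous and uses stub extractors; `convert_pages`, `log_line`, and the 20-character threshold are assumptions made for illustration, and the real PagePDFConverter may decide differently (for example by image area). The log line follows the `Component | filename.ext | page N | method | message` shape mentioned earlier.

```python
def log_line(component, filename, page, method, message):
    """Format one record as 'Component | filename.ext | page N | method | message'."""
    return f"{component} | {filename} | page {page} | {method} | {message}"


def convert_pages(filename, pages, extractors, min_chars=20):
    """For each page, keep the output of the first extractor that yields enough text."""
    parts, logs = [], []
    for number, page in enumerate(pages, start=1):
        for extract, method in extractors:
            text = (extract(page) or "").strip()
            if len(text) >= min_chars:
                parts.append(text)
                logs.append(
                    log_line("PagePDFConverter", filename, number, method,
                             f"{len(text)} chars")
                )
                break
        else:
            # No extractor produced usable text for this page.
            logs.append(
                log_line("PagePDFConverter", filename, number, "none",
                         "no text recovered")
            )
    return "\n\n".join(parts), logs


# Stub pages: dicts with an optional text layer and an optional OCR result.
pages = [
    {"text": "A digital page with a real text layer."},
    {"text": "", "ocr": "Text recovered from a scanned page."},
    {"text": ""},
]
extractors = [
    (lambda page: page.get("text"), "PyMuPDF"),
    (lambda page: page.get("ocr"), "GPT Vision"),
]
markdown, logs = convert_pages("report.pdf", pages, extractors)
```

Only pages without a usable text layer reach the second extractor, which is the point of routing per page instead of per document.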
ffmpeg (audio)

git clone https://github.com/your-org/markitdown-pro.git
cd markitdown-pro
# Install all dependencies (creates .venv automatically)
uv sync
# With dev tools (pytest, ruff)
uv sync --dev
Create a .env file in the project root:
# Azure OpenAI (required for GPT Vision OCR)
AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com"
AZURE_OPENAI_API_KEY="your-key"
AZURE_OPENAI_API_VERSION="2024-12-01-preview"
# Azure Document Intelligence (required for doc intelligence fallback)
AZURE_DOCINTEL_ENDPOINT="https://<resource>.cognitiveservices.azure.com"
AZURE_DOCINTEL_KEY="your-key"
# Azure Speech (required for audio transcription)
AZURE_SPEECH_KEY="your-key"
AZURE_SPEECH_REGION="eastus"
# Gotenberg (optional — for Office → PDF → OCR)
GOTENBERG_URL="http://gotenberg:3000"
# OCR model configuration (optional — defaults shown)
MARKITDOWN_OCR_MODEL="gpt-5.4-mini"
MARKITDOWN_OCR_FALLBACK_MODEL="gpt-5.4"
MARKITDOWN_OCR_TIMEOUT="60.0"
MARKITDOWN_OCR_MAX_RETRIES="6"
MARKITDOWN_MIN_IMAGE_AREA="150000"
# General
LOG_LEVEL=20 # 10=DEBUG, 20=INFO, 30=WARNING
All services are optional -- the library degrades gracefully when credentials are missing.
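
To see which optional services a given environment would enable, a small check such as the one below can help. It is a sketch, not part of the library: `SERVICE_VARS`, `available_services`, and `ocr_settings` are hypothetical names, and the assumption that each service needs exactly the variables listed is mine. The variable names and defaults are taken from the `.env` example above.

```python
import os

# Defaults copied from the .env example.
OCR_DEFAULTS = {
    "MARKITDOWN_OCR_MODEL": "gpt-5.4-mini",
    "MARKITDOWN_OCR_FALLBACK_MODEL": "gpt-5.4",
    "MARKITDOWN_OCR_TIMEOUT": "60.0",
    "MARKITDOWN_OCR_MAX_RETRIES": "6",
    "MARKITDOWN_MIN_IMAGE_AREA": "150000",
}

# Assumed grouping of variables per optional service.
SERVICE_VARS = {
    "gpt_vision": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "doc_intelligence": ["AZURE_DOCINTEL_ENDPOINT", "AZURE_DOCINTEL_KEY"],
    "speech": ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"],
    "gotenberg": ["GOTENBERG_URL"],
}


def available_services(env=None):
    """Map each service name to True when all of its variables are set and non-empty."""
    env = os.environ if env is None else env
    return {
        name: all(env.get(var) for var in needed)
        for name, needed in SERVICE_VARS.items()
    }


def ocr_settings(env=None):
    """Read the OCR settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in OCR_DEFAULTS.items()}
    return {
        "model": raw["MARKITDOWN_OCR_MODEL"],
        "fallback_model": raw["MARKITDOWN_OCR_FALLBACK_MODEL"],
        "timeout": float(raw["MARKITDOWN_OCR_TIMEOUT"]),
        "max_retries": int(raw["MARKITDOWN_OCR_MAX_RETRIES"]),
        "min_image_area": int(raw["MARKITDOWN_MIN_IMAGE_AREA"]),
    }
```

Calling `available_services()` with no argument reads the current process environment; passing a dict makes the check easy to test.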
import asyncio

from markitdown_pro.conversion_pipeline import ConversionPipeline


async def main():
    pipeline = ConversionPipeline()
    try:
        md = await pipeline.convert_document_to_md("/path/to/document.pdf")
        print(md)
    finally:
        # Release the pipeline's resources even if conversion fails
        await pipeline.aclose()


asyncio.run(main())
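
Because the conversion call is a coroutine, several documents can be converted concurrently. The helper below is a sketch under assumptions: `convert_many` and its `limit` parameter are hypothetical, and `fake_convert` is a stand-in so the example runs without the library or any credentials.

```python
import asyncio


async def convert_many(convert, paths, limit=4):
    """Convert paths concurrently, with at most `limit` conversions in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def convert_one(path):
        async with semaphore:
            try:
                return path, await convert(path)
            except Exception as exc:
                # Keep the failure next to its path instead of aborting the batch.
                return path, exc

    pairs = await asyncio.gather(*(convert_one(path) for path in paths))
    return dict(pairs)


# Stand-in converter (hypothetical) used only to demonstrate the helper.
async def fake_convert(path):
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError("unsupported")
    return f"# {path}"


results = asyncio.run(convert_many(fake_convert, ["a.pdf", "b.docx", "c.bad"], limit=2))
```

With the real pipeline, the `convert` argument would presumably be `pipeline.convert_document_to_md`, with `await pipeline.aclose()` in a `finally` block as in the example above. The semaphore keeps the number of simultaneous calls to remote services bounded.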
Skip Gotenberg and GPT Vision; try the local MarkItDown converter first and fall back to DocIntelligence:
from markitdown_pro.conversion_pipeline import ConversionPipeline
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.converters.doc_intel_converter import DocIntelligenceConverter
from markitdown_pro.handlers.office_handler import OfficeHandler

# Office: MarkItDown first (fast, local), DocIntelligence fallback
office = OfficeHandler(pipeline=[
    (MarkItDownConverter(), "MarkItDown"),
    (DocIntelligenceConverter(), "DocIntelligence"),
])
pipeline = ConversionPipeline(office_handler=office)
Ensure Office files go through Gotenberg for full OCR:
from markitdown_pro.converters.gotenberg_converter import GotenbergConverter
from markitdown_pro.converters.doc_intel_converter import DocIntelligenceConverter
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.handlers.office_handler import OfficeHandler
office = OfficeHandler(pipeline=[
    (GotenbergConverter(gotenberg_url="http://localhost:3000"), "Gotenberg"),
    (DocIntelligenceConverter(), "DocIntelligence"),
    (MarkItDownConverter(), "MarkItDown"),
])
pipeline = ConversionPipeline(office_handler=office)
Use only MarkItDown for Office files:
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.handlers.office_handler import OfficeHandler

office = OfficeHandler(pipeline=[
    (MarkItDownConverter(), "MarkItDown"),
])
pipeline = ConversionPipeline(office_handler=office)
Use a cheaper GPT Vision model for images:
from markitdown_pro.converters.gpt_vision_converter import GPTVisionConverter
from markitdown_pro.handlers.image_handler import ImageHandler

image = ImageHandler(pipeline=[
    GPTVisionConverter(model_name="gpt-5.4-nano"),  # cheapest
])
pipeline = ConversionPipeline(image_handler=image)
Any handler can be injected the same way:
pipeline = ConversionPipeline(
    pdf_handler=my_pdf_handler,
    office_handler=my_office_handler,
    image_handler=my_image_handler,
    audio_handler=my_audio_handler,
    # text, tabular, markup, email, epub, pst, ipynb also injectable
)
Gotenberg converts Office files to PDF for per-page OCR. Run it as a Docker container:
docker run -d -p 3000:3000 gotenberg/gotenberg:8
Or in Docker Compose:
services:
  gotenberg:
    image: gotenberg/gotenberg:8
    ports:
      - "3000:3000"
Set the URL in your environment:
GOTENBERG_URL=http://localhost:3000
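
Before routing Office files through Gotenberg, it can be useful to confirm the container is reachable. The sketch below uses only the standard library. The helper names are hypothetical; `/health` and `/forms/libreoffice/convert` are Gotenberg's health and LibreOffice conversion routes as I understand them, so check them against the documentation for the Gotenberg version you run.

```python
import json
import os
import urllib.error
import urllib.request


def gotenberg_endpoints(base_url=None):
    """Build the health and conversion URLs from GOTENBERG_URL or an explicit base."""
    base = base_url or os.environ.get("GOTENBERG_URL", "http://localhost:3000")
    base = base.rstrip("/")
    return {
        "health": f"{base}/health",
        "convert": f"{base}/forms/libreoffice/convert",
    }


def gotenberg_is_up(base_url=None, timeout=3.0):
    """Return True when the health route answers with status 'up', else False."""
    url = gotenberg_endpoints(base_url)["health"]
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response).get("status") == "up"
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable host, timeout, or a non-JSON body all count as "not up".
        return False
```

A caller could use `gotenberg_is_up()` to decide whether to put the Gotenberg converter at the front of the Office pipeline or leave it out.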
# Unit tests (fast, no credentials, ~0.2s)
uv run pytest tests/unit/ -v
# Integration tests (requires Azure credentials in .env)
uv run pytest -m integration -v
# Local-only integration tests (no Azure needed)
uv run pytest tests/integration/test_text_handler.py tests/integration/test_markup_handler.py tests/integration/test_tabular_handler.py tests/integration/test_email_handler.py -v
# All tests
uv run pytest tests/unit/ tests/integration/ -v
Test expectations are defined per-handler in tests/data/test_expectations.yaml.
# Lint
uv run ruff check markitdown_pro/
# Format
uv run ruff format markitdown_pro/
# Build package
uv build
MIT
FAQs
A package that converts almost any file format to Markdown.
The PyPI package markitdown-pro receives a total of 42 weekly downloads. As such, its popularity is classified as not popular.
We found that markitdown-pro demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 0 open source maintainers collaborating on the project.