doc2mark

Turn any document into clean Markdown – in one line.
Why doc2mark?
- Converts PDFs, DOCX/XLSX/PPTX, images, HTML, CSV/JSON, and more
- AI OCR for scans and screenshots (OpenAI)
- Preserves complex tables (merged cells, headers) and basic layout
- One simple API + CLI for single files or whole folders
Install
pip install "doc2mark[all]"
Try it in 30 seconds
from doc2mark import UnifiedDocumentLoader
loader = UnifiedDocumentLoader(ocr_provider='openai')
result = loader.load('sample_documents/sample_pdf.pdf', extract_images=True, ocr_images=True)
print(result.content)
CLI:
doc2mark sample_documents/sample_document.docx
doc2mark sample_documents -o output -r
export OPENAI_API_KEY=sk-...
doc2mark sample_documents/sample_pdf.pdf --ocr openai --ocr-images
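When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags shown above (-o, -r, --ocr, --ocr-images). The helper name build_command is hypothetical, and whether every flag combination is accepted by the CLI is an assumption.

```python
import shlex


def build_command(path, output=None, recursive=False, ocr=None, ocr_images=False):
    """Assemble a doc2mark argument list (hypothetical helper, not part of doc2mark)."""
    argv = ["doc2mark", str(path)]
    if output:
        argv += ["-o", str(output)]
    if recursive:
        argv.append("-r")
    if ocr:
        argv += ["--ocr", ocr]
    if ocr_images:
        argv.append("--ocr-images")
    return argv


# Show the command as it would be typed in a shell.
print(shlex.join(build_command("sample_documents", output="output", recursive=True)))

# To actually run it (requires doc2mark to be installed):
# import subprocess
# subprocess.run(build_command("sample_documents", output="output", recursive=True), check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with paths that contain spaces.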
Supported formats
- PDF • DOCX • XLSX • PPTX • Images (PNG/JPG/WEBP) • TXT/CSV/TSV/JSON/JSONL • HTML/XML/MD
- Legacy Office (DOC/XLS/PPT/RTF/PPS) via LibreOffice (optional)
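To check up front whether a file falls into one of the groups listed above, a small lookup by extension may be enough. This is a sketch: the extension sets are transcribed from the list above, the helper names are hypothetical, and the LibreOffice binary names are an assumption.

```python
import shutil
from pathlib import Path

# Extensions transcribed from the "Supported formats" list; the real loader
# may accept more variants (for example .jpeg or .htm), so treat this as a guess.
NATIVE = {".pdf", ".docx", ".xlsx", ".pptx", ".png", ".jpg", ".webp",
          ".txt", ".csv", ".tsv", ".json", ".jsonl", ".html", ".xml", ".md"}
LEGACY = {".doc", ".xls", ".ppt", ".rtf", ".pps"}


def classify(path):
    """Return 'native', 'legacy', or 'unsupported' for a path (hypothetical helper)."""
    ext = Path(path).suffix.lower()
    if ext in NATIVE:
        return "native"
    if ext in LEGACY:
        return "legacy"
    return "unsupported"


def libreoffice_available():
    """True if a LibreOffice binary is on PATH (binary names are an assumption)."""
    return any(shutil.which(name) for name in ("soffice", "libreoffice"))


for name in ("report.PDF", "old_memo.doc", "archive.zip"):
    print(name, "->", classify(name))
```

Files classified as legacy are the ones that need the optional LibreOffice install, so libreoffice_available() can be checked before attempting them.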
Common recipes
from doc2mark import UnifiedDocumentLoader
loader = UnifiedDocumentLoader(ocr_provider='openai')
print(loader.load('document.pdf').content)
print(loader.load('screenshot.png', extract_images=True, ocr_images=True).content)
loader.batch_process(
    input_dir='documents/',
    output_dir='converted/',
    extract_images=True,
    ocr_images=True,
    show_progress=True,
    save_files=True
)
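If per-file control is needed, for example to keep going when one document fails, a folder can also be converted with a plain loop around loader.load(...).content, the call shown in the recipes above. The sketch below is not how batch_process works internally. The function name convert_folder and the output naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def convert_folder(loader, input_dir, output_dir, **load_kwargs):
    """Convert every file under input_dir to Markdown and collect failures.

    `loader` is any object with a load(path, **kwargs) method whose result
    has a .content string, as UnifiedDocumentLoader does in the recipes.
    """
    in_root, out_root = Path(input_dir), Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    # Materialize the file list first so files written below are not revisited.
    for src in sorted(p for p in in_root.rglob("*") if p.is_file()):
        try:
            markdown = loader.load(str(src), **load_kwargs).content
        except Exception as exc:  # record the failure and continue with the rest
            failed.append((src, exc))
            continue
        rel = src.relative_to(in_root)
        # Keep the original extension in the name so a.pdf and a.docx do not collide.
        target = out_root / rel.parent / (rel.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written, failed


# Usage with the real loader (requires doc2mark and an API key for OCR):
# from doc2mark import UnifiedDocumentLoader
# loader = UnifiedDocumentLoader(ocr_provider='openai')
# written, failed = convert_folder(loader, 'documents/', 'converted/',
#                                  extract_images=True, ocr_images=True)
```

Taking the loader as a parameter keeps the loop testable with a stand-in object and leaves the OCR configuration to the caller.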
OpenAI OCR (optional)
export OPENAI_API_KEY=your_key
loader = UnifiedDocumentLoader(ocr_provider='openai')
Use OpenAI‑compatible endpoints (self‑hosted/offline VLM):
loader = UnifiedDocumentLoader(
    ocr_provider='openai',
    base_url='http://localhost:11434/v1',
    api_key='your-key-or-any-string',
    model='gpt-4o-mini'
)
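To switch between the hosted API and a local endpoint without editing code, the constructor arguments above can be read from the environment. This is a minimal sketch: OPENAI_API_KEY is the variable used earlier, while DOC2MARK_BASE_URL and DOC2MARK_MODEL are hypothetical names invented for this example and are not read by doc2mark itself.

```python
import os


def loader_kwargs_from_env(env=None):
    """Build keyword arguments for UnifiedDocumentLoader from environment variables."""
    env = os.environ if env is None else env
    kwargs = {"ocr_provider": "openai"}
    # Parameter names (api_key, base_url, model) come from the example above;
    # the last two variable names are hypothetical.
    mapping = {
        "api_key": "OPENAI_API_KEY",
        "base_url": "DOC2MARK_BASE_URL",
        "model": "DOC2MARK_MODEL",
    }
    for param, var in mapping.items():
        value = env.get(var)
        if value:  # skip unset or empty variables so library defaults apply
            kwargs[param] = value
    return kwargs


# Demonstrate with an explicit mapping instead of the real environment.
demo = loader_kwargs_from_env({
    "OPENAI_API_KEY": "any-string",
    "DOC2MARK_BASE_URL": "http://localhost:11434/v1",
})
print(sorted(demo))

# With doc2mark installed:
# from doc2mark import UnifiedDocumentLoader
# loader = UnifiedDocumentLoader(**loader_kwargs_from_env())
```

Unset variables are left out of the dictionary, so the loader falls back to whatever defaults it defines.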
Tips
- Use extract_images=True, ocr_images=True to convert images to text
- batch_process(..., save_files=True) writes .md (and .json when requested)
- Sample files live in sample_documents/, perfect for a quick test
License
MIT — see LICENSE.