markitdown-node
A TypeScript document extraction library that converts 20+ file formats into structured JSON and Markdown.
Features
- 📄 20+ Format Support: Documents (PDF, DOCX, PPTX, XLSX), Web (HTML, RSS, Atom), Images (OCR), Media (Audio, YouTube), Code (Jupyter), Archives (ZIP), Search (Bing SERP), and more
- 🔄 Unified API: Simple, consistent interface for all formats
- 📊 Dual Output: Auto-generates both JSON and Markdown from a single conversion
- 🎯 TypeScript: Full type safety and IntelliSense support
- 🚀 Zero Config: Works out of the box with sensible defaults
- 🖼️ OCR Support: Extract text from images using Tesseract.js (110+ languages)
- 🎙️ Audio Transcription: Convert audio files to text (via LLM integration)
- 📦 pnpm Workspace: Optimized development experience with automatic linking
Installation
For Users
npm install markitdown-node
pnpm install markitdown-node
For Development
This project uses a pnpm workspace for a better development experience:
npm install -g pnpm
git clone https://github.com/leoning60/markitdown-node.git
cd markitdown-node
pnpm install
pnpm run build
Quick Start
import { MarkItDown } from 'markitdown-node';

const converter = new MarkItDown();
const result = await converter.convert('./document.docx');

// Both outputs come from the same conversion.
if (result.status === 'success') {
  console.log(result.markdown_content);
  console.log(result.json_content);
}
One-liner conversions:
import { convertToMarkdown, convertToJSON } from 'markitdown-node';
const markdown = await convertToMarkdown('./document.pdf');
const json = await convertToJSON('./data.xlsx');
Supported Formats
| Category | Formats | Description |
|---|---|---|
| Documents | PDF, DOCX, PPTX, XLSX | Office documents and spreadsheets |
| Web | HTML, RSS, Atom | Web pages and feeds |
| Images | PNG, JPEG, TIFF | With EXIF metadata and OCR |
| Media | Audio (WAV, MP3, etc.), YouTube | Audio transcription via LLM, YouTube transcripts |
| Code | Jupyter Notebooks (.ipynb) | Markdown cells, code, and outputs |
| Text | TXT, CSV, JSON, XML | Plain text and structured data |
| Subtitles | SRT, VTT | Subtitle files |
| Archives | ZIP | Recursive extraction |
| Search | Bing SERP | Search result pages |
Usage Examples
Convert Documents
const converter = new MarkItDown();
// PDF
const pdf = await converter.convert('./report.pdf');
// Excel: inspect the structured JSON output
const excel = await converter.convert('./data.xlsx');
console.log(excel.json_content);
// Image: text is extracted with OCR
const image = await converter.convert('./document.png');
console.log(image.markdown_content);
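Converting several files at once can be sketched as below. This is a minimal illustration, not part of the library: `FileConverter`, `BatchEntry`, and `convertAll` are hypothetical names, and the local `ConversionResult` is a trimmed mirror of the type documented under API Types. It assumes only that the converter exposes an async `convert(path)` method, and it handles both a rejected promise and a returned `status: 'error'` because the README does not say which one the library uses on failure.

```typescript
// Trimmed mirror of the README's ConversionResult (assumption: only these fields are needed here).
interface ConversionResult {
  status: 'success' | 'error';
  markdown_content?: string;
  errors?: string[];
}

// Hypothetical structural type: any object with an async convert(path) method.
interface FileConverter {
  convert(path: string): Promise<ConversionResult>;
}

// Hypothetical pairing of an input path with its outcome.
interface BatchEntry {
  path: string;
  result: ConversionResult;
}

// Convert all paths concurrently; a thrown error becomes an 'error' result
// so that one bad file does not abort the whole batch.
async function convertAll(converter: FileConverter, paths: string[]): Promise<BatchEntry[]> {
  return Promise.all(
    paths.map(async (path): Promise<BatchEntry> => {
      try {
        return { path, result: await converter.convert(path) };
      } catch (err) {
        return { path, result: { status: 'error', errors: [String(err)] } };
      }
    }),
  );
}
```

Passing a `MarkItDown` instance as the first argument should work if its `convert` method matches this shape; entries come back in the same order as the input paths.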
CSV and JSON Files
const csv = await converter.convert('./data.csv');
const json = await converter.convert('./config.json');
XML, RSS, and Atom Feeds
const xml = await converter.convert('./config.xml');
const rss = await converter.convert('./feed.rss');
const atom = await converter.convert('./feed.atom');
ZIP Archives
const result = await converter.convert('./archive.zip');
YouTube Transcripts
const converter = new MarkItDown({
  defaultOptions: {
    enableTranscript: true,
    transcriptLanguage: 'en',
  },
});
// youtubeHTML holds the HTML source of the video page
const result = await converter.convert(youtubeHTML, {
  url: 'https://www.youtube.com/watch?v=VIDEO_ID',
});
Audio Transcription
const result = await converter.convert('./audio.wav');
console.log(result.markdown_content);
Jupyter Notebooks
const result = await converter.convert('./notebook.ipynb');
Bing SERP
const result = await converter.convert('./bing-results.html');
Custom Options
const converter = new MarkItDown({
  defaultOptions: {
    ocrLanguages: 'chi_sim+eng',
    extractImages: true,
    extractTables: true,
  },
});
Running Examples
This project uses a pnpm workspace, so the examples automatically use the local package:
pnpm install
pnpm run build
cd examples
node 01-quick-start.js
node 02-all-formats.js
node 03-docx-example.js
node 04-pdf-example.js
node 05-image-example.js
node 06-excel-example.js
node 07-powerpoint-example.js
node 08-html-example.js
node 09-subtitle-example.js
node 10-convenience-functions.js
node 11-ocr-languages.js
node 12-bing-serp-example.js
node 13-ipynb-example.js
node 14-csv-json-example.js
After modifying the source code, rebuild and rerun an example:
pnpm run build
cd examples
node 01-quick-start.js
See examples/README.md for more details.
OCR Configuration
Images are processed with Tesseract.js OCR, supporting 110+ languages.
Configure Languages
const converter = new MarkItDown({
  defaultOptions: {
    ocrLanguages: 'chi_sim+eng'
  }
});
ocrLanguages: 'eng' // English only
ocrLanguages: 'jpn+eng' // Japanese and English
ocrLanguages: 'chi_sim+eng+fra' // Simplified Chinese, English, and French
Common Language Codes
| Language | Code | Language | Code |
|---|---|---|---|
| English | eng | Spanish | spa |
| Chinese (Simplified) | chi_sim | French | fra |
| Chinese (Traditional) | chi_tra | German | deu |
| Japanese | jpn | Italian | ita |
| Korean | kor | Portuguese | por |
| Russian | rus | Arabic | ara |
| Hindi | hin | Thai | tha |
| Vietnamese | vie | Turkish | tur |
📖 Full language list (110+ languages supported)
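Because `ocrLanguages` is a `+`-separated string, it can be convenient to build it from a list of codes. The helper below is a hypothetical sketch, not a library function: `buildOcrLanguages` and `TABLE_CODES` are made-up names, and validation covers only the codes in the table above, so extend the set for any other Tesseract language you need.

```typescript
// Codes copied from the "Common Language Codes" table; Tesseract supports many more.
const TABLE_CODES = new Set<string>([
  'eng', 'spa', 'chi_sim', 'fra', 'chi_tra', 'deu', 'jpn', 'ita',
  'kor', 'por', 'rus', 'ara', 'hin', 'tha', 'vie', 'tur',
]);

// Hypothetical helper: trim, de-duplicate, validate, then join with '+'.
function buildOcrLanguages(codes: string[]): string {
  const cleaned = codes.map((code) => code.trim()).filter((code) => code.length > 0);
  const unique = cleaned.filter((code, i) => cleaned.indexOf(code) === i);
  if (unique.length === 0) {
    throw new Error('At least one language code is required');
  }
  const unknown = unique.filter((code) => !TABLE_CODES.has(code));
  if (unknown.length > 0) {
    throw new Error(`Unknown language code(s): ${unknown.join(', ')}`);
  }
  return unique.join('+');
}

console.log(buildOcrLanguages(['chi_sim', 'eng'])); // chi_sim+eng
```

The returned string can be passed as the `ocrLanguages` value in `defaultOptions`.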
Optional Dependencies
Some formats require additional packages:
pnpm install unzipper # ZIP archives
pnpm install youtube-transcript # YouTube transcripts
API Types
ConversionResult
interface ConversionResult {
  status: 'success' | 'error';
  document?: Document;
  json_content?: DocumentItem[];
  markdown_content?: string;
  errors?: string[];
  warnings?: string[];
}
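Since every content field on `ConversionResult` is optional, callers need to check `status` before reading the output. One way to centralize that check is sketched below. `requireMarkdown` is a hypothetical helper, and the interfaces are simplified local mirrors of the README's types (the `document` field is omitted and `type` is widened to `string`) so the snippet compiles on its own.

```typescript
// Simplified local mirrors of the documented types (assumption: enough for this sketch).
interface DocumentItem {
  type: string;
  text?: string;
  level?: number;
  children?: DocumentItem[];
}

interface ConversionResult {
  status: 'success' | 'error';
  json_content?: DocumentItem[];
  markdown_content?: string;
  errors?: string[];
  warnings?: string[];
}

// Hypothetical helper: return the Markdown, or throw with the collected error messages.
function requireMarkdown(result: ConversionResult): string {
  if (result.status !== 'success' || result.markdown_content === undefined) {
    const reasons =
      result.errors && result.errors.length > 0 ? result.errors.join('; ') : 'unknown error';
    throw new Error(`Conversion failed: ${reasons}`);
  }
  return result.markdown_content;
}
```

With this in place, `requireMarkdown(await converter.convert('./report.pdf'))` either yields a string or fails loudly, which avoids silently logging `undefined`.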
Document Structure
interface Document {
  metadata: {
    filename: string;
    format: InputFormat;
    title?: string;
    author?: string;
  };
  content: DocumentItem[];
}
interface DocumentItem {
  type: 'text' | 'heading' | 'paragraph' | 'list' | 'table' | ...;
  text?: string;
  level?: number;
  children?: DocumentItem[];
}
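Because `DocumentItem` is recursive through `children`, consuming `json_content` usually means walking a tree. The sketch below builds a heading outline with a depth-first traversal. It is illustrative only: `buildOutline` is a hypothetical helper, `type` is widened to `string` in the local mirror, and it assumes headings carry a 1-based `level`, which the README does not state explicitly.

```typescript
// Local mirror of the documented DocumentItem, with `type` widened to string.
interface DocumentItem {
  type: string;
  text?: string;
  level?: number;
  children?: DocumentItem[];
}

// Depth-first walk that turns every heading into an indented outline line.
// Assumption: `level` starts at 1; a missing level is treated as 1.
function buildOutline(items: DocumentItem[]): string[] {
  const lines: string[] = [];
  const visit = (item: DocumentItem): void => {
    if (item.type === 'heading' && item.text) {
      const depth = Math.max((item.level ?? 1) - 1, 0);
      lines.push(`${'  '.repeat(depth)}- ${item.text}`);
    }
    (item.children ?? []).forEach((child) => visit(child));
  };
  items.forEach((item) => visit(item));
  return lines;
}

// Hand-written sample data, not output from the library.
const sample: DocumentItem[] = [
  { type: 'heading', level: 1, text: 'Report' },
  { type: 'paragraph', text: 'Intro text.' },
  {
    type: 'heading',
    level: 2,
    text: 'Findings',
    children: [{ type: 'heading', level: 3, text: 'Details' }],
  },
];

console.log(buildOutline(sample).join('\n'));
```

The same traversal pattern applies to other item types, for example collecting every `table` item or concatenating all `text` fields.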
InputFormat Enum
enum InputFormat {
  PDF = 'pdf',
  DOCX = 'docx',
  PPTX = 'pptx',
  XLSX = 'xlsx',
  HTML = 'html',
  RSS = 'rss',
  ATOM = 'atom',
  TEXT = 'text',
  CSV = 'csv',
  JSON = 'json',
  XML = 'xml',
  IMAGE = 'image',
  AUDIO = 'audio',
  YOUTUBE = 'youtube',
  IPYNB = 'ipynb',
  ZIP = 'zip',
  SUBTITLE = 'subtitle',
  BINGSERP = 'bingserp',
}
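To relate file names to these values, for instance when grouping inputs before conversion, an extension lookup can be sketched as follows. This is not the library's detection logic, which the README does not describe: `guessFormat`, `FormatValue`, and the extension table are assumptions made for illustration. `FormatValue` lists the enum's string values as a plain union so the snippet is self-contained. YouTube and Bing SERP inputs are HTML pages and cannot be told apart by extension, so they are left out.

```typescript
// String values of the InputFormat enum above, minus the two that need content inspection.
type FormatValue =
  | 'pdf' | 'docx' | 'pptx' | 'xlsx' | 'html' | 'rss' | 'atom' | 'text'
  | 'csv' | 'json' | 'xml' | 'image' | 'audio' | 'ipynb' | 'zip' | 'subtitle';

// Hypothetical extension table based on the formats the README lists.
const EXTENSION_TO_FORMAT = new Map<string, FormatValue>([
  ['pdf', 'pdf'], ['docx', 'docx'], ['pptx', 'pptx'], ['xlsx', 'xlsx'],
  ['html', 'html'], ['htm', 'html'], ['rss', 'rss'], ['atom', 'atom'],
  ['txt', 'text'], ['csv', 'csv'], ['json', 'json'], ['xml', 'xml'],
  ['png', 'image'], ['jpg', 'image'], ['jpeg', 'image'], ['tiff', 'image'],
  ['wav', 'audio'], ['mp3', 'audio'], ['ipynb', 'ipynb'], ['zip', 'zip'],
  ['srt', 'subtitle'], ['vtt', 'subtitle'],
]);

// Return the format implied by the file extension, or undefined if it is unknown.
function guessFormat(filename: string): FormatValue | undefined {
  const dot = filename.lastIndexOf('.');
  if (dot < 0 || dot === filename.length - 1) {
    return undefined;
  }
  const extension = filename.slice(dot + 1).toLowerCase();
  return EXTENSION_TO_FORMAT.get(extension);
}

console.log(guessFormat('./report.PDF')); // pdf
```

A `Map` is used instead of a plain object so that names such as `constructor` cannot accidentally match an inherited property.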
Development
This project uses a pnpm workspace:
pnpm install
pnpm run build
pnpm run dev
pnpm run typecheck
pnpm run clean
pnpm run rebuild
Publishing
pnpm run publish:dry-run
pnpm run release
Project Structure
markitdown-node/
├── pnpm-workspace.yaml # Workspace configuration
├── package.json # Main package
├── src/ # Source code
│ ├── converter.ts # Main converter class
│ ├── backends/ # Format-specific backends
│ ├── exporters/ # JSON and Markdown exporters
│ └── types/ # TypeScript types
├── dist/ # Built output (generated)
├── examples/ # Example usage (workspace package)
│ ├── package.json # Uses "workspace:*" dependency
│ └── *.js # Example files
└── README.md # This file
License
MIT
Acknowledgments
Inspired by markitdown by Microsoft.