
markitdown-node
TypeScript document extraction library inspired by markitdown. Converts PDF, DOCX, PPTX, XLSX, HTML, CSV, JSON, XML, RSS, Atom, ZIP, Jupyter Notebooks, Bing SERP, images (PNG, JPEG, TIFF with OCR), subtitles (VTT, SRT), and YouTube videos to JSON and Mark
A powerful TypeScript document extraction library that converts 20+ file formats into structured JSON and Markdown.
npm install markitdown-node
# or
pnpm install markitdown-node
This project uses a pnpm workspace for a better development experience:
# Install pnpm if you haven't already
npm install -g pnpm
# Clone and setup
git clone https://github.com/leoning60/markitdown-node.git
cd markitdown-node
pnpm install
pnpm run build
import { MarkItDown } from 'markitdown-node';
const converter = new MarkItDown();
const result = await converter.convert('./document.docx');
if (result.status === 'success') {
  console.log(result.markdown_content); // ✨ Auto-generated Markdown
  console.log(result.json_content); // ✨ Auto-generated JSON
}
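The quick start only handles the success branch. As a minimal sketch of handling the other branch, the helper below summarizes a result using the `status`, `errors`, and `warnings` fields listed in the `ConversionResult` interface later in this README. `ResultLike` and `describeResult` are hypothetical names for illustration, not part of the library, and the sketch assumes `errors` is populated when `status` is `'error'`.

```typescript
// Local stand-in mirroring the ConversionResult fields used here.
// Assumption: the real type carries more fields (e.g. `document`, `json_content`).
interface ResultLike {
  status: 'success' | 'error';
  markdown_content?: string;
  errors?: string[];
  warnings?: string[];
}

// Hypothetical helper: turn a result into a one-line, log-friendly summary.
function describeResult(result: ResultLike): string {
  const warnings = result.warnings ?? [];
  if (result.status === 'error') {
    const errors = result.errors ?? [];
    return `failed: ${errors.length > 0 ? errors.join('; ') : 'unknown error'}`;
  }
  const length = result.markdown_content?.length ?? 0;
  const suffix = warnings.length > 0 ? `, ${warnings.length} warning(s)` : '';
  return `ok: ${length} characters of Markdown${suffix}`;
}
```

A call such as `console.log(describeResult(await converter.convert('./document.docx')))` would then log a single line for either outcome.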
One-liner conversions:
import { convertToMarkdown, convertToJSON } from 'markitdown-node';
const markdown = await convertToMarkdown('./document.pdf');
const json = await convertToJSON('./data.xlsx');
| Category | Formats | Notes |
|---|---|---|
| Documents | PDF, DOCX, PPTX, XLSX | Office documents and spreadsheets |
| Web | HTML, RSS, Atom | Web pages and feeds |
| Images | PNG, JPEG, TIFF | With EXIF metadata and OCR |
| Media | Audio (WAV, MP3, etc.), YouTube | Audio transcription via LLM, YouTube transcripts |
| Code | Jupyter Notebooks (.ipynb) | Markdown cells, code, and outputs |
| Text | TXT, CSV, JSON, XML | Plain text and structured data |
| Subtitles | SRT, VTT | Subtitle files |
| Archives | ZIP | Recursive extraction |
| Search | Bing SERP | Search result pages |
const converter = new MarkItDown();
// PDF to Markdown
const pdf = await converter.convert('./report.pdf');
// Excel to JSON
const excel = await converter.convert('./data.xlsx');
console.log(excel.json_content); // Table structure
// Image with OCR
const image = await converter.convert('./document.png');
console.log(image.markdown_content); // Extracted text
// CSV → Table structure
const csv = await converter.convert('./data.csv');
// Outputs Markdown table and structured JSON
// JSON → Formatted output
const json = await converter.convert('./config.json');
// Pretty-printed code block with extracted fields
// Generic XML
const xml = await converter.convert('./config.xml');
// RSS Feed → Structured articles
const rss = await converter.convert('./feed.rss');
// Channel metadata + all articles
// Atom Feed → Structured entries
const atom = await converter.convert('./feed.atom');
// Requires: pnpm install unzipper
const result = await converter.convert('./archive.zip');
// All files in ZIP are extracted and converted
// Requires: pnpm install youtube-transcript
const converter = new MarkItDown({
  defaultOptions: {
    enableTranscript: true,
    transcriptLanguage: 'en',
  },
});
const result = await converter.convert(youtubeHTML, {
  url: 'https://www.youtube.com/watch?v=VIDEO_ID',
});
// Requires LLM configuration (OpenAI, etc.)
const result = await converter.convert('./audio.wav');
console.log(result.markdown_content); // Transcribed text
const result = await converter.convert('./notebook.ipynb');
// Markdown cells, code cells, and outputs are preserved
// Extract search results from Bing HTML
const result = await converter.convert('./bing-results.html');
// Structured search results with titles, URLs, descriptions
const converter = new MarkItDown({
  defaultOptions: {
    ocrLanguages: 'chi_sim+eng', // OCR: Chinese + English
    extractImages: true,
    extractTables: true,
  },
});
This project uses a pnpm workspace. Examples automatically use the local package:
# First time setup
pnpm install
pnpm run build
# Run examples
cd examples
node 01-quick-start.js # Basic usage
node 02-all-formats.js # All supported formats
node 03-docx-example.js # Word documents
node 04-pdf-example.js # PDF documents
node 05-image-example.js # OCR from images
node 06-excel-example.js # Excel spreadsheets
node 07-powerpoint-example.js # PowerPoint presentations
node 08-html-example.js # HTML pages
node 09-subtitle-example.js # Subtitle files
node 10-convenience-functions.js # Convenience functions
node 11-ocr-languages.js # OCR with multiple languages
node 12-bing-serp-example.js # Bing SERP results
node 13-ipynb-example.js # Jupyter Notebooks
node 14-csv-json-example.js # CSV and JSON files
After modifying source code, just rebuild:
pnpm run build
cd examples
node 01-quick-start.js # Automatically uses latest build
See examples/README.md for more details.
Images are processed with Tesseract.js OCR, supporting 110+ languages.
const converter = new MarkItDown({
  defaultOptions: {
    ocrLanguages: 'chi_sim+eng' // Default: Chinese + English
  }
});
// English only
ocrLanguages: 'eng'
// Japanese + English
ocrLanguages: 'jpn+eng'
// Multiple languages
ocrLanguages: 'chi_sim+eng+fra'
| Language | Code | Language | Code |
|---|---|---|---|
| English | eng | Spanish | spa |
| Chinese (Simplified) | chi_sim | French | fra |
| Chinese (Traditional) | chi_tra | German | deu |
| Japanese | jpn | Italian | ita |
| Korean | kor | Portuguese | por |
| Russian | rus | Arabic | ara |
| Hindi | hin | Thai | tha |
| Vietnamese | vie | Turkish | tur |
📖 Full language list (110+ languages supported)
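The `ocrLanguages` value is a string of Tesseract language codes joined with `+`. When the codes come from configuration or user input, a small helper can build that string. This is a sketch only: `buildOcrLanguages` is a hypothetical name, not part of markitdown-node, and it does not check codes against the languages Tesseract actually supports.

```typescript
// Hypothetical helper: join language codes into the '+'-separated form used by
// the `ocrLanguages` option, dropping blank entries and duplicates.
function buildOcrLanguages(codes: string[]): string {
  const seen = new Set<string>();
  const cleaned: string[] = [];
  for (const raw of codes) {
    const code = raw.trim();
    if (code !== '' && !seen.has(code)) {
      seen.add(code);
      cleaned.push(code);
    }
  }
  if (cleaned.length === 0) {
    throw new Error('At least one OCR language code is required');
  }
  return cleaned.join('+');
}
```

The result can be passed straight through, for example `ocrLanguages: buildOcrLanguages(['jpn', 'eng'])`.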
Some formats require additional packages:
# For ZIP file support
pnpm install unzipper
# For YouTube transcript extraction
pnpm install youtube-transcript
# For audio transcription (LLM-based)
# Configure your LLM provider (OpenAI, etc.) in the options
interface ConversionResult {
  status: 'success' | 'error';
  document?: Document; // Structured document object
  json_content?: DocumentItem[]; // ✨ Auto-generated JSON
  markdown_content?: string; // ✨ Auto-generated Markdown
  errors?: string[];
  warnings?: string[];
}
interface Document {
  metadata: {
    filename: string;
    format: InputFormat;
    title?: string;
    author?: string;
    // ... more metadata
  };
  content: DocumentItem[]; // Array of content items
}
interface DocumentItem {
  type: 'text' | 'heading' | 'paragraph' | 'list' | 'table'; // ... more item types
  text?: string;
  level?: number;
  children?: DocumentItem[];
  // ... more fields
}
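Because `DocumentItem` nests through `children`, consuming `json_content` usually means walking a tree. The sketch below gathers every `text` value depth-first. `ItemLike` and `collectText` are hypothetical names, and the ordering (a parent's own text before its children's) is an assumption about how the items are meant to be read, not documented library behavior.

```typescript
// Local stand-in for the DocumentItem shape shown above (only the fields used here).
interface ItemLike {
  type: string;
  text?: string;
  level?: number;
  children?: ItemLike[];
}

// Depth-first walk that gathers non-empty `text` values in document order.
function collectText(items: ItemLike[]): string[] {
  const out: string[] = [];
  for (const item of items) {
    if (item.text !== undefined && item.text !== '') {
      out.push(item.text);
    }
    if (item.children) {
      out.push(...collectText(item.children));
    }
  }
  return out;
}
```

Applied to a successful result, `collectText(result.json_content ?? [])` would yield a flat list of strings suitable for indexing or search.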
enum InputFormat {
  // Documents
  PDF = 'pdf',
  DOCX = 'docx',
  PPTX = 'pptx',
  XLSX = 'xlsx',
  // Web & Feeds
  HTML = 'html',
  RSS = 'rss',
  ATOM = 'atom',
  // Text & Data
  TEXT = 'text',
  CSV = 'csv',
  JSON = 'json',
  XML = 'xml',
  // Media
  IMAGE = 'image',
  AUDIO = 'audio',
  YOUTUBE = 'youtube',
  // Code & Archives
  IPYNB = 'ipynb',
  ZIP = 'zip',
  // Subtitles
  SUBTITLE = 'subtitle',
  // Special
  BINGSERP = 'bingserp',
}
This project uses a pnpm workspace:
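Several enum values cover more than one file extension (`image`, `audio`, `subtitle`). As an illustration of how the values relate to filenames, the sketch below guesses a format string from an extension. It is not how the library detects formats; `guessFormat` and the extension table are assumptions made for this example. YouTube and Bing SERP inputs are HTML and cannot be told apart by extension, so they are left out.

```typescript
// Assumed extension-to-format table; values match the InputFormat strings above.
const EXTENSION_TO_FORMAT = new Map<string, string>([
  ['pdf', 'pdf'], ['docx', 'docx'], ['pptx', 'pptx'], ['xlsx', 'xlsx'],
  ['html', 'html'], ['htm', 'html'], ['rss', 'rss'], ['atom', 'atom'],
  ['txt', 'text'], ['csv', 'csv'], ['json', 'json'], ['xml', 'xml'],
  ['png', 'image'], ['jpg', 'image'], ['jpeg', 'image'], ['tiff', 'image'],
  ['wav', 'audio'], ['mp3', 'audio'],
  ['ipynb', 'ipynb'], ['zip', 'zip'],
  ['srt', 'subtitle'], ['vtt', 'subtitle'],
]);

// Hypothetical helper: return the format string for a filename, or undefined
// when the name has no extension or the extension is not in the table.
function guessFormat(filename: string): string | undefined {
  const dot = filename.lastIndexOf('.');
  if (dot < 0 || dot === filename.length - 1) {
    return undefined;
  }
  const extension = filename.slice(dot + 1).toLowerCase();
  return EXTENSION_TO_FORMAT.get(extension);
}
```

A caller could use this to skip unsupported files before invoking `convert`, for example when iterating over a directory.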
# Install dependencies
pnpm install
# Build
pnpm run build
# Watch mode
pnpm run dev
# Type check
pnpm run typecheck
# Clean build artifacts
pnpm run clean
# Rebuild from scratch
pnpm run rebuild
# Dry run to check what will be published
pnpm run publish:dry-run
# Release (bumps version, commits, tags, and publishes)
pnpm run release
markitdown-node/
├── pnpm-workspace.yaml # Workspace configuration
├── package.json # Main package
├── src/ # Source code
│ ├── converter.ts # Main converter class
│ ├── backends/ # Format-specific backends
│ ├── exporters/ # JSON and Markdown exporters
│ └── types/ # TypeScript types
├── dist/ # Built output (generated)
├── examples/ # Example usage (workspace package)
│ ├── package.json # Uses "workspace:*" dependency
│ └── *.js # Example files
└── README.md # This file
MIT
Inspired by markitdown by Microsoft.