
Security News
Software Engineering Daily Podcast: Feross on AI, Open Source, and Supply Chain Risk
Socket CEO Feross Aboukhadijeh joins Software Engineering Daily to discuss modern software supply chain attacks and rising AI-driven security risks.
html-to-markdown
Advanced tools
High-performance HTML to Markdown converter powered by Rust with a clean Python API
High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). The same engine also drives the Node.js, Ruby, PHP, and WebAssembly bindings, so rendered Markdown stays identical across runtimes. Wheels are published for Linux, macOS, and Windows.
pip install html-to-markdown
Apple M4 • Real Wikipedia documents • convert() (Python)
| Document | Size | Latency | Throughput | Docs/sec |
|---|---|---|---|---|
| Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
| Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
| Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2's Rust engine delivers 60–80× higher throughput.
Pulled directly from tools/runtime-bench (task bench:bindings -- --language python) so they stay in lockstep with the Rust core:
| Document | Size | ops/sec (Python) |
|---|---|---|
| Lists (Timeline) | 129 KB | 1,405 |
| Tables (Countries) | 360 KB | 352 |
| Medium (Python) | 657 KB | 158 |
| Large (Rust) | 567 KB | 183 |
| Small (Intro) | 463 KB | 223 |
| hOCR German PDF | 44 KB | 2,991 |
| hOCR Invoice | 4 KB | 23,500 |
| hOCR Embedded Tables | 37 KB | 3,464 |
Re-run locally with
task bench:bindings -- --language python --output tmp.jsonto compare against CI history.
from html_to_markdown import convert
html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
<li>Blazing fast</li>
<li>Type safe</li>
<li>Easy to use</li>
</ul>
"""
markdown = convert(html)
print(markdown)
from html_to_markdown import ConversionOptions, convert
options = ConversionOptions(
heading_style="atx",
list_indent_width=2,
bullets="*+-",
)
options.escape_asterisks = True
options.code_language = "python"
options.extract_metadata = True
markdown = convert(html, options)
Avoid re-parsing the same option dictionaries inside hot loops by building a reusable handle:
from html_to_markdown import ConversionOptions, convert_with_handle, create_options_handle
handle = create_options_handle(ConversionOptions(hocr_spatial_tables=False))
for html in documents:
markdown = convert_with_handle(html, handle)
from html_to_markdown import ConversionOptions, PreprocessingOptions, convert
options = ConversionOptions(
...
)
preprocessing = PreprocessingOptions(
enabled=True,
preset="aggressive",
)
markdown = convert(scraped_html, options, preprocessing)
from html_to_markdown import InlineImageConfig, convert_with_inline_images
markdown, inline_images, warnings = convert_with_inline_images(
'<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
)
if inline_images:
first = inline_images[0]
print(first["format"], first["dimensions"], first["attributes"]) # e.g. "png", (1, 1), {"width": "1"}
Each inline image is returned as a typed dictionary (bytes payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
Extract comprehensive metadata (title, description, headers, links, images, structured data) during conversion in a single pass.
from html_to_markdown import convert_with_metadata
html = """
<html>
<head>
<title>Example Article</title>
<meta name="description" content="Demo page">
<link rel="canonical" href="https://example.com/article">
</head>
<body>
<h1 id="welcome">Welcome</h1>
<a href="https://example.com" rel="nofollow external">Example link</a>
<img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
</body>
</html>
"""
markdown, metadata = convert_with_metadata(html)
print(markdown)
print(metadata["document"]["title"]) # "Example Article"
print(metadata["headers"][0]["text"]) # "Welcome"
print(metadata["links"][0]["href"]) # "https://example.com"
print(metadata["images"][0]["dimensions"]) # (640, 480)
Control which metadata types are extracted using MetadataConfig:
from html_to_markdown import ConversionOptions, MetadataConfig, convert_with_metadata
options = ConversionOptions(heading_style="atx")
config = MetadataConfig(
extract_headers=True, # h1-h6 elements (default: True)
extract_links=True, # <a> hyperlinks (default: True)
extract_images=True, # <img> elements (default: True)
extract_structured_data=True, # JSON-LD, Microdata, RDFa (default: True)
max_structured_data_size=1_000_000, # Max bytes for structured data (default: 100KB)
)
markdown, metadata = convert_with_metadata(html, options, config)
The metadata dictionary contains five categories:
metadata = {
"document": { # Document-level metadata from <head>
"title": str | None,
"description": str | None,
"keywords": list[str], # Comma-separated keywords from meta tags
"author": str | None,
"canonical_url": str | None, # link[rel="canonical"] href
"base_href": str | None,
"language": str | None, # lang attribute (e.g., "en")
"text_direction": str | None, # "ltr", "rtl", or "auto"
"open_graph": dict[str, str], # og:* meta properties
"twitter_card": dict[str, str], # twitter:* meta properties
"meta_tags": dict[str, str], # Other meta tag properties
},
"headers": [ # h1-h6 elements with hierarchy
{
"level": int, # 1-6
"text": str, # Normalized text content
"id": str | None, # HTML id attribute
"depth": int, # Nesting depth in document tree
"html_offset": int, # Byte offset in original HTML
},
# ... more headers
],
"links": [ # Extracted <a> elements
{
"href": str,
"text": str,
"title": str | None,
"link_type": str, # "anchor" | "internal" | "external" | "email" | "phone" | "other"
"rel": list[str], # rel attribute values
"attributes": dict[str, str], # Other HTML attributes
},
# ... more links
],
"images": [ # Extracted <img> elements
{
"src": str, # Image source (URL or data URI)
"alt": str | None,
"title": str | None,
"dimensions": tuple[int, int] | None, # (width, height)
"image_type": str, # "data_uri" | "inline_svg" | "external" | "relative"
"attributes": dict[str, str],
},
# ... more images
],
"structured_data": [ # JSON-LD, Microdata, RDFa blocks
{
"data_type": str, # "json_ld" | "microdata" | "rdfa"
"raw_json": str, # JSON string representation
"schema_type": str | None, # Detected schema type (e.g., "Article")
},
# ... more structured data
],
}
Extract Article Metadata for SEO
from html_to_markdown import convert_with_metadata
def extract_article_metadata(html: str) -> dict:
markdown, metadata = convert_with_metadata(html)
doc = metadata["document"]
return {
"title": doc.get("title"),
"description": doc.get("description"),
"keywords": doc.get("keywords", []),
"author": doc.get("author"),
"canonical_url": doc.get("canonical_url"),
"language": doc.get("language"),
"open_graph": doc.get("open_graph", {}),
"twitter_card": doc.get("twitter_card", {}),
"markdown": markdown,
}
# Usage
seo_data = extract_article_metadata(html)
print(f"Title: {seo_data['title']}")
print(f"Language: {seo_data['language']}")
print(f"OG Image: {seo_data['open_graph'].get('image')}")
Build Table of Contents
from html_to_markdown import convert_with_metadata
def build_table_of_contents(html: str) -> list[dict]:
"""Generate a nested TOC from header structure."""
markdown, metadata = convert_with_metadata(html)
headers = metadata["headers"]
toc = []
for header in headers:
toc.append({
"level": header["level"],
"text": header["text"],
"anchor": header.get("id") or header["text"].lower().replace(" ", "-"),
})
return toc
# Usage
toc = build_table_of_contents(html)
for item in toc:
indent = " " * (item["level"] - 1)
print(f"{indent}- [{item['text']}](#{item['anchor']})")
Validate Links and Accessibility
from html_to_markdown import convert_with_metadata
def check_accessibility(html: str) -> dict:
"""Find common accessibility and SEO issues."""
markdown, metadata = convert_with_metadata(html)
return {
"images_without_alt": [
img for img in metadata["images"]
if not img.get("alt")
],
"links_without_text": [
link for link in metadata["links"]
if not link.get("text", "").strip()
],
"external_links_count": len([
link for link in metadata["links"]
if link["link_type"] == "external"
]),
"broken_anchors": [
link for link in metadata["links"]
if link["link_type"] == "anchor"
],
}
# Usage
issues = check_accessibility(html)
if issues["images_without_alt"]:
print(f"Found {len(issues['images_without_alt'])} images without alt text")
Extract Structured Data (JSON-LD, Microdata)
from html_to_markdown import convert_with_metadata
import json
def extract_json_ld_schemas(html: str) -> list[dict]:
"""Extract all JSON-LD structured data blocks."""
markdown, metadata = convert_with_metadata(html)
schemas = []
for block in metadata["structured_data"]:
if block["data_type"] == "json_ld":
try:
schema = json.loads(block["raw_json"])
schemas.append({
"type": block.get("schema_type"),
"data": schema,
})
except json.JSONDecodeError:
continue
return schemas
# Usage
schemas = extract_json_ld_schemas(html)
for schema in schemas:
print(f"Found {schema['type']} schema:")
print(json.dumps(schema["data"], indent=2))
Migrate Content with Preservation of Links and Images
from html_to_markdown import convert_with_metadata
def migrate_with_manifest(html: str, base_url: str) -> tuple[str, dict]:
"""Convert to Markdown while capturing all external references."""
markdown, metadata = convert_with_metadata(html)
manifest = {
"title": metadata["document"].get("title"),
"external_links": [
{"url": link["href"], "text": link["text"]}
for link in metadata["links"]
if link["link_type"] == "external"
],
"external_images": [
{"url": img["src"], "alt": img.get("alt")}
for img in metadata["images"]
if img["image_type"] == "external"
],
}
return markdown, manifest
# Usage
md, manifest = migrate_with_manifest(html, "https://example.com")
print(f"Converted: {manifest['title']}")
print(f"External resources: {len(manifest['external_links'])} links, {len(manifest['external_images'])} images")
Check if metadata extraction is available at runtime:
from html_to_markdown import convert_with_metadata, convert
try:
# Try to use metadata extraction
markdown, metadata = convert_with_metadata(html)
print(f"Metadata available: {metadata['document'].get('title')}")
except (NameError, TypeError):
# Fallback for builds without metadata feature
markdown = convert(html)
print("Metadata feature not available, using basic conversion")
Metadata extraction is designed to be robust:
from html_to_markdown import convert_with_metadata, MetadataConfig
# Handle large structured data safely
config = MetadataConfig(
extract_structured_data=True,
max_structured_data_size=500_000, # 500KB limit
)
try:
markdown, metadata = convert_with_metadata(html, metadata_config=config)
# Safe access with defaults
title = metadata["document"].get("title", "Untitled")
headers = metadata["headers"] or []
images = metadata["images"] or []
except Exception as e:
# Handle parsing errors gracefully
print(f"Extraction error: {e}")
# Fallback to basic conversion
from html_to_markdown import convert
markdown = convert(html)
MetadataConfig to reduce overhead.from html_to_markdown import MetadataConfig, convert_with_metadata
# Optimize for performance
config = MetadataConfig(
extract_headers=True,
extract_links=False, # Skip if not needed
extract_images=False, # Skip if not needed
extract_structured_data=False, # Skip if not needed
)
markdown, metadata = convert_with_metadata(html, metadata_config=config)
When extract_metadata=True (default in ConversionOptions), basic metadata is embedded in a YAML frontmatter block:
from html_to_markdown import convert, ConversionOptions
# Basic metadata as YAML frontmatter
options = ConversionOptions(extract_metadata=True)
markdown = convert(html, options)
# Output: "---\ntitle: ...\n---\n\nContent..."
# Rich metadata extraction (all metadata types)
from html_to_markdown import convert_with_metadata
markdown, full_metadata = convert_with_metadata(html)
# Returns structured data dict with headers, links, images, etc.
The two approaches serve different purposes:
extract_metadata=True: Embeds basic metadata in the output Markdownconvert_with_metadata(): Returns structured metadata for programmatic accessfrom html_to_markdown import ConversionOptions, convert
# Default: emit structured Markdown directly
markdown = convert(hocr_html)
# hOCR documents are detected automatically; tables are reconstructed without extra configuration.
markdown = convert(hocr_html)
pipx install html-to-markdown # or: pip install html-to-markdown
html-to-markdown page.html > page.md
cat page.html | html-to-markdown --heading-style atx > page.md
ConversionOptionsKey fields (see docstring for full matrix):
heading_style: "underlined" | "atx" | "atx_closed"list_indent_width: spaces per indent level (default 2)bullets: cycle of bullet characters ("*+-")strong_em_symbol: "*" or "_"code_language: default fenced code block languagewrap, wrap_width: wrap Markdown outputstrip_tags: remove specific HTML tagspreprocessing: PreprocessingOptionsencoding: input character encoding (informational)PreprocessingOptionsenabled: enable HTML sanitisation (default: True since v2.4.2 for robust malformed HTML handling)preset: "minimal" | "standard" | "aggressive" (default: "standard")remove_navigation: remove navigation elements (default: True)remove_forms: remove form elements (default: True)Note: As of v2.4.2, preprocessing is enabled by default to ensure robust handling of malformed HTML (e.g., bare angle brackets like 1<2 in content). Set enabled=False if you need minimal preprocessing.
InlineImageConfigmax_decoded_size_bytes: reject larger payloadsfilename_prefix: generated name prefix (embedded_image default)capture_svg: collect inline <svg> (default True)infer_dimensions: decode raster images to obtain dimensions (default False)The v2 API (convert()) is strongly recommended for all code. The v1 compatibility layer adds significant overhead and should only be used for gradual migration:
# ✅ RECOMMENDED - V2 Direct API (Fast)
from html_to_markdown import convert, ConversionOptions
markdown = convert(html) # Simple conversion - FAST
markdown = convert(html, ConversionOptions(heading_style="atx")) # With options - FAST
# ❌ AVOID - V1 Compatibility Layer (Slow)
from html_to_markdown import convert_to_markdown
markdown = convert_to_markdown(html, heading_style="atx") # Adds 77% overhead
Benchmarked on Apple M4 with 25-paragraph HTML document:
| API | ops/sec | Relative Performance | Recommendation |
|---|---|---|---|
V2 API (convert()) | 129,822 | baseline | ✅ Use this |
| V1 Compat Layer | 67,673 | 77% slower | ⚠️ Migration only |
| CLI | 150-210 MB/s | Fastest | ✅ Batch processing |
The v1 compatibility layer creates extra Python objects and performs additional conversions, significantly impacting performance.
convert()): All new code, production systems, performance-critical applications ← Use thisconvert_to_markdown()): Only for gradual migration from legacy codebaseshtml-to-markdown): Batch processing, shell scripts, maximum throughputA compatibility layer is provided to ease migration from v1.x:
html_to_markdown.v1_compat exposes convert_to_markdown, convert_to_markdown_stream, and markdownify. Keyword mappings are listed in the changelog.html-to-markdown --help.code_language_callback, strip, and streaming APIs were removed; use ConversionOptions, PreprocessingOptions, and the inline-image helpers instead.MIT License – see LICENSE.
If you find this library useful, consider sponsoring the project.
FAQs
High-performance HTML to Markdown converter powered by Rust with a clean Python API
We found that html-to-markdown demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Security News
Socket CEO Feross Aboukhadijeh joins Software Engineering Daily to discuss modern software supply chain attacks and rising AI-driven security risks.

Security News
GitHub has revoked npm classic tokens for publishing; maintainers must migrate, but OpenJS warns OIDC trusted publishing still has risky gaps for critical projects.

Security News
Rust’s crates.io team is advancing an RFC to add a Security tab that surfaces RustSec vulnerability and unsoundness advisories directly on crate pages.