html-to-markdown

High-performance HTML to Markdown converter powered by Rust with a clean Python API

Registry: PyPI (pip)
Version: 2.11.4
Maintainers: 1

High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). The same engine also drives the Node.js, Ruby, PHP, and WebAssembly bindings, so rendered Markdown stays identical across runtimes. Wheels are published for Linux, macOS, and Windows.

Installation

pip install html-to-markdown

Performance Snapshot

Apple M4 • Real Wikipedia documents • convert() (Python)

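| Document | Size | Latency | Throughput | Docs/sec |
| --- | --- | --- | --- | --- |
| Lists (Timeline) | 129 KB | 0.62 ms | 208 MB/s | 1,613 |
| Tables (Countries) | 360 KB | 2.02 ms | 178 MB/s | 495 |
| Mixed (Python wiki) | 656 KB | 4.56 ms | 144 MB/s | 219 |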

V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2's Rust engine delivers 60–80× higher throughput.
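
The throughput and docs/sec columns follow directly from document size and latency. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 1000 KB); the helper names are hypothetical and the figures are copied from the table above:

```python
# Figures copied from the Performance Snapshot table: (name, size in KB, latency in ms).
ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]
V1_MB_PER_S = 2.5  # approximate v1 average quoted above


def throughput_mb_per_s(size_kb: float, latency_ms: float) -> float:
    """KB per millisecond equals MB per second when 1 MB = 1000 KB."""
    return size_kb / latency_ms


def docs_per_sec(latency_ms: float) -> float:
    """Documents converted per second at a given per-document latency."""
    return 1000.0 / latency_ms


for name, size_kb, latency_ms in ROWS:
    mbps = throughput_mb_per_s(size_kb, latency_ms)
    print(
        f"{name}: {mbps:.0f} MB/s, {docs_per_sec(latency_ms):.0f} docs/sec, "
        f"{mbps / V1_MB_PER_S:.0f}x the v1 average"
    )
```

The three rows come out at roughly 58x to 83x the v1 figure, in line with the 60–80× range quoted above.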

Benchmark Fixtures (Apple M4)

Pulled directly from tools/runtime-bench (task bench:bindings -- --language python) so they stay in lockstep with the Rust core:

| Document | Size | ops/sec (Python) |
| --- | --- | --- |
| Lists (Timeline) | 129 KB | 1,405 |
| Tables (Countries) | 360 KB | 352 |
| Medium (Python) | 657 KB | 158 |
| Large (Rust) | 567 KB | 183 |
| Small (Intro) | 463 KB | 223 |
| hOCR German PDF | 44 KB | 2,991 |
| hOCR Invoice | 4 KB | 23,500 |
| hOCR Embedded Tables | 37 KB | 3,464 |

Re-run locally with task bench:bindings -- --language python --output tmp.json to compare against CI history.
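
For a quick sanity check outside the project's benchmark tooling, a small timing loop is often enough. The sketch below is not the tools/runtime-bench harness; `ops_per_sec` is a hypothetical helper that times any callable, and a stand-in function is used so the example runs without the package installed:

```python
import time
from typing import Callable


def ops_per_sec(fn: Callable[[str], str], payload: str, min_seconds: float = 0.2) -> float:
    """Call fn(payload) repeatedly for at least min_seconds and return calls per second."""
    fn(payload)  # warm-up call, excluded from the measurement
    calls = 0
    start = time.perf_counter()
    while True:
        fn(payload)
        calls += 1
        elapsed = time.perf_counter() - start
        # Stop once the time budget is used up (and the clock has advanced).
        if elapsed >= min_seconds and elapsed > 0:
            break
    return calls / elapsed


if __name__ == "__main__":
    # Stand-in converter so the sketch runs anywhere;
    # pass html_to_markdown.convert instead to measure the real conversion.
    sample = "<p>hello</p>" * 1000
    print(f"{ops_per_sec(str.upper, sample):,.0f} ops/sec")
```

Numbers from a loop like this depend on the machine and on the document, so they are only comparable with each other, not with the table above.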

Quick Start

from html_to_markdown import convert

html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
    <li>Blazing fast</li>
    <li>Type safe</li>
    <li>Easy to use</li>
</ul>
"""

markdown = convert(html)
print(markdown)

Configuration (v2 API)

from html_to_markdown import ConversionOptions, convert

options = ConversionOptions(
    heading_style="atx",
    list_indent_width=2,
    bullets="*+-",
)
options.escape_asterisks = True
options.code_language = "python"
options.extract_metadata = True

markdown = convert(html, options)

Reusing Parsed Options

Avoid re-parsing the same option dictionaries inside hot loops by building a reusable handle:

from html_to_markdown import ConversionOptions, convert_with_handle, create_options_handle

handle = create_options_handle(ConversionOptions(hocr_spatial_tables=False))

for html in documents:
    markdown = convert_with_handle(html, handle)
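
A typical hot loop is converting every HTML file in a directory tree. The sketch below assumes the handle functions behave as shown above; `iter_html_files` and `convert_directory` are hypothetical helpers, and writing the `.md` file next to its source is an arbitrary choice:

```python
from pathlib import Path
from typing import Iterator


def iter_html_files(root: Path) -> Iterator[Path]:
    """Yield .html/.htm files under root in a stable (sorted) order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            yield path


def convert_directory(root: Path) -> int:
    """Write a .md file next to every HTML file under root; return the count."""
    # Imported lazily so the file-discovery helper works without the package.
    from html_to_markdown import ConversionOptions, convert_with_handle, create_options_handle

    # Build the handle once, outside the loop, so options are parsed a single time.
    handle = create_options_handle(ConversionOptions(heading_style="atx"))
    count = 0
    for path in iter_html_files(root):
        markdown = convert_with_handle(path.read_text(encoding="utf-8"), handle)
        path.with_suffix(".md").write_text(markdown, encoding="utf-8")
        count += 1
    return count
```

Calling `convert_directory(Path("site"))` would convert everything under a `site` directory while reusing the same handle for every file.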

HTML Preprocessing

from html_to_markdown import ConversionOptions, PreprocessingOptions, convert

options = ConversionOptions(
    ...
)

preprocessing = PreprocessingOptions(
    enabled=True,
    preset="aggressive",
)

markdown = convert(scraped_html, options, preprocessing)

Inline Image Extraction

from html_to_markdown import InlineImageConfig, convert_with_inline_images

markdown, inline_images, warnings = convert_with_inline_images(
    '<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
    image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
)

if inline_images:
    first = inline_images[0]
    print(first["format"], first["dimensions"], first["attributes"])  # e.g. "png", (1, 1), {"width": "1"}

Each inline image is returned as a typed dictionary (bytes payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
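
Once extracted, the images usually need to be written somewhere. The sketch below is a hypothetical helper: the `"format"` key appears in the example above, but the `"data"` key for the bytes payload and the file-naming scheme are assumptions, so check the typed dictionary's definition before relying on them:

```python
from pathlib import Path
from typing import Any, Mapping, Sequence


def save_inline_images(
    images: Sequence[Mapping[str, Any]],
    out_dir: Path,
    data_key: str = "data",  # ASSUMPTION: key that holds the bytes payload
) -> list[Path]:
    """Write each image's bytes to out_dir and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, image in enumerate(images):
        # "format" is the key shown in the example above (e.g. "png").
        # The file name itself is this sketch's own convention.
        target = out_dir / f"embedded_image_{index}.{image['format']}"
        target.write_bytes(image[data_key])
        written.append(target)
    return written
```

With the result of `convert_with_inline_images`, this would be called as `save_inline_images(inline_images, Path("images"))`.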

hOCR (HTML OCR) Support

from html_to_markdown import convert

# hOCR documents are detected automatically: structured Markdown is emitted
# directly and tables are reconstructed without extra configuration.
markdown = convert(hocr_html)

CLI (same engine)

pipx install html-to-markdown  # or: pip install html-to-markdown

html-to-markdown page.html > page.md
cat page.html | html-to-markdown --heading-style atx > page.md
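
The CLI can also be driven from Python when a script needs the standalone binary. A minimal sketch, assuming `html-to-markdown` is on `PATH` and reads HTML from stdin as in the pipe example above; `build_command` and `convert_via_cli` are hypothetical helper names, and only the `--heading-style` flag shown above is used:

```python
import subprocess
from typing import Optional


def build_command(heading_style: Optional[str] = None) -> list[str]:
    """Build the argv for a stdin-to-stdout conversion."""
    command = ["html-to-markdown"]
    if heading_style is not None:
        # --heading-style is the flag used in the shell example above.
        command += ["--heading-style", heading_style]
    return command


def convert_via_cli(html: str, heading_style: Optional[str] = None) -> str:
    """Pipe html through the CLI and return the Markdown written to stdout."""
    result = subprocess.run(
        build_command(heading_style),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Other flags can be discovered with `html-to-markdown --help` and appended to the list returned by `build_command`.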

API Surface

ConversionOptions

Key fields (see docstring for full matrix):

  • heading_style: "underlined" | "atx" | "atx_closed"
  • list_indent_width: spaces per indent level (default 2)
  • bullets: cycle of bullet characters ("*+-")
  • strong_em_symbol: "*" or "_"
  • code_language: default fenced code block language
  • wrap, wrap_width: wrap Markdown output
  • strip_tags: remove specific HTML tags
  • preprocessing: PreprocessingOptions
  • encoding: input character encoding (informational)

PreprocessingOptions

  • enabled: enable HTML sanitisation (default: True since v2.4.2 for robust malformed HTML handling)
  • preset: "minimal" | "standard" | "aggressive" (default: "standard")
  • remove_navigation: remove navigation elements (default: True)
  • remove_forms: remove form elements (default: True)

Note: As of v2.4.2, preprocessing is enabled by default so that malformed HTML (e.g., bare angle brackets like 1<2 in content) is handled robustly. Set enabled=False to opt out.

InlineImageConfig

  • max_decoded_size_bytes: reject larger payloads
  • filename_prefix: prefix for generated file names (default "embedded_image")
  • capture_svg: collect inline <svg> (default True)
  • infer_dimensions: decode raster images to obtain dimensions (default False)

Performance: V2 vs V1 Compatibility Layer

⚠️ Important: Always Use V2 API

The v2 API (convert()) is strongly recommended for all code. The v1 compatibility layer adds significant overhead and should only be used for gradual migration:

# ✅ RECOMMENDED - V2 Direct API (Fast)
from html_to_markdown import convert, ConversionOptions

markdown = convert(html)  # Simple conversion - FAST
markdown = convert(html, ConversionOptions(heading_style="atx"))  # With options - FAST

# ❌ AVOID - V1 Compatibility Layer (Slow)
from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(html, heading_style="atx")  # Adds 77% overhead

Performance Comparison

Benchmarked on an Apple M4 with a 25-paragraph HTML document:

| API | Throughput | Relative Performance | Recommendation |
| --- | --- | --- | --- |
| V2 API (convert()) | 129,822 ops/sec | baseline | Use this |
| V1 Compat Layer | 67,673 ops/sec | 77% slower | ⚠️ Migration only |
| CLI | 150-210 MB/s | Fastest | ✅ Batch processing |

The v1 compatibility layer creates extra Python objects and performs additional conversions, significantly impacting performance.
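
To read the ops/sec figures as per-call cost, they can be inverted. A minimal sketch using only the two ops/sec values from the table above; the helper names are hypothetical:

```python
V2_OPS_PER_SEC = 129_822  # convert(), from the table above
V1_OPS_PER_SEC = 67_673   # compat layer, from the table above


def microseconds_per_call(ops_per_sec: float) -> float:
    """Average time of one call, in microseconds."""
    return 1_000_000 / ops_per_sec


def relative_throughput(candidate: float, baseline: float) -> float:
    """Candidate throughput as a fraction of the baseline."""
    return candidate / baseline


print(f"v2: {microseconds_per_call(V2_OPS_PER_SEC):.1f} us per call")
print(f"v1 compat: {microseconds_per_call(V1_OPS_PER_SEC):.1f} us per call")
print(f"v1 compat reaches {relative_throughput(V1_OPS_PER_SEC, V2_OPS_PER_SEC):.0%} of v2 throughput")
```

With these two figures, a call costs about 7.7 µs through convert() and about 14.8 µs through the compat layer, which reaches roughly 52% of the v2 throughput.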

When to Use Each

  • V2 API (convert()): All new code, production systems, performance-critical applications ← Use this
  • V1 Compat (convert_to_markdown()): Only for gradual migration from legacy codebases
  • CLI (html-to-markdown): Batch processing, shell scripts, maximum throughput

v1 Compatibility

A compatibility layer is provided to ease migration from v1.x:

  • Compat shim: html_to_markdown.v1_compat exposes convert_to_markdown, convert_to_markdown_stream, and markdownify. Keyword mappings are listed in the changelog.
  • ⚠️ Performance warning: These compatibility functions add 77% overhead. Migrate to the v2 API as soon as possible.
  • CLI: The Rust CLI replaces the old Python script. New flags are documented via html-to-markdown --help.
  • Removed: code_language_callback, strip, and the streaming APIs. Use ConversionOptions, PreprocessingOptions, and the inline-image helpers instead.

License

MIT License – see LICENSE.

Support

If you find this library useful, consider sponsoring the project.
