
docx-parser-converter
Python implementation of the DOCX parser and converter. Built with Python 3.10+, Pydantic models, and lxml.
For installation and quick start, see the main README.
Version 1.0.0 introduces a completely rewritten API. If you're upgrading from a previous version, read CHANGELOG.md for the full migration guide.
Old API (deprecated):
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
from docx_parser_converter.docx_to_html.docx_to_html_converter import DocxToHtmlConverter
docx_content = read_binary_from_file_path("document.docx")
converter = DocxToHtmlConverter(docx_content)
html = converter.convert_to_html()
New API (recommended):
from docx_parser_converter import docx_to_html
html = docx_to_html("document.docx")
The old API still works but emits deprecation warnings. It will be removed in a future version.
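While migrating, it can help to list which calls still trigger deprecation warnings. The sketch below uses only the standard `warnings` module. `legacy_convert` is a hypothetical stand-in for any deprecated call, not part of the library, and the `DeprecationWarning` category is an assumption about how the old API reports itself.

```python
import warnings


def legacy_convert() -> str:
    """Hypothetical stand-in for a deprecated call; not part of the library."""
    warnings.warn("legacy_convert is deprecated", DeprecationWarning, stacklevel=2)
    return "<p>converted</p>"


def find_deprecated_calls(func) -> list[str]:
    """Run func and return the messages of any DeprecationWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        func()
    return [str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)]


messages = find_deprecated_calls(legacy_convert)
print(messages)
```

In a test suite, running `pytest -W error::DeprecationWarning` turns such warnings into failures, which makes leftover old-API calls easy to spot.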
Use ConversionConfig to customize the conversion:
from docx_parser_converter import ConversionConfig, docx_to_html, docx_to_text
# HTML conversion options
config = ConversionConfig(
    # HTML-specific options
    title="My Document",                 # Document title in <title> tag
    language="en",                       # HTML lang attribute
    style_mode="inline",                 # "inline", "class", or "none"
    use_semantic_tags=False,             # Use CSS spans (False) vs <strong>, <em> (True)
    fragment_only=False,                 # Output just content without HTML wrapper
    custom_css="body { margin: 2em; }",  # Custom CSS to include
    responsive=True,                     # Include viewport meta tag
    # Text-specific options
    text_formatting="plain",             # "plain" or "markdown"
    table_mode="auto",                   # "auto", "ascii", "tabs", or "plain"
    paragraph_separator="\n\n",          # Separator between paragraphs
)
html = docx_to_html("document.docx", config=config)
text = docx_to_text("document.docx", config=config)
| Option | Type | Default | Description |
|---|---|---|---|
| style_mode | "inline" \| "class" \| "none" | "inline" | How to output CSS styles |
| use_semantic_tags | bool | False | Use semantic tags (<strong>, <em>) vs CSS spans |
| preserve_whitespace | bool | False | Preserve whitespace in content |
| title | str | "" | Document title for HTML output |
| language | str | "en" | HTML lang attribute |
| fragment_only | bool | False | Output only content, no HTML wrapper |
| custom_css | str \| None | None | Custom CSS to include |
| css_files | list[str] | [] | External CSS files to reference |
| responsive | bool | True | Include viewport meta tag |
| include_print_styles | bool | False | Include print media query styles |
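To make the wrapper-related options concrete, here is a minimal sketch of how `title`, `language`, `fragment_only`, `custom_css`, and `responsive` could shape the final output. `wrap_html` is a hypothetical helper written for illustration. It is not the library's implementation, and the exact markup the library emits may differ.

```python
from html import escape
from typing import Optional


def wrap_html(
    body: str,
    title: str = "",
    language: str = "en",
    fragment_only: bool = False,
    custom_css: Optional[str] = None,
    responsive: bool = True,
) -> str:
    """Hypothetical helper: wrap converted content in an HTML document."""
    if fragment_only:
        return body  # caller wants just the content, no wrapper

    head = ['<meta charset="utf-8">']
    if responsive:
        head.append('<meta name="viewport" content="width=device-width, initial-scale=1">')
    head.append(f"<title>{escape(title)}</title>")
    if custom_css:
        head.append(f"<style>{custom_css}</style>")

    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(language)}">\n'
        f"<head>{''.join(head)}</head>\n"
        f"<body>{body}</body>\n"
        "</html>"
    )


print(wrap_html("<p>Hello</p>", title="My Document"))
```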
| Option | Type | Default | Description |
|---|---|---|---|
| text_formatting | "plain" \| "markdown" | "plain" | Output format |
| table_mode | "auto" \| "ascii" \| "tabs" \| "plain" | "auto" | Table rendering mode |
| paragraph_separator | str | "\n\n" | Separator between paragraphs |
| preserve_empty_paragraphs | bool | True | Preserve empty paragraphs |
auto: Automatically selects ASCII for tables with visible borders, tabs for others
ascii: ASCII box drawing characters (+, -, |)
tabs: Tab-separated columns
plain: Space-separated columns
Example ASCII table output:
+----------+----------+
| Header 1 | Header 2 |
+----------+----------+
| Cell 1   | Cell 2   |
+----------+----------+
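The ASCII layout above can be reproduced with a few lines of Python. This is a minimal sketch of the general technique (pad each column to its widest cell, then draw a border after every row). `render_ascii_table` is a hypothetical name, and the library's own renderer may handle merged cells, wrapping, and borders differently.

```python
def render_ascii_table(rows: list[list[str]]) -> str:
    """Render rows as an ASCII box table with a border after every row."""
    n_cols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [row + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its widest cell.
    widths = [max(len(row[i]) for row in padded) for i in range(n_cols)]

    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    lines = [border]
    for row in padded:
        cells = (f" {cell.ljust(w)} " for cell, w in zip(row, widths))
        lines.append("|" + "|".join(cells) + "|")
        lines.append(border)
    return "\n".join(lines)


table = render_ascii_table([["Header 1", "Header 2"], ["Cell 1", "Cell 2"]])
print(table)
```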
When using text_formatting="markdown", formatting is preserved:
config = ConversionConfig(text_formatting="markdown")
text = docx_to_text("document.docx", config=config)
# Output: "This is **bold** and *italic* text."
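As a rough illustration of what Markdown formatting involves, the sketch below maps formatted text runs to Markdown emphasis markers. The `Run` dataclass and both functions are hypothetical and stand in for the library's internal models, which are not shown here.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical text run with formatting flags."""
    text: str
    bold: bool = False
    italic: bool = False


def run_to_markdown(run: Run) -> str:
    """Wrap the run's text in Markdown emphasis markers."""
    text = run.text
    if run.italic:
        text = f"*{text}*"
    if run.bold:
        text = f"**{text}**"
    return text


def runs_to_markdown(runs: list[Run]) -> str:
    """Concatenate the Markdown for every run in a paragraph."""
    return "".join(run_to_markdown(r) for r in runs)


runs = [
    Run("This is "),
    Run("bold", bold=True),
    Run(" and "),
    Run("italic", italic=True),
    Run(" text."),
]
line = runs_to_markdown(runs)
print(line)
```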
The library accepts multiple input types:
from pathlib import Path
from io import BytesIO
# File path as string
html = docx_to_html("document.docx")
# File path as Path object
html = docx_to_html(Path("document.docx"))
# Bytes content
with open("document.docx", "rb") as f:
    content = f.read()
html = docx_to_html(content)
# File-like object
with open("document.docx", "rb") as f:
    html = docx_to_html(f)
# None returns empty output
html = docx_to_html(None) # Returns empty HTML document
text = docx_to_text(None) # Returns ""
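One way to support all of these input types is to normalize them to raw bytes before parsing. The sketch below shows that idea with the standard library only. `read_source_bytes` and the `DocxSource` alias are hypothetical names, and the library may dispatch on input types differently.

```python
import os
from pathlib import Path
from typing import BinaryIO, Optional, Union

DocxSource = Union[str, Path, bytes, BinaryIO, None]


def read_source_bytes(source: DocxSource) -> Optional[bytes]:
    """Hypothetical helper: return raw DOCX bytes, or None for empty input."""
    if source is None:
        return None  # caller decides what "empty output" means
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, os.PathLike)):
        return Path(source).read_bytes()  # strings are treated as file paths
    if hasattr(source, "read"):
        return source.read()  # file-like object opened in binary mode
    raise TypeError(f"Unsupported input type: {type(source).__name__}")
```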
The library provides specific exceptions for different error cases:
from docx_parser_converter import docx_to_html
try:
    html = docx_to_html("document.docx")
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid DOCX: {e}")
except Exception as e:
    print(f"Error: {e}")
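If you want to reject obviously invalid uploads before calling the converter, a cheap pre-check is possible because a DOCX file is a ZIP archive whose main part conventionally lives at `word/document.xml`. `check_docx_bytes` below is a hypothetical helper, not a library function, and it only catches the most basic problems.

```python
import io
import zipfile

MAIN_PART = "word/document.xml"  # conventional location of the main document part


def check_docx_bytes(data: bytes) -> None:
    """Hypothetical pre-check: raise ValueError if data is not a plausible DOCX."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        raise ValueError("not a ZIP archive")
    with zipfile.ZipFile(buffer) as archive:
        if MAIN_PART not in archive.namelist():
            raise ValueError(f"missing {MAIN_PART}")
```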
Images are extracted from DOCX files and embedded in HTML as base64 data URLs. Browser rendering support varies by format:
| Format | Extensions | Browser Support |
|---|---|---|
| PNG | .png | ✅ Full |
| JPEG | .jpg, .jpeg | ✅ Full |
| GIF | .gif | ✅ Full (including animation) |
| WebP | .webp | ✅ Full |
| SVG | .svg | ✅ Full |
| BMP | .bmp | ✅ Full |
| TIFF | .tif, .tiff | ⚠️ Safari only |
| EMF | .emf | ❌ Not supported |
| WMF | .wmf | ❌ Not supported |
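For reference, a base64 data URL is just a MIME type plus the base64-encoded bytes. The sketch below shows the encoding step with the standard library. `to_data_url` and the `MIME_TYPES` mapping are hypothetical and cover only the formats marked as fully supported in the table above.

```python
import base64

# MIME types for the formats browsers render directly.
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".svg": "image/svg+xml",
    ".bmp": "image/bmp",
}


def to_data_url(data: bytes, extension: str) -> str:
    """Hypothetical helper: encode image bytes as a base64 data URL."""
    mime = MIME_TYPES.get(extension.lower(), "application/octet-stream")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


url = to_data_url(b"\x89PNG", ".png")
print(url)
```

The resulting string can be used directly as the `src` attribute of an `<img>` element.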
Notes:
# Clone the repository
git clone https://github.com/omer-go/docx-parser-converter.git
cd docx-parser-converter/docx_parser_converter_python
# Install PDM (if not already installed)
pip install pdm
# Install dependencies
pdm install
# Install dev dependencies
pdm install -G dev
# Run all tests
pdm run pytest
# Run with coverage
pdm run pytest --cov
# Run specific test file
pdm run pytest tests/unit/test_api.py
# Type checking
pdm run pyright
# Linting
pdm run ruff check .
# Formatting
pdm run ruff format .
docx_parser_converter_python/
├── api.py # Public API (docx_to_html, docx_to_text, ConversionConfig)
├── core/ # Core utilities
│ ├── docx_reader.py # DOCX file opening and validation
│ ├── xml_extractor.py # XML content extraction
│ ├── constants.py # XML namespaces and paths
│ └── exceptions.py # Custom exceptions
├── models/ # Pydantic models
│ ├── common/ # Shared models (Color, Border, Spacing, etc.)
│ ├── document/ # Document models (Paragraph, Run, Table, etc.)
│ ├── numbering/ # Numbering definitions
│ └── styles/ # Style definitions
├── parsers/ # XML to Pydantic conversion
│ ├── common/ # Common element parsers
│ ├── document/ # Document element parsers
│ ├── numbering/ # Numbering parsers
│ └── styles/ # Style parsers
├── converters/ # Model to output conversion
│ ├── common/ # Style resolution, numbering tracking
│ ├── html/ # HTML conversion
│ └── text/ # Text conversion
└── tests/ # Test suite
    ├── unit/ # Unit tests
    ├── integration/ # Integration tests
    └── fixtures/ # Test DOCX files
The library follows a three-phase conversion process:
Parse: DOCX XML → Pydantic models
Resolve: Apply style inheritance
Convert: Models → Output format
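The three phases can be sketched end to end in a few dozen lines. This is a toy illustration under stated assumptions, not the library's code: it uses `dataclasses` and `xml.etree.ElementTree` instead of Pydantic and lxml so it runs without dependencies, and `Style`, `Paragraph`, `parse`, `resolve`, and `convert` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from html import escape
from typing import Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


@dataclass
class Style:
    """Hypothetical style model: own properties plus an optional parent."""
    props: dict = field(default_factory=dict)
    based_on: Optional[str] = None


@dataclass
class Paragraph:
    """Hypothetical paragraph model; css is filled in by resolve()."""
    text: str
    style_id: Optional[str] = None
    css: dict = field(default_factory=dict)


def parse(xml: str) -> list[Paragraph]:
    """Phase 1: XML -> models."""
    root = ET.fromstring(xml)
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        paragraphs.append(Paragraph(text, style_id))
    return paragraphs


def resolve(paragraphs: list[Paragraph], styles: dict[str, Style]) -> None:
    """Phase 2: walk each based_on chain so child properties override parents."""
    for para in paragraphs:
        chain, style_id = [], para.style_id
        while style_id in styles and style_id not in chain:  # guard against cycles
            chain.append(style_id)
            style_id = styles[style_id].based_on
        for sid in reversed(chain):  # apply the root ancestor first
            para.css.update(styles[sid].props)


def convert(paragraphs: list[Paragraph]) -> str:
    """Phase 3: models -> HTML."""
    out = []
    for para in paragraphs:
        style = "; ".join(f"{k}: {v}" for k, v in para.css.items())
        attr = f' style="{escape(style)}"' if style else ""
        out.append(f"<p{attr}>{escape(para.text)}</p>")
    return "\n".join(out)


xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Quote"/></w:pPr><w:r><w:t>Hello</w:t></w:r></w:p>'
    "</w:body></w:document>"
)
styles = {
    "Normal": Style({"font-size": "11pt", "color": "black"}),
    "Quote": Style({"color": "gray"}, based_on="Normal"),
}
paragraphs = parse(xml)
resolve(paragraphs, styles)
html = convert(paragraphs)
print(html)
```

Keeping style resolution separate from parsing means each output format can reuse the same resolved models, which matches the split between `parsers/` and `converters/` in the layout above.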
MIT License
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.