docx-parser-converter


A library for converting DOCX files to HTML and plain text

Package index: PyPI (pip)
Version: 1.0.3
Maintainers: 1

DOCX Parser Converter - Python

Python implementation of the DOCX parser and converter. Built with Python 3.10+, Pydantic models, and lxml.

For installation and quick start, see the main README.

⚠️ Breaking Changes in v1.0.0

Version 1.0.0 introduces a completely rewritten API. If you're upgrading from an earlier version, see CHANGELOG.md for the full migration guide.

Quick Migration

Old API (deprecated):

from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
from docx_parser_converter.docx_to_html.docx_to_html_converter import DocxToHtmlConverter

docx_content = read_binary_from_file_path("document.docx")
converter = DocxToHtmlConverter(docx_content)
html = converter.convert_to_html()

New API (recommended):

from docx_parser_converter import docx_to_html

html = docx_to_html("document.docx")

The old API still works but emits deprecation warnings. It will be removed in a future version.
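While migrating, it can help to list which calls still trigger deprecation warnings. The following is a minimal sketch using only the standard `warnings` module; `legacy_convert` is a hypothetical stand-in for a call into the old API, not a function from this library.

```python
import warnings


def legacy_convert() -> str:
    """Hypothetical stand-in for a call into the deprecated API."""
    warnings.warn("old API is deprecated", DeprecationWarning, stacklevel=2)
    return "<html></html>"


def collect_deprecations(func) -> list:
    """Run func and return the messages of any DeprecationWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        func()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]


print(collect_deprecations(legacy_convert))
```

Alternatively, running a test suite with `python -W error::DeprecationWarning -m pytest` turns every deprecation warning into an exception, which makes remaining old-API call sites fail loudly.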

Configuration

Use ConversionConfig to customize the conversion:

from docx_parser_converter import ConversionConfig, docx_to_html, docx_to_text

# HTML conversion options
config = ConversionConfig(
    # HTML-specific options
    title="My Document",           # Document title in <title> tag
    language="en",                 # HTML lang attribute
    style_mode="inline",           # "inline", "class", or "none"
    use_semantic_tags=False,       # Use CSS spans (False) vs <strong>, <em> (True)
    fragment_only=False,           # Output just content without HTML wrapper
    custom_css="body { margin: 2em; }",  # Custom CSS to include
    responsive=True,               # Include viewport meta tag

    # Text-specific options
    text_formatting="plain",       # "plain" or "markdown"
    table_mode="auto",             # "auto", "ascii", "tabs", or "plain"
    paragraph_separator="\n\n",    # Separator between paragraphs
)

html = docx_to_html("document.docx", config=config)
text = docx_to_text("document.docx", config=config)

Configuration Options

HTML Options

Option                Type                         Default   Description
style_mode            "inline" | "class" | "none"  "inline"  How to output CSS styles
use_semantic_tags     bool                         False     Use semantic tags (<strong>, <em>) vs CSS spans
preserve_whitespace   bool                         False     Preserve whitespace in content
title                 str                          ""        Document title for HTML output
language              str                          "en"      HTML lang attribute
fragment_only         bool                         False     Output only content, no HTML wrapper
custom_css            str | None                   None      Custom CSS to include
css_files             list[str]                    []        External CSS files to reference
responsive            bool                         True      Include viewport meta tag
include_print_styles  bool                         False     Include print media query styles

Text Options

Option                     Type                                 Default  Description
text_formatting            "plain" | "markdown"                 "plain"  Output format
table_mode                 "auto" | "ascii" | "tabs" | "plain"  "auto"   Table rendering mode
paragraph_separator        str                                  "\n\n"   Separator between paragraphs
preserve_empty_paragraphs  bool                                 True     Preserve empty paragraphs

Table Rendering Modes

  • auto: Automatically selects ASCII for tables with visible borders, tabs for others
  • ascii: ASCII box drawing characters (+, -, |)
  • tabs: Tab-separated columns
  • plain: Space-separated columns

Example ASCII table output:

+----------+----------+
| Header 1 | Header 2 |
+----------+----------+
| Cell 1   | Cell 2   |
+----------+----------+
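To make the difference between the modes concrete, here is an illustrative sketch of how rows of cell text could be rendered in the ascii, tabs, and plain styles. It is not the library's implementation: `render_table` is a hypothetical helper, and it assumes every row has the same number of cells.

```python
def render_table(rows: list, mode: str = "ascii") -> str:
    """Render rows of cell text as 'ascii', 'tabs', or 'plain'."""
    if mode == "tabs":
        return "\n".join("\t".join(row) for row in rows)
    if mode == "plain":
        return "\n".join(" ".join(row) for row in rows)
    if mode != "ascii":
        raise ValueError(f"unknown mode: {mode}")

    # Each column is as wide as its widest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    rule = "+" + "+".join("-" * (width + 2) for width in widths) + "+"

    lines = [rule]
    for row in rows:
        cells = " | ".join(cell.ljust(width) for cell, width in zip(row, widths))
        lines.append(f"| {cells} |")
        lines.append(rule)  # horizontal rule after every row
    return "\n".join(lines)


rows = [["Header 1", "Header 2"], ["Cell 1", "Cell 2"]]
print(render_table(rows, "ascii"))
print(render_table(rows, "tabs"))
```

With the sample rows, the ascii branch prints the same layout as the example output above.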

Markdown Formatting

With text_formatting="markdown", inline formatting such as bold and italic is emitted as Markdown syntax:

from docx_parser_converter import ConversionConfig, docx_to_text

config = ConversionConfig(text_formatting="markdown")
text = docx_to_text("document.docx", config=config)

# Output: "This is **bold** and *italic* text."
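The mapping from formatted runs to Markdown markers can be sketched as follows. This is only an illustration of the idea: the `Run` dataclass and `run_to_markdown` are hypothetical and do not reflect the library's internal models.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical stand-in for a run of text with formatting flags."""
    text: str
    bold: bool = False
    italic: bool = False


def run_to_markdown(run: Run) -> str:
    """Wrap the run's text in Markdown emphasis markers."""
    text = run.text
    if run.italic:
        text = f"*{text}*"
    if run.bold:
        text = f"**{text}**"
    return text


runs = [
    Run("This is "),
    Run("bold", bold=True),
    Run(" and "),
    Run("italic", italic=True),
    Run(" text."),
]
print("".join(run_to_markdown(run) for run in runs))
```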

Input Types

The library accepts multiple input types:

from io import BytesIO
from pathlib import Path

from docx_parser_converter import docx_to_html, docx_to_text

# File path as string
html = docx_to_html("document.docx")

# File path as Path object
html = docx_to_html(Path("document.docx"))

# Bytes content
with open("document.docx", "rb") as f:
    content = f.read()
html = docx_to_html(content)

# File-like object
with open("document.docx", "rb") as f:
    html = docx_to_html(f)

# None returns empty output
html = docx_to_html(None)  # Returns empty HTML document
text = docx_to_text(None)  # Returns ""
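If your own code wraps the converter, it may need to accept the same range of inputs. The sketch below shows one way to normalize them to raw bytes. `read_docx_bytes` and the `DocxInput` alias are hypothetical names introduced for illustration; the library's actual dispatch logic may differ.

```python
from io import BytesIO
from pathlib import Path
from typing import BinaryIO, Optional, Union

# Hypothetical alias covering the input kinds listed above.
DocxInput = Union[str, Path, bytes, BinaryIO, None]


def read_docx_bytes(source: DocxInput) -> Optional[bytes]:
    """Return raw bytes for any supported input, or None when given None."""
    if source is None:
        return None
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    if hasattr(source, "read"):
        return source.read()  # file-like object opened in binary mode
    raise TypeError(f"unsupported input type: {type(source).__name__}")


print(read_docx_bytes(BytesIO(b"PK\x03\x04")))
print(read_docx_bytes(None))
```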

Supported DOCX Elements

Text Formatting

  • Bold, italic, underline, strikethrough
  • Subscript, superscript
  • Highlight colors
  • Font family, size, and color
  • All caps, small caps
  • Various underline styles (single, double, dotted, dashed, wave, etc.) with color support

Paragraph Formatting

  • Alignment (left, center, right, justify)
  • Indentation (left, right, first line, hanging)
  • Spacing (before, after, line spacing)
  • Borders and shading
  • Keep with next, keep lines together, page break before

Lists and Numbering

  • Bullet lists
  • Numbered lists (decimal, roman, letters, ordinal)
  • Multi-level lists with various formats
  • List restart and override support
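As an illustration of what the numbering formats mean, the sketch below turns a list counter into its label. The format names follow the OOXML `numFmt` values (`decimal`, `lowerRoman`, `lowerLetter`, `ordinal`, and so on); the helper functions are hypothetical, and the letter scheme (z is followed by aa, bb, ...) is an assumption based on how Word labels lettered lists.

```python
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
    (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
    (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert a positive integer to an upper-case Roman numeral."""
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def to_letters(n: int) -> str:
    """1 -> a, 26 -> z, 27 -> aa, 28 -> bb (assumed Word-style repetition)."""
    index, repeat = (n - 1) % 26, (n - 1) // 26 + 1
    return chr(ord("a") + index) * repeat


def to_ordinal(n: int) -> str:
    """1 -> 1st, 2 -> 2nd, 11 -> 11th, 22 -> 22nd."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def format_number(n: int, fmt: str) -> str:
    """Format a list counter according to a numFmt-style name."""
    formatters = {
        "decimal": str,
        "upperRoman": to_roman,
        "lowerRoman": lambda value: to_roman(value).lower(),
        "lowerLetter": to_letters,
        "upperLetter": lambda value: to_letters(value).upper(),
        "ordinal": to_ordinal,
    }
    return formatters[fmt](n)


for fmt in ("decimal", "lowerRoman", "lowerLetter", "ordinal"):
    print(fmt, [format_number(n, fmt) for n in (1, 2, 3, 4)])
```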

Tables

  • Simple and complex tables
  • Cell merging (horizontal and vertical)
  • Full border support (outer borders, inside grid lines, per-cell borders)
  • Cell-level border overrides (tcBorders override tblBorders)
  • Cell shading and backgrounds
  • Column widths and table alignment

Other Elements

  • Hyperlinks (external URLs resolved from relationships)
  • Line breaks and page breaks
  • Tab characters
  • Special characters (soft hyphen, non-breaking hyphen)

Error Handling

The library provides specific exceptions for different error cases:

from docx_parser_converter import docx_to_html

try:
    html = docx_to_html("document.docx")
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid DOCX: {e}")
except Exception as e:
    print(f"Error: {e}")
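Some failures can be anticipated before calling the converter. The sketch below is a pre-flight check built only on the standard library, not part of this package: `check_docx` is a hypothetical helper, and it assumes the main document part lives at `word/document.xml`, which is the conventional location.

```python
import zipfile
from io import BytesIO

# Signature of OLE compound files. Password-protected DOCX files and
# legacy .doc files use this container instead of a plain ZIP archive.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def check_docx(data: bytes) -> str:
    """Classify raw bytes as 'ok', 'encrypted-or-legacy', or 'invalid'."""
    if data.startswith(OLE_MAGIC):
        return "encrypted-or-legacy"
    if not zipfile.is_zipfile(BytesIO(data)):
        return "invalid"
    with zipfile.ZipFile(BytesIO(data)) as archive:
        has_document = "word/document.xml" in archive.namelist()
    return "ok" if has_document else "invalid"


# Build a tiny in-memory archive that looks like a DOCX package.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")

print(check_docx(buffer.getvalue()))
print(check_docx(b"not a zip"))
```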

Image Format Support

Images are extracted from DOCX files and embedded in HTML as base64 data URLs. Browser rendering support varies by format:

Format  Extensions      Browser Support
PNG     .png            ✅ Full
JPEG    .jpg, .jpeg     ✅ Full
GIF     .gif            ✅ Full (including animation)
WebP    .webp           ✅ Full
SVG     .svg            ✅ Full
BMP     .bmp            ✅ Full
TIFF    .tif, .tiff     ⚠️ Safari only
EMF     .emf            ❌ Not supported
WMF     .wmf            ❌ Not supported

Notes:

  • TIFF images will only display in Safari; other browsers will show a broken image
  • EMF/WMF are Windows vector formats that browsers cannot render natively
  • Images in plain text output are skipped (no alt text placeholders)
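For reference, a base64 data URL has the form `data:<mime type>;base64,<payload>`. The sketch below builds one with the standard library. `to_data_url` and the `MIME_TYPES` mapping are hypothetical and only illustrate the format; they are not the library's code.

```python
import base64

# Hypothetical extension-to-MIME mapping for the renderable formats above.
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".svg": "image/svg+xml",
    ".bmp": "image/bmp",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def to_data_url(image_bytes: bytes, extension: str) -> str:
    """Encode image bytes as a base64 data URL for an <img src> attribute."""
    mime = MIME_TYPES.get(extension.lower())
    if mime is None:
        raise ValueError(f"no MIME type known for {extension!r}")
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{payload}"


print(to_data_url(b"GIF89a", ".gif"))  # data:image/gif;base64,R0lGODlh
```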

Known Limitations

Not Currently Supported

  • Headers and footers: Document headers/footers are not included
  • Footnotes and endnotes: These are not extracted
  • Comments and track changes: Revision marks are not processed
  • OLE objects: Embedded Excel charts, etc. are not supported
  • Text boxes: Floating text boxes and shapes are not extracted
  • Complex field codes: Most field codes besides hyperlinks
  • RTL/BiDi text: Right-to-left text may not render correctly
  • Password-protected files: Encrypted documents cannot be opened

Partial Support

  • Styles: Style inheritance works but complex conditional formatting is limited
  • Themes: Theme colors and fonts are not resolved
  • Custom XML: Custom document properties are not extracted
  • Sections: Section properties (columns, page size) affect content but aren't fully rendered

Development

Setup

# Clone the repository
git clone https://github.com/omer-go/docx-parser-converter.git
cd docx-parser-converter/docx_parser_converter_python

# Install PDM (if not already installed)
pip install pdm

# Install dependencies
pdm install

# Install dev dependencies
pdm install -G dev

Running Tests

# Run all tests
pdm run pytest

# Run with coverage
pdm run pytest --cov

# Run specific test file
pdm run pytest tests/unit/test_api.py

Type Checking

pdm run pyright

Linting

pdm run ruff check .
pdm run ruff format .

Project Structure

docx_parser_converter_python/
├── api.py              # Public API (docx_to_html, docx_to_text, ConversionConfig)
├── core/               # Core utilities
│   ├── docx_reader.py  # DOCX file opening and validation
│   ├── xml_extractor.py # XML content extraction
│   ├── constants.py    # XML namespaces and paths
│   └── exceptions.py   # Custom exceptions
├── models/             # Pydantic models
│   ├── common/         # Shared models (Color, Border, Spacing, etc.)
│   ├── document/       # Document models (Paragraph, Run, Table, etc.)
│   ├── numbering/      # Numbering definitions
│   └── styles/         # Style definitions
├── parsers/            # XML to Pydantic conversion
│   ├── common/         # Common element parsers
│   ├── document/       # Document element parsers
│   ├── numbering/      # Numbering parsers
│   └── styles/         # Style parsers
├── converters/         # Model to output conversion
│   ├── common/         # Style resolution, numbering tracking
│   ├── html/           # HTML conversion
│   └── text/           # Text conversion
└── tests/              # Test suite
    ├── unit/           # Unit tests
    ├── integration/    # Integration tests
    └── fixtures/       # Test DOCX files

Architecture

The library follows a three-phase conversion process:

  1. Parse: DOCX XML → Pydantic models
      • Open and validate DOCX file
      • Extract document.xml, styles.xml, numbering.xml
      • Parse XML to strongly-typed Pydantic models
  2. Resolve: Apply style inheritance
      • Merge document defaults → style chain → direct formatting
      • Track numbering counters for lists
  3. Convert: Models → Output format
      • HTML: Generate semantic HTML with CSS
      • Text: Extract plain text with optional Markdown
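The merge order in the Resolve phase (document defaults, then the style chain, then direct formatting) can be sketched with plain dictionaries. This is an illustration under assumptions: the data layout, the `based_on` key, and `resolve_properties` are hypothetical, whereas the library works on Pydantic models.

```python
from typing import Optional


def resolve_properties(
    defaults: dict, styles: dict, style_id: Optional[str], direct: dict
) -> dict:
    """Merge defaults -> style chain (base style first) -> direct formatting."""
    # Walk from the applied style up to its root, guarding against cycles.
    chain = []
    seen = set()
    current = style_id
    while current is not None and current not in seen:
        seen.add(current)
        style = styles[current]
        chain.append(style["properties"])
        current = style.get("based_on")

    resolved = dict(defaults)
    for properties in reversed(chain):  # root style first, applied style last
        resolved.update(properties)
    resolved.update(direct)  # direct formatting always wins
    return resolved


styles = {
    "Normal": {"based_on": None, "properties": {"font": "Calibri", "size": 11}},
    "Heading1": {"based_on": "Normal", "properties": {"size": 16, "bold": True}},
}
defaults = {"font": "Times New Roman", "size": 10, "bold": False}

print(resolve_properties(defaults, styles, "Heading1", {"italic": True}))
```

Later sources override earlier ones key by key, so a heading inherits its font from the base style while keeping its own size and weight.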

License

MIT License

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Keywords

docx
