
docx-parser-converter
A powerful library for converting DOCX documents into HTML and plain text, with detailed parsing of document properties and styles.
Welcome to the Docx Parser and Converter project! This library allows you to easily convert DOCX documents into HTML and plain text formats, extracting detailed properties and styles using Pydantic models.
The project is structured to parse DOCX files, convert their content into structured data using Pydantic models, and provide conversion utilities to transform this data into HTML or plain text.
To install the library, use pip:
pip install docx-parser-converter
To start using the library, import the necessary modules:
from docx_parser_converter.docx_to_html.docx_to_html_converter import DocxToHtmlConverter
from docx_parser_converter.docx_to_txt.docx_to_txt_converter import DocxToTxtConverter
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
Convert to HTML:
from docx_parser_converter.docx_to_html.docx_to_html_converter import DocxToHtmlConverter
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
docx_path = "path_to_your_docx_file.docx"
html_output_path = "output.html"
docx_file_content = read_binary_from_file_path(docx_path)
converter = DocxToHtmlConverter(docx_file_content, use_default_values=True)
html_output = converter.convert_to_html()
converter.save_html_to_file(html_output, html_output_path)
Convert to Plain Text:
from docx_parser_converter.docx_to_txt.docx_to_txt_converter import DocxToTxtConverter
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
docx_path = "path_to_your_docx_file.docx"
txt_output_path = "output.txt"
docx_file_content = read_binary_from_file_path(docx_path)
converter = DocxToTxtConverter(docx_file_content, use_default_values=True)
txt_output = converter.convert_to_txt(indent=True)
converter.save_txt_to_file(txt_output, txt_output_path)
The Docx Parser and Converter library supports parsing various XML components within a DOCX file. Below is a detailed list of the supported and unsupported components:
document.xml:
numbering.xml:
styles.xml:
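The three parts named above are stored inside the DOCX zip container under the word/ folder. As a minimal sketch of what reading them involves, using only the Python standard library and not this library's own parser classes, the following loads whichever parts are present and pulls the text out of each paragraph. The function names read_docx_parts and paragraph_texts are hypothetical and exist only for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace shared by document.xml, numbering.xml and styles.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

PART_NAMES = ("word/document.xml", "word/numbering.xml", "word/styles.xml")


def read_docx_parts(docx_bytes: bytes) -> dict:
    """Return {part name: parsed root element} for the parts found in the archive."""
    parts = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        available = set(archive.namelist())
        for name in PART_NAMES:
            # numbering.xml is optional, so skip parts that are missing.
            if name in available:
                parts[name] = ET.fromstring(archive.read(name))
    return parts


def paragraph_texts(document_root) -> list:
    """Collect the text of every w:p element by joining its w:t children."""
    texts = []
    for paragraph in document_root.iter(f"{{{W_NS}}}p"):
        runs = paragraph.iter(f"{{{W_NS}}}t")
        texts.append("".join(t.text or "" for t in runs))
    return texts
```

Calling read_docx_parts on the bytes returned by read_binary_from_file_path and passing the "word/document.xml" root to paragraph_texts gives a quick way to inspect what a file contains before converting it.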
The Docx Parser and Converter library follows a structured workflow to parse, convert, and merge document properties and styles according to DOCX specifications. Here’s a detailed overview of the technical process:
Parsing XML Files:
- The DocumentParser class reads and parses the document.xml file to extract the document structure, including paragraphs, tables, and runs. This data is converted into DocumentSchema Pydantic models.
- The NumberingParser class parses the numbering.xml file to extract numbering definitions and levels, converting them into NumberingSchema Pydantic models.
- The StylesParser class parses the styles.xml file to extract styles for paragraphs, runs, and tables, converting them into StylesSchema Pydantic models.
Property and Style Merging:
- Properties set directly in the DocumentSchema remain unchanged, while styles are applied based on the style_id if present.
- If no style_id is present, default styles from StyleDefaults are applied. Finally, any remaining null properties are filled with default_rpr and default_ppr from the StylesSchema.
- The merge_properties function merges properties by converting Pydantic models to dictionaries, adding only non-null properties, and reassigning them to the original models.
Conversion to HTML and TXT:
- The DocxToHtmlConverter class takes the parsed DocumentSchema and converts the document elements into HTML format. The result can be saved with the save_html_to_file method.
- The DocxToTxtConverter class converts the DocumentSchema into plain text format. The result can be saved with the save_txt_to_file method.
This process lets the library parse and convert DOCX documents while preserving the original document's structure and style as far as possible.
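The merge order described above (direct properties first, then the referenced style or the style defaults, then the document-wide defaults) can be sketched as follows. This is an illustration under stated assumptions, not the library's merge_properties implementation: it uses standard-library dataclasses as a stand-in for the Pydantic models, and RunProps, merge_non_null and resolve are hypothetical names.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class RunProps:
    """Hypothetical stand-in for a run-properties model; None means 'not set'."""
    bold: Optional[bool] = None
    color: Optional[str] = None
    size_pt: Optional[float] = None
    font: Optional[str] = None


def merge_non_null(base: RunProps, override: RunProps) -> RunProps:
    """Return a copy of base in which every non-None field of override wins."""
    updates = {
        f.name: getattr(override, f.name)
        for f in fields(override)
        if getattr(override, f.name) is not None
    }
    return replace(base, **updates)


def resolve(direct: RunProps, style: Optional[RunProps],
            style_default: RunProps, doc_default: RunProps) -> RunProps:
    """Layer properties from lowest to highest priority."""
    # Start from the document-wide defaults (the role default_rpr plays above).
    resolved = doc_default
    # Apply the referenced style, or the style defaults when no style is referenced.
    resolved = merge_non_null(resolved, style if style is not None else style_default)
    # Directly specified properties are applied last, so they are never overwritten.
    return merge_non_null(resolved, direct)
```

Merging from the lowest-priority layer upward means each layer only has to overwrite the fields it actually sets, which is the same "add only non-null properties" idea the text describes.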
| XML Element | HTML Element | Notes |
|---|---|---|
| w:p | p | Paragraph element |
| w:r | span | Run element, used for inline text formatting |
| w:tbl | table | Table element |
| w:tr | tr | Table row |
| w:tc | td | Table cell |
| w:tblGrid | colgroup | Table grid, converted to colgroup for column definitions |
| w:gridCol | col | Grid column, converted to col for column width |
| w:tblPr | table | Table properties |
| w:tblW | table style="width:X%;" | Table width, converted using CSS width property |
| w:tblBorders | table style="border:X;" | Table borders, converted using CSS border property |
| w:tblCellMar | td style="padding:Xpt;" | Table cell margins, converted using CSS padding property |
| w:tblCellSpacing | table style="border-spacing:Xpt;" | Cell spacing, converted using CSS border-spacing property |
| w:b | b | Bold text |
| w:i | i | Italic text |
| w:u | span style="text-decoration:underline;" | Underline text, converted using CSS text-decoration property |
| w:color | span style="color:#RRGGBB;" | Text color, converted using CSS color property |
| w:sz | span style="font-size:Xpt;" | Text size, converted using CSS font-size property (in points) |
| w:jc | p style="text-align:left/center/right/justify;" | Paragraph alignment, converted using CSS text-align property |
| w:ind | p style="margin-left:Xpt;" | Regular indent, converted using CSS margin-left property |
| w:ind | p style="text-indent:Xpt;" | Hanging/first-line indent, converted using CSS text-indent property |
| w:spacing | p style="line-height:X%;" | Line spacing, converted using CSS line-height property |
| w:highlight | span style="background-color:#RRGGBB;" | Text highlight, converted using CSS background-color property |
| w:shd | span style="background-color:#RRGGBB;" | Shading, converted using CSS background-color property |
| w:vertAlign | span style="vertical-align:super/sub;" | Vertical alignment, converted using CSS vertical-align property |
| w:pgMar | div style="padding: Xpt;" | Margins, converted using CSS padding property |
| w:rFonts | span style="font-family:'font-name';" | Font name, converted using CSS font-family property |
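To make a few rows of the table concrete, here is a minimal sketch of turning run-level properties into a span with inline CSS. It is not the converter's actual code. The dictionary keys (bold, italic, underline, color, half_points, font) and both function names are assumptions chosen for illustration. The one format detail it relies on is that w:sz stores the font size in half-points.

```python
from html import escape


def run_style_to_css(props: dict) -> str:
    """Build an inline style string from a dict of run properties (hypothetical keys)."""
    declarations = []
    if props.get("underline"):
        declarations.append("text-decoration:underline")
    if props.get("color"):
        declarations.append(f"color:#{props['color']}")
    if props.get("half_points") is not None:
        # w:sz is expressed in half-points, so 24 means 12pt.
        declarations.append(f"font-size:{props['half_points'] / 2:g}pt")
    if props.get("font"):
        # Font names are assumed not to contain quote characters.
        declarations.append(f"font-family:'{props['font']}'")
    return "".join(d + ";" for d in declarations)


def run_to_html(text: str, props: dict) -> str:
    """Wrap escaped run text in i/b tags and a span, following the table above."""
    inner = escape(text)
    if props.get("italic"):
        inner = f"<i>{inner}</i>"
    if props.get("bold"):
        inner = f"<b>{inner}</b>"
    style = run_style_to_css(props)
    if style:
        return f'<span style="{style}">{inner}</span>'
    return f"<span>{inner}</span>"
```

For example, run_to_html("Hi", {"underline": True, "half_points": 21}) produces a span whose style contains text-decoration:underline and font-size:10.5pt.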



For detailed API documentation, please visit our Read the Docs page.
Enjoy using Docx Parser and Converter! 🚀✨