@omer-go/docx-parser-converter-ts

A TypeScript library to convert DOCX files to WYSIWYG HTML or plain text formats while preserving styles.

Version 0.0.2 (latest)
Docx Parser and Converter (TypeScript/JavaScript) 📄✨

A powerful TypeScript library for converting DOCX documents into HTML and plain text, with detailed parsing of document properties and styles. This project is based on a Python version.

Introduction 🌟

Welcome to the Docx Parser and Converter for TypeScript/JavaScript! This library allows you to easily convert DOCX documents into HTML and plain text formats, extracting detailed properties and styles.

Project Overview 🛠️

The project is structured to parse DOCX files, convert their content into structured data models, and provide conversion utilities to transform this data into HTML or plain text.

Key Features 🌟

  • Convert DOCX documents to HTML or plain text.
  • Parse and extract detailed document properties and styles.
  • Structured data representation for easy manipulation.

⚠️ Important Note on Environment Compatibility

The current version (0.0.1) of this package is primarily designed and tested for browser environments.

Work on full Node.js compatibility is ongoing. Until then, running this version in Node.js may fail with errors such as "document is not defined" or "Buffer is not defined", because some underlying dependencies and utility functions currently rely on browser-specific APIs.

For browser usage, the library should function as expected. Node.js support will be improved in future releases.
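
As a minimal sketch of how calling code might guard against the errors described above, the snippet below checks for browser globals before attempting a conversion. The helper name hasBrowserGlobals is hypothetical and not part of this library; it only tests for window and document, which is an assumption about what the browser-oriented build needs.

```typescript
// Hypothetical helper, not part of the library's API.
// Reports whether the globals a browser-oriented build typically relies on exist.
type GlobalLike = { document?: unknown; window?: unknown };

function hasBrowserGlobals(
  scope: GlobalLike = globalThis as unknown as GlobalLike
): boolean {
  return (
    typeof scope.document !== "undefined" &&
    typeof scope.window !== "undefined"
  );
}

// Usage: warn early with a clear message instead of hitting a ReferenceError later.
if (!hasBrowserGlobals()) {
  console.warn(
    "No browser globals found; DOCX conversion may fail in this environment."
  );
}
```

Passing the scope as a parameter keeps the check easy to exercise in tests without touching real globals.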

Installation 💾

To install the library, you can use npm or yarn:

npm install @omer-go/docx-parser-converter-ts
# or
yarn add @omer-go/docx-parser-converter-ts

Usage 🚀

Importing the Library

ES Modules (Recommended for modern browsers and bundlers):

import { DocxToHtmlConverter, DocxToTxtConverter } from '@omer-go/docx-parser-converter-ts';

UMD (for direct use in browsers via <script> tag): If you include the UMD build (dist/docx-parser-converter.umd.js), the library will be available on the global window.DocxParserConverter object:

<script src="path/to/node_modules/@omer-go/docx-parser-converter-ts/dist/docx-parser-converter.umd.js"></script>
<script>
  const { DocxToHtmlConverter, DocxToTxtConverter } = window.DocxParserConverter;
  // ... use them ...
</script>

Quick Start Guide (Browser) 📖

This example demonstrates usage with a file input in a browser.

  • HTML Setup:

    <input type="file" id="docxFile" accept=".docx" />
    <button onclick="handleConvert()">Convert</button>
    <div id="htmlOutput"></div>
    <pre id="textOutput"></pre>
    
  • JavaScript for Conversion:

    // Assuming you've imported or accessed the converters as shown above
    
    async function handleConvert() {
        const fileInput = document.getElementById('docxFile');
        const htmlOutputDiv = document.getElementById('htmlOutput');
        const textOutputPre = document.getElementById('textOutput');
    
        if (!fileInput.files || fileInput.files.length === 0) {
            alert('Please select a DOCX file.');
            return;
        }
        const file = fileInput.files[0];
    
        try {
            const arrayBuffer = await file.arrayBuffer(); // DOCX content as ArrayBuffer
    
            // Convert to HTML
            const htmlConverter = await DocxToHtmlConverter.create(arrayBuffer, { useDefaultValues: true });
            const htmlResult = htmlConverter.convertToHtml();
            htmlOutputDiv.innerHTML = htmlResult;
    
            // Convert to Plain Text
            const txtConverter = await DocxToTxtConverter.create(arrayBuffer, { useDefaultValues: true });
            const txtResult = txtConverter.convertToTxt({ indent: true });
            textOutputPre.textContent = txtResult;
    
        } catch (error) {
            console.error("Conversion error:", error);
            alert("Error during conversion: " + error.message);
        }
    }
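
Because a DOCX file is a zip archive, a cheap pre-check on the selected file can give users a clearer error than a failed conversion. The following is a sketch only: looksLikeZip is a hypothetical helper, not something the library exports, and it is a heuristic that merely tests for the standard ZIP local file header signature ("PK" followed by bytes 3 and 4) at the start of the buffer.

```typescript
// Hypothetical pre-check, not part of the library.
// A .docx is a ZIP archive; ZIP local file headers begin with "PK\x03\x04".
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04];

function looksLikeZip(data: ArrayBuffer): boolean {
  // Too short to even hold the signature.
  if (data.byteLength < ZIP_SIGNATURE.length) return false;
  const head = new Uint8Array(data, 0, ZIP_SIGNATURE.length);
  return ZIP_SIGNATURE.every((byte, i) => head[i] === byte);
}
```

In the handler above, this could be called on arrayBuffer right after file.arrayBuffer() resolves, returning early with a message when it yields false. Passing the check does not prove the file is a valid DOCX, only that it starts like a zip archive.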
    

Supported XML Parsing Types 📄

The Docx Parser and Converter library supports parsing various XML components within a DOCX file. Below is a detailed list of the supported and unsupported components:

Supported Components

  • document.xml:

    • Document Parsing: Parses the main document structure.
    • Paragraphs: Extracts paragraphs and their properties.
    • Runs: Extracts individual text runs within paragraphs.
    • Tables: Parses table structures and properties.
    • Table Rows: Extracts rows within tables.
    • Table Cells: Extracts cells within rows.
    • List Items: Handles both bulleted and numbered lists through paragraph properties.
  • numbering.xml:

    • Numbering Definitions: Parses numbering definitions and properties for lists.
    • Numbering Levels: Extracts different levels of numbering for nested lists.
  • styles.xml:

    • Paragraph Styles: Extracts styles applied to paragraphs.
    • Run Styles: Extracts styles applied to text runs.
    • Table Styles: Parses styles applied to tables and table elements.
    • Default Styles: Extracts default document styles for paragraphs, runs, and tables.

Unsupported Components (Current Version)

  • Images: Parsing and extraction of images embedded within the document.
  • Headers and Footers: Parsing of headers and footers content.
  • Footnotes and Endnotes: Handling footnotes and endnotes within the document.
  • Comments: Extraction and handling of comments.
  • Custom XML Parts: Any custom XML parts beyond the standard DOCX schema.
  • More complex OOXML features (e.g., complex fields, VML graphics, certain drawing elements).

General Code Flow 🔄

The Docx Parser and Converter library follows a structured workflow to parse, convert, and merge document properties and styles according to DOCX specifications. Here's a detailed overview of the technical process:

  • Parsing XML Files:

    • The library first unzips the DOCX file (which is a zip archive) and reads essential XML parts like word/document.xml, word/styles.xml, and word/numbering.xml.
    • Specialized parsers process these XML files:
      • DocumentParser extracts the main document structure (paragraphs, tables, runs) into structured models.
      • NumberingParser extracts numbering definitions and levels.
      • StylesParser extracts styles for paragraphs, runs, tables, and document defaults.
  • Property and Style Merging:

    • Hierarchical Style Application: Styles are applied to paragraphs and runs based on a defined hierarchy (direct formatting > character style > paragraph style > linked style > document defaults).
    • Default Style Application: If no specific style is applied, default styles are used.
    • Efficient Property Merging: Properties are merged efficiently to determine the final computed style for each element.
  • Conversion to HTML and TXT:

    • DOCX to HTML:
      • The DocxToHtmlConverter takes the parsed document models and converts the elements into HTML format.
      • Styles and properties are translated into equivalent HTML tags and inline CSS attributes.
      • WYSIWYG-like Support: The conversion aims to maintain the visual representation of the document, including numbering, margins, and indentations.
    • DOCX to TXT:
      • The DocxToTxtConverter converts the document models into plain text format.
      • Paragraphs, lists, and tables are transformed into a readable plain text representation.
      • Structure Preservation: The conversion attempts to preserve the document's structure, maintaining numbering and indentations for readability.

This process ensures accurate parsing and conversion while preserving the original document's structure and style as much as possible within the supported features.
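
The hierarchical style application described above can be illustrated with a small sketch. It is not the library's implementation: the RunProps shape and the mergeProps function are hypothetical, and the sketch only shows the general idea that higher-priority layers override lower ones while unset properties fall through.

```typescript
// Illustrative shape only; the library's real models may differ.
type RunProps = {
  bold?: boolean;
  italic?: boolean;
  color?: string;
  sizePt?: number;
};

// Layers are passed from lowest to highest priority.
// Later layers override earlier ones; undefined values never overwrite.
function mergeProps(...lowToHigh: Array<RunProps | undefined>): RunProps {
  const merged: Record<string, unknown> = {};
  for (const layer of lowToHigh) {
    if (!layer) continue; // a missing style layer is simply skipped
    const source = layer as Record<string, unknown>;
    for (const key of Object.keys(source)) {
      if (source[key] !== undefined) merged[key] = source[key];
    }
  }
  return merged as RunProps;
}

// Usage, following the order: document defaults < linked style <
// paragraph style < character style < direct formatting.
const docDefaults: RunProps = { sizePt: 11, color: "000000" };
const linkedStyle: RunProps | undefined = undefined;
const paragraphStyle: RunProps = { bold: true };
const characterStyle: RunProps = { italic: true, color: "FF0000" };
const directFormatting: RunProps = { bold: false };

const computed = mergeProps(
  docDefaults,
  linkedStyle,
  paragraphStyle,
  characterStyle,
  directFormatting
);
// computed has sizePt 11, color "FF0000", bold false, italic true
```

Direct formatting wins for bold, the character style wins for color, and the font size falls through from the document defaults because no higher layer sets it.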

Conversion Table of DOCX XML Elements to HTML

| XML Element | HTML Element | Notes |
| --- | --- | --- |
| w:p | p | Paragraph element |
| w:r | span | Run element, used for inline text formatting |
| w:tbl | table | Table element |
| w:tr | tr | Table row |
| w:tc | td | Table cell |
| w:tblGrid | colgroup | Table grid, converted to colgroup for column definitions |
| w:gridCol | col | Grid column, converted to col for column width |
| w:tblPr | table | Table properties |
| w:tblW | table style="width:Xpt;" | Table width, converted using CSS width property (approx.) |
| w:tblBorders | table, tr, td style="border:X;" | Table borders, converted using CSS border property |
| w:tblCellMar | td style="padding:Xpt;" | Table cell margins, converted using CSS padding property |
| w:b | b or strong or CSS font-weight | Bold text |
| w:i | i or em or CSS font-style | Italic text |
| w:u | span style="text-decoration:underline;" | Underline text, converted using CSS text-decoration property |
| w:color | span style="color:#RRGGBB;" | Text color, converted using CSS color property |
| w:sz | span style="font-size:Xpt;" | Text size, converted using CSS font-size property (in points) |
| w:jc | p style="text-align:left\|center\|right\|justify;" | Text alignment, converted using CSS text-align property |
| w:ind | p style="margin-left:Xpt; text-indent:Xpt;" | Indentation, converted using CSS margin and text-indent |
| w:spacing | p style="line-height:X; margin-top:Ypt; margin-bottom:Zpt;" | Line/paragraph spacing, converted using CSS properties |
| w:highlight | span style="background-color:#RRGGBB;" | Text highlight, converted using CSS background-color property |
| w:shd | span style="background-color:#RRGGBB;" | Shading, converted using CSS background-color property |
| w:vertAlign | span style="vertical-align:super\|sub;" | Vertical alignment (superscript/subscript) |
| w:pgMar | body/div style="padding: Xpt;" | Page margins, applied to a wrapper div or body |
| w:rFonts | span style="font-family:'font-name';" | Font name, converted using CSS font-family property |
| w:tab | span (with calculated width) | Tab characters, converted to spans with appropriate spacing |
| Numbering | ol, ul, li with CSS for styling | List items with various numbering/bullet styles |
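
To make the mapping above more concrete, here is a rough sketch of how a few run properties might be turned into an inline style string for a span. The RunFormatting shape and the runStyle function are hypothetical and are not the library's API. The unit handling reflects how WordprocessingML stores values: w:sz is expressed in half-points, and indentation values such as those in w:ind are expressed in twentieths of a point.

```typescript
// Unit helpers: WordprocessingML stores font sizes in half-points
// and indentation/spacing in twentieths of a point (twips).
const halfPointsToPt = (halfPoints: number): number => halfPoints / 2;
const twipsToPt = (twips: number): number => twips / 20;

// Hypothetical shape for illustration; the library's models may differ.
type RunFormatting = {
  bold?: boolean;
  italic?: boolean;
  underline?: boolean;
  color?: string; // hex RRGGBB, or "auto"
  szHalfPoints?: number; // raw w:sz value
};

// Builds the value of a span's style attribute from run formatting.
function runStyle(run: RunFormatting): string {
  const css: string[] = [];
  if (run.bold) css.push("font-weight:bold;");
  if (run.italic) css.push("font-style:italic;");
  if (run.underline) css.push("text-decoration:underline;");
  // "auto" means the consumer picks the color, so no CSS is emitted for it.
  if (run.color && run.color !== "auto") css.push(`color:#${run.color};`);
  if (run.szHalfPoints !== undefined) {
    css.push(`font-size:${halfPointsToPt(run.szHalfPoints)}pt;`);
  }
  return css.join("");
}

// Usage: a bold red run with w:sz of 24 (12pt).
const style = runStyle({ bold: true, color: "FF0000", szHalfPoints: 24 });
// style is "font-weight:bold;color:#FF0000;font-size:12pt;"

// A left indent of 720 twips corresponds to 36pt (half an inch).
const marginLeftPt = twipsToPt(720);
```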

API Reference 📜 (Coming Soon)

Detailed API documentation will be made available soon. For now, please refer to the exported classes and their methods:

  • DocxToHtmlConverter
    • static async create(docxFile: ArrayBuffer | Uint8Array | File | Blob, options?: DocxToHtmlOptions): Promise<DocxToHtmlConverter>
    • convertToHtml(): string
  • DocxToTxtConverter
    • static async create(docxFile: ArrayBuffer | Uint8Array | File | Blob, options?: DocxToTxtOptions): Promise<DocxToTxtConverter>
    • convertToTxt(options?: { indent?: boolean }): string

Interfaces for options (DocxToHtmlOptions, DocxToTxtOptions) are also exported.

Enjoy using Docx Parser and Converter! 🚀✨

Keywords

docx

Package last updated on 05 Jun 2025
