wizarddocx

Text extraction from Microsoft Word files. Parses Word documents natively and can optionally run local OCR with Tesseract on embedded images or scanned pages. Supports page selection and bytes input. Legacy .doc files are read-only, and OCR is not available for them.

Source: PyPI (pip)
Version: 1.0.0
Maintainers: 1

Wizard Docx

WizardDocx is a Python library focused on text extraction from Microsoft Word documents.
It parses Word documents natively and can apply local OCR with Tesseract to embedded images or scanned pages inside .docx files.
Legacy .doc files are supported in read-only mode, without OCR.

Installation

Requires Python 3.9+.

pip install wizarddocx

For OCR capabilities, ensure you have Tesseract installed on your system.
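
The Tesseract requirement can be checked before any OCR call is made. The following is a minimal sketch, assuming the `tesseract` executable is expected on the system PATH (the README does not say how wizarddocx locates it). `find_tesseract` is a hypothetical helper and not part of wizarddocx.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable, or None if it is not on PATH.

    Hypothetical helper: wizarddocx may locate Tesseract differently.
    """
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found; install it before passing ocr=True.")
    else:
        print(f"Tesseract found at {path}")
```

Running the check once at startup gives a clear message early, rather than a failure in the middle of an extraction.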

Quick start

import wizarddocx as wd

text = wd.extract_text("example.docx")
print(text)

Text extraction

Parameters

  • input_data: the input document, given as str, bytes, or Path.
  • extension: The file extension, required only if input_data is bytes.
  • pages: page selection for .docx.
    • Examples: 1, "1-3", [1, 3, "5-8"]
  • ocr: Enables OCR using Tesseract. Applies to DOCX and image-based files, not to DOC.
  • language_ocr: Language code for OCR. Defaults to 'eng'.
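
To make the accepted shapes of `pages` concrete, here is a sketch of how such a selection could be expanded into explicit page numbers. It assumes pages are 1-based and that a range such as "5-8" is inclusive, neither of which the README states. `normalize_pages` is a hypothetical helper for illustration and not how wizarddocx necessarily handles the argument.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def normalize_pages(spec: PageSpec) -> List[int]:
    """Expand a page selection into a sorted list of unique page numbers.

    Assumes 1-based numbering and inclusive ranges (an assumption, not
    documented wizarddocx behaviour).
    """
    # A bare int or str is treated as a one-item selection.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            start = end = item
        else:
            # A string is either a single page ("4") or a range ("5-8").
            first, _, last = item.strip().partition("-")
            start = int(first)
            end = int(last) if last else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page selection: {item!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

With these assumptions, `normalize_pages([1, 3, "5-8"])` yields `[1, 3, 5, 6, 7, 8]`, and overlapping entries are collapsed into a single occurrence.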

Examples

Basic:

import wizarddocx as wd

txt = wd.extract_text("docs/report.docx")

From bytes:

from pathlib import Path
import wizarddocx as wd

raw = Path("img.docx").read_bytes()
txt_img = wd.extract_text(raw, extension="docx")

Page selection and OCR:

import wizarddocx as wd

# Extract only pages 1, 3, and the range 5-7.
sel = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
# Run OCR using the Italian Tesseract language data.
ocr_txt = wd.extract_text("scan.docx", ocr=True, language_ocr="ita")

Supported Formats

Format    OCR Option
DOC       Not available
DOCX      Optional
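
When processing a mix of .doc and .docx files, the table above can be encoded once so that OCR is only requested where it is available. This is a sketch under assumptions: the README does not say what happens when `ocr=True` is passed for a .doc file, so the hypothetical helper `extract_kwargs` simply leaves the option out in that case. Only the `ocr` and `language_ocr` parameter names come from the README.

```python
from pathlib import Path
from typing import Any, Dict, Union

# Mirrors the Supported Formats table: OCR is optional for DOCX, unavailable for DOC.
OCR_AVAILABLE = {"docx": True, "doc": False}


def extract_kwargs(
    path: Union[str, Path], want_ocr: bool, language: str = "eng"
) -> Dict[str, Any]:
    """Build keyword arguments for wd.extract_text (hypothetical helper).

    OCR options are included only when requested and available for the format.
    """
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in OCR_AVAILABLE:
        raise ValueError(f"unsupported format: {ext!r}")
    if want_ocr and OCR_AVAILABLE[ext]:
        return {"ocr": True, "language_ocr": language}
    return {}
```

A call would then look like `wd.extract_text(path, **extract_kwargs(path, want_ocr=True))`, which falls back to native parsing for legacy .doc input.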

License

AGPL-3.0-or-later.

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com

Keywords

docx
