wizarddocx

Text extraction from Microsoft Word files. Parses Word documents natively and can optionally run local OCR with Tesseract on embedded images or scanned pages. Supports page selection and bytes input. Legacy .doc files are read-only, and OCR is not available for them.

Source

PyPI

Version: 1.0.0

Maintainers: 1

Wizard Docx

WizardDocx is a Python library focused on text extraction from Microsoft Word documents.
It parses Word documents natively and can apply local OCR with Tesseract to embedded images or scanned pages inside .docx files.
Legacy .doc is supported in read-only mode without OCR.

Installation

Requires Python 3.9+.

pip install wizarddocx

For OCR capabilities, ensure you have Tesseract installed on your system.
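Before enabling OCR, it can help to confirm that a Tesseract executable is reachable. The sketch below is not part of wizarddocx: `tesseract_available` is a hypothetical helper, and it assumes the binary is named `tesseract` and is located through PATH, which may not match how the library finds it.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable called `binary` is found on PATH.

    Hypothetical helper: the binary name is an assumption, not something
    the wizarddocx documentation specifies.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH.")
    else:
        print("Tesseract not found on PATH; install it before using ocr=True.")
```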

Quick start

import wizarddocx as wd

text = wd.extract_text("example.docx")
print(text)

Text extraction

Parameters

input_data: The document to read, as a file path (str or Path) or the raw file contents (bytes).
extension: The file extension, required only if input_data is bytes.
pages: page selection for .docx.
• Examples: 1, "1-3", [1, 3, "5-8"]
ocr: Enables OCR using Tesseract. Applies to DOCX files, including image-based ones; not available for DOC.
language_ocr: Language code for OCR. Defaults to 'eng'.
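To make the accepted shapes of `pages` concrete, here is a minimal sketch of how such a selection could be expanded into individual page numbers. `expand_pages` is a hypothetical helper, not the library's own parser, and it assumes pages are numbered from 1 and that ranges such as "5-8" include both ends.

```python
def expand_pages(spec):
    """Expand 1, "1-3" or [1, 3, "5-8"] into a sorted list of page numbers.

    Assumptions (not confirmed by the wizarddocx docs): pages are 1-based
    and "a-b" ranges are inclusive on both ends.
    """
    # Treat a single int or string as a one-item selection.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # "5-8" -> ("5", "-", "8"); "4" -> ("4", "", "")
        start, sep, end = item.partition("-")
        first = int(start)
        last = int(end) if sep else first
        if last < first:
            raise ValueError(f"descending range: {item!r}")
        pages.update(range(first, last + 1))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

Under these assumptions, `expand_pages([1, 3, "5-8"])` yields pages 1, 3, 5, 6, 7 and 8.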

Examples

Basic:

import wizarddocx as wd

txt = wd.extract_text("docs/report.docx")

From bytes:

from pathlib import Path
import wizarddocx as wd

raw = Path("img.docx").read_bytes()
txt_img = wd.extract_text(raw, extension="docx")
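When the bytes come from an upload or a database and the extension is unknown, the value for `extension` can often be guessed from the file signature. The sketch below is a hypothetical helper, not part of wizarddocx. It relies on two general facts: .docx files are ZIP archives containing `word/document.xml`, and legacy .doc files use the OLE compound file container. Other legacy Office formats share that OLE signature, so the "doc" result is only a guess.

```python
import io
import zipfile
from typing import Optional

ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"    # OLE compound file header


def guess_extension(data: bytes) -> Optional[str]:
    """Return "docx", "doc", or None if the bytes match neither container."""
    if data.startswith(OLE_MAGIC):
        # Also matches legacy .xls/.ppt; treated as "doc" by assumption.
        return "doc"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                # A Word package stores its main body in this member.
                if "word/document.xml" in archive.namelist():
                    return "docx"
        except zipfile.BadZipFile:
            return None
    return None
```

A guessed value can then be passed on, for example `wd.extract_text(raw, extension=guess_extension(raw))`, after handling the `None` case.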

Paged selection and OCR:

import wizarddocx as wd

sel = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
ocr_txt = wd.extract_text("scan.docx", ocr=True, language_ocr="ita")

Supported Formats

Format	OCR Option
DOC	Not available
DOCX	Optional

License

AGPL-3.0-or-later.

RESOURCES

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com
