
WizardDocx

WizardDocx is a Python library focused on text extraction from Microsoft Word documents.
It parses Word documents natively and can apply local OCR with Tesseract to embedded images or scanned pages inside .docx files.
Legacy .doc files are supported in read-only mode, without OCR.
Installation
Requires Python 3.9+.
pip install wizarddocx
For OCR capabilities, ensure you have Tesseract installed on your system.
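Before enabling OCR, it can help to confirm that the Tesseract executable is reachable. The following is a minimal sketch, assuming the binary is named `tesseract` and is expected on the PATH; the helper `tesseract_available` is hypothetical and not part of WizardDocx.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH.")
    else:
        print("Tesseract not found; install it before passing ocr=True.")
```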
Quick start
import wizarddocx as wd
text = wd.extract_text("example.docx")
print(text)
Parameters
input_data: The document to read, given as a str path, bytes, or Path.
extension: The file extension, required only if input_data is bytes.
pages: Page selection for .docx.
• Examples: 1, "1-3", [1, 3, "5-8"]
ocr: Enables OCR using Tesseract. Applies to DOCX files, including image-based ones; not available for DOC.
language_ocr: Language code for OCR. Defaults to 'eng'.
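The pages parameter accepts integers, range strings, and lists that mix the two. The sketch below shows one way such a specification could be expanded into a sorted list of page numbers. It is illustrative only: `expand_pages` is a hypothetical helper, page numbers are assumed to be 1-based, and WizardDocx's own parsing may differ (for example in how it handles duplicates or invalid ranges).

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand 1, "1-3", or [1, 3, "5-8"] into sorted, unique page numbers."""
    # Wrap a single int or string so every input is handled as a list.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # "5-8" splits into ("5", "-", "8"); "3" splits into ("3", "", "").
        start_text, sep, end_text = item.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start > end:
            raise ValueError(f"invalid page range: {item!r}")
        pages.update(range(start, end + 1))
    # Assumption: pages are numbered from 1, as in the examples above.
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)
```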
Examples
Basic:
import wizarddocx as wd
txt = wd.extract_text("docs/report.docx")
From bytes:
from pathlib import Path
import wizarddocx as wd
raw = Path("img.docx").read_bytes()
txt_img = wd.extract_text(raw, extension="docx")
Paged selection and OCR:
import wizarddocx as wd
sel = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
ocr_txt = wd.extract_text("scan.docx", ocr=True, language_ocr="ita")
Supported Formats
| Format | OCR |
| --- | --- |
| DOC | Not available |
| DOCX | Optional |
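The OCR support listed above can be mirrored in calling code so that OCR is only requested where it is available. This is a minimal sketch, assuming the format is determined from the file extension; `OCR_SUPPORT` and `ocr_allowed` are hypothetical names, not part of WizardDocx.

```python
from pathlib import Path

# Mirrors the supported formats: OCR is optional for DOCX, not available for DOC.
OCR_SUPPORT = {".docx": True, ".doc": False}


def ocr_allowed(path: str) -> bool:
    """Return True if OCR may be requested for this file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in OCR_SUPPORT:
        raise ValueError(f"unsupported format: {suffix or path!r}")
    return OCR_SUPPORT[suffix]
```

The result could then be passed through the ocr parameter, for example wd.extract_text(path, ocr=ocr_allowed(path)).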
License
AGPL-3.0-or-later.
Resources
Contact & Author
Author: Mattia Rubino
Email: textwizard.dev@gmail.com