
wizarddocx
Text extraction from Microsoft Word files. Parses Word documents natively and can optionally run local OCR with Tesseract for embedded images or scanned pages. Supports page selection and bytes input. Legacy .doc is read-only and OCR is not available.

WizardDocx is a Python library focused on text extraction from Microsoft Word documents.
It parses Word documents natively and can apply local OCR with Tesseract to embedded images or scanned pages inside .docx files.
Legacy .doc is supported in read-only mode without OCR.
Requires Python 3.9+.
pip install wizarddocx
For OCR capabilities, ensure you have Tesseract installed on your system.
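Before enabling OCR, it can help to confirm that Tesseract is actually reachable. The following is a minimal sketch, assuming the Tesseract executable is named `tesseract` and is looked up on the system PATH; the README does not say how wizarddocx locates it, and `tesseract_available` is a hypothetical helper, not part of the library.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with the given name is found on PATH.

    Assumption: Tesseract is installed under the name "tesseract" and is
    discovered through PATH. This helper is illustrative only.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found; OCR can be enabled.")
    else:
        print("Tesseract not found; install it before passing ocr=True.")
```

A check like this makes a missing system dependency visible up front, before any OCR extraction is attempted.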
import wizarddocx as wd
text = wd.extract_text("example.docx")
print(text)
- input_data: [str, bytes, Path]
- extension: The file extension, required only if input_data is bytes.
- pages: Page selection for .docx, for example 1, "1-3", or [1, 3, "5-8"].
- ocr: Enables OCR using Tesseract. Applies to DOCX, including image-based files; not available for DOC.
- language_ocr: Language code for OCR. Defaults to 'eng'.

Basic:
import wizarddocx as wd
txt = wd.extract_text("docs/report.docx")
From bytes:
from pathlib import Path
import wizarddocx as wd
raw = Path("img.docx").read_bytes()
txt_img = wd.extract_text(raw, extension="docx")
Page selection and OCR:
import wizarddocx as wd
sel = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
ocr_txt = wd.extract_text("scan.docx", ocr=True, language_ocr="ita")
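To make the accepted forms of `pages` concrete, here is a minimal sketch of how a selection such as `[1, 3, "5-7"]` could be expanded into individual page numbers. It assumes pages are 1-based and that ranges like `"5-7"` are inclusive; the README does not confirm either point, and `expand_pages` is a hypothetical helper, not wizarddocx's own parser.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand a page selection into a sorted list of unique page numbers.

    Assumption: ranges such as "5-7" are inclusive on both ends.
    """
    # Wrap a single int or string so every form is handled as a sequence.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # Split "5-7" into ("5", "-", "7"); a plain "3" yields ("3", "", "").
        start, sep, end = item.partition("-")
        first = int(start)
        last = int(end) if sep else first
        if first > last:
            raise ValueError(f"invalid page range: {item!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


print(expand_pages([1, 3, "5-7"]))  # [1, 3, 5, 6, 7]
```

Under these assumptions, `1`, `"1-3"`, and `[1, 3, "5-8"]` all reduce to a plain list of page numbers.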
| Format | OCR Option |
|---|---|
| DOC | Not available |
| DOCX | Optional |
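The table above can be turned into a small guard that only requests OCR where it is available. This is a sketch under stated assumptions: `ocr_supported` and `extract_with_optional_ocr` are hypothetical helpers, and it assumes `extract_text` accepts `ocr=False` explicitly, which the README implies but does not state.

```python
from pathlib import Path

# Mirrors the table above: OCR is optional for DOCX, not available for DOC.
OCR_SUPPORT = {".docx": True, ".doc": False}


def ocr_supported(path: str) -> bool:
    """Return True if OCR may be requested for this file, judged by extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in OCR_SUPPORT:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return OCR_SUPPORT[suffix]


def extract_with_optional_ocr(path: str, want_ocr: bool = True, language: str = "eng") -> str:
    """Extract text, enabling OCR only when the format allows it.

    Assumption: wd.extract_text accepts ocr=False as an explicit value.
    """
    import wizarddocx as wd  # imported lazily so the helper above works on its own

    use_ocr = want_ocr and ocr_supported(path)
    return wd.extract_text(path, ocr=use_ocr, language_ocr=language)
```

With this guard, a call on a legacy `.doc` file falls back to plain extraction instead of requesting OCR that is not available.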
Author: Mattia Rubino
Email: textwizard.dev@gmail.com