
pyxtxt
A Python library for extracting text from different types of files (PDF, DOCX, PPTX, XLSX, ODT, etc.).
PyxTxt is a simple and powerful Python library to extract text from various file formats.
It supports PDF, DOCX, XLSX, PPTX, ODT, HTML, XML, TXT, legacy Office files, audio/video transcription, OCR from images, and more.
NEW in v0.2.4: Added video transcription support! Now supports both audio and video files using Whisper.
Supported inputs include io.BytesIO buffers, raw bytes objects, and requests.Response objects.
File types are recognized with python-magic.
The library is modular, so you can install all modules:
pip install pyxtxt[all]
or just the modules you need:
pip install pyxtxt[pdf,docx,presentation,spreadsheet,html,markdown,epub,email]
# Audio transcription (~2GB download for Whisper models)
pip install pyxtxt[audio]
# Traditional OCR from images (~1GB download for EasyOCR models)
pip install pyxtxt[ocr]
# AI-powered OCR via Ollama (requires local Ollama + gemma3:4b model)
pip install pyxtxt[ocr-ollama]
# Both audio and traditional OCR
pip install pyxtxt[audio,ocr]
Because they rely on the same libraries, installing the html module also enables SVG and XML support. The architecture is designed to grow with new modules for additional formats.
The pyproject.toml file should select the correct python-magic version for your system. If you run into problems, install libmagic manually.
On Ubuntu/Debian:
sudo apt install libmagic1
On Mac (Homebrew):
brew install libmagic
On Windows:
Use python-magic-bin instead of python-magic for easier installation.
Dependencies are automatically installed based on selected optional groups.
Some extractors require system-level tools to be installed:
Legacy DOC files: antiword - Install via your package manager:
# Ubuntu/Debian
sudo apt install antiword
# macOS
brew install antiword
# CentOS/RHEL
sudo yum install antiword
Audio/Video transcription: ffmpeg - Required for audio preprocessing:
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html
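Before relying on the extractors that need these tools, it can help to confirm they are on the PATH. The snippet below is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper for illustration and is not part of PyxTxt.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # antiword and ffmpeg are the system tools named in this section.
    missing = missing_tools(["antiword", "ffmpeg"])
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system tools found.")
```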
from pyxtxt import xtxt
# Extract from file path
text = xtxt("document.pdf")
print(text)
# Extract from BytesIO buffer
import io
with open("document.docx", "rb") as f:
    buffer = io.BytesIO(f.read())
text = xtxt(buffer)
print(text)
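When several files need processing, the single-file call above can be wrapped in a small loop. The following is only a sketch: `extract_many` is a hypothetical helper, and because this README does not say how `xtxt` signals failure, the code assumes it may either raise or return an empty value and treats both as "no text". The extractor is passed in as a parameter, so the helper itself does not import PyxTxt.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Run `extract` on each path, mapping failed extractions to None."""
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            text = extract(path)
        except Exception:
            # Assumption: the extractor may raise on unreadable input.
            text = None
        # Treat empty output the same as a failed extraction.
        results[path] = text if text else None
    return results
```

With PyxTxt installed, a call might look like `extract_many(["document.pdf", "document.docx"], xtxt)`.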
import requests
from pyxtxt import xtxt, xtxt_from_url
# Method 1: Direct from bytes
response = requests.get("https://example.com/document.pdf")
text = xtxt(response.content)
# Method 2: Direct from Response object
text = xtxt(response)
# Method 3: URL helper function
text = xtxt_from_url("https://example.com/document.pdf")
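The requests calls above set no timeout and do not check the HTTP status, so an error page could end up being passed to the extractor. A hedged sketch of a stricter wrapper follows; `extract_from_url` is a hypothetical name, and the downloader and extractor are injected so the function works with `requests.get` and `xtxt` without importing either.

```python
from typing import Any, Callable, Optional


def extract_from_url(
    url: str,
    extract: Callable[[bytes], Optional[str]],
    get: Callable[..., Any],
    timeout: float = 30.0,
) -> Optional[str]:
    """Download `url`, fail on HTTP errors, then extract text from the body."""
    response = get(url, timeout=timeout)
    # requests.Response.raise_for_status() raises on 4xx/5xx responses.
    response.raise_for_status()
    return extract(response.content)
```

With both libraries installed, a call might look like `extract_from_url("https://example.com/document.pdf", xtxt, requests.get)`.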
from pyxtxt import xtxt
# Transcribe audio files
text = xtxt("meeting_recording.mp3")
text = xtxt("interview.wav")
text = xtxt("podcast.m4a")
# Transcribe video files (extracts audio)
text = xtxt("presentation.mp4")
text = xtxt("conference_video.mov")
text = xtxt("webinar.avi")
# From web audio/video
import requests
audio_response = requests.get("https://example.com/audio.mp3")
text = xtxt(audio_response.content)
video_response = requests.get("https://example.com/video.mp4")
text = xtxt(video_response.content)
from pyxtxt import xtxt
# Traditional OCR with EasyOCR (install with: pip install pyxtxt[ocr])
text = xtxt("scanned_document.png")
text = xtxt("screenshot.jpg")
text = xtxt("invoice.tiff")
# Extract EXIF metadata from photos (uses Pillow, already included)
from pyxtxt import xtxt_exif
exif_data = xtxt_exif("vacation_photo.jpg")
print(exif_data)
# Output: Camera make/model, GPS coordinates, shooting settings, datetime, etc.
# AI-powered OCR with Ollama (install with: pip install pyxtxt[ocr-ollama])
# Requires: ollama server running + gemma3:4b model
from pyxtxt import (
    xtxt, xtxt_image_describe,
    set_ollama_model, set_ollama_config, get_ollama_config
)
# Configure model (optional, default is gemma3:4b)
set_ollama_model("gemma3:12b") # or llava:7b, llava:13b, gemma3:27b
# Configure LLM parameters for better captions
set_ollama_config(
    language='italian',     # Language hint for captions
    caption_length='long',  # short, medium, long
    style='detailed',       # descriptive, technical, simple, detailed
    temperature=0.2,        # Creativity level (0.0-1.0)
    max_tokens=2000         # Maximum response length
)
# Extract only text (OCR mode)
text = xtxt("complex_document.png")
print(f"Extracted text: {text}")
# Extract text + detailed caption
full_analysis = xtxt_image_describe("scientific_diagram.png")
print(full_analysis)
# Output example:
# TEXT: Figura 2.1: Struttura molecolare del DNA
# DESCRIPTION: Diagramma scientifico dettagliato che mostra la doppia elica del DNA
# con nucleotidi colorati, legami idrogeno evidenziati e etichette in italiano per
# le basi azotate (adenina, timina, citosina, guanina).
# Check current configuration
config = get_ollama_config()
print(f"Current config: {config}")
# Reset to defaults if needed
from pyxtxt import reset_ollama_config
reset_ollama_config()
# From web images
import requests
image_response = requests.get("https://example.com/document.png")
text = xtxt(image_response.content)
⚠️ IMPORTANT: AI-powered OCR is experimental technology that may produce errors, hallucinations, or misinterpretations. Always validate results for critical applications.
The OCR-Ollama system includes confidence scoring to help identify unreliable results:
from pyxtxt import xtxt_image_with_confidence, set_ollama_config
# Configure confidence threshold (0.0-1.0, default: 0.7)
set_ollama_config(confidence_threshold=0.8) # More restrictive
# Get text with confidence score
text, confidence = xtxt_image_with_confidence("document.png", mode="ocr")
print(f"Confidence: {confidence:.2f} ({confidence*100:.1f}%)")
print(f"Text: {text}")
# Check if reliable
if confidence < 0.7:
    print("⚠️ Low confidence - result may be unreliable")
    print("Consider using traditional OCR or manual verification")
else:
    print("✅ Good confidence - result likely reliable")
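The advice to fall back to traditional OCR when confidence is low can be automated. This is a sketch under stated assumptions: `ocr_with_fallback` is a hypothetical helper, the AI callable is assumed to return a `(text, confidence)` tuple as `xtxt_image_with_confidence` does above, and the fallback callable is left to the caller because this README does not say which backend `xtxt` picks when both OCR groups are installed.

```python
from typing import Callable, Tuple


def ocr_with_fallback(
    path: str,
    ai_ocr: Callable[[str], Tuple[str, float]],
    fallback_ocr: Callable[[str], str],
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Return (text, source), where source is "ai" or "fallback"."""
    text, confidence = ai_ocr(path)
    if confidence >= threshold:
        return text, "ai"
    # Confidence is below the threshold: use the fallback extractor instead.
    return fallback_ocr(path), "fallback"
```

The `source` value in the result makes it easy to log how often the fallback path was taken.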
A complete example script for command-line usage is available:
# Download and run the example script
import requests
example_url = "https://raw.githubusercontent.com/yourusername/pyxtxt/main/ocr_example.py"
with open("ocr_example.py", "wb") as f:
    f.write(requests.get(example_url).content)
# Usage examples:
# python ocr_example.py document.png
# python ocr_example.py chart.jpg --mode=describe --lang=italian --style=detailed
# python ocr_example.py diagram.png --mode=describe --length=long --temp=0.3
# NEW: Confidence scoring examples
# python ocr_example.py suspicious.png --show-confidence
# python ocr_example.py medical.png --confidence=0.9 --show-confidence
# python ocr_example.py diagram.png --confidence=0.5 --mode=describe --show-confidence
The script supports:
from pyxtxt import extxt_available_formats
# List supported MIME types
formats = extxt_available_formats()
print(formats)
# Pretty format names
formats = extxt_available_formats(pretty=True)
print(formats)
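The list of MIME types can be used for a quick pre-check before attempting extraction. The sketch below assumes `extxt_available_formats()` returns an iterable of MIME type strings (the comment above suggests this, but the exact return type is not shown here). `is_probably_supported` is a hypothetical helper that guesses the type from the file extension only, which is weaker than content-based detection with python-magic.

```python
import mimetypes
from typing import Iterable


def is_probably_supported(filename: str, supported_mime_types: Iterable[str]) -> bool:
    """Guess the MIME type from the file extension and look it up."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type is not None and mime_type in set(supported_mime_types)
```

With PyxTxt installed, a call might look like `is_probably_supported("report.pdf", extxt_available_formats())`.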
import requests
from pyxtxt import xtxt
# API responses
api_response = requests.post("https://api.example.com/generate-pdf")
text = xtxt(api_response.content)
# File uploads (Flask/Django)
uploaded_bytes = request.files['document'].read()
text = xtxt(uploaded_bytes)
# Audio/video transcription services
audio_response = requests.get("https://api.example.com/recording.mp3")
transcript = xtxt(audio_response.content)
# Video transcription from API
video_response = requests.get("https://api.example.com/meeting.mp4")
transcript = xtxt(video_response.content)
# OCR for uploaded images
image_bytes = request.files['receipt'].read()
text = xtxt(image_bytes)
# Email attachments
attachment_bytes = email_msg.get_payload(decode=True)
text = xtxt(attachment_bytes)
antiword installation:
sudo apt-get update && sudo apt-get install antiword
⚠️ EXPERIMENTAL TECHNOLOGY: AI-powered features (OCR-Ollama, audio transcription) are based on machine learning models and may produce:
🚨 DO NOT USE for critical applications such as:
✅ Content discovery and initial text extraction
✅ Batch processing of low-stakes content
✅ Development and prototyping workflows
✅ Personal document organization
✅ Educational and learning projects
After installing PyxTxt from PyPI, you can access comprehensive usage examples including local file processing, memory buffer handling, web content extraction, error handling patterns, and all supported formats demonstration:
import pkg_resources
# Get path to examples file
examples_path = pkg_resources.resource_filename('pyxtxt', 'examples.py')
print(f"Examples file location: {examples_path}")
# Run the examples directly
exec(open(examples_path).read())
# Or read the content to view examples
examples_content = pkg_resources.resource_string('pyxtxt', 'examples.py').decode('utf-8')
print(examples_content)
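`pkg_resources` is deprecated by setuptools in favor of `importlib.resources`, so on Python 3.9 or newer the same lookup can be done with the standard library alone. The following is a minimal sketch; `read_package_file` is a hypothetical helper, and it assumes `examples.py` ships inside the installed `pyxtxt` package as described above.

```python
from importlib.resources import files


def read_package_file(package: str, filename: str) -> str:
    """Return the text of a file shipped inside an installed package."""
    return files(package).joinpath(filename).read_text(encoding="utf-8")
```

With PyxTxt installed, `print(read_package_file("pyxtxt", "examples.py"))` would display the examples without executing them.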
Distributed under the MIT License. See LICENSE file for details.
The software is provided "as is" without any warranty of any kind.
Pull requests, issues, and feedback are warmly welcome! 🚀
set_ollama_config() for fine-tuning LLM behavior
xtxt_image_with_confidence() - Returns text + confidence score
--show-confidence and --confidence CLI parameters
[audio], [ocr], [all] installation groups
bytes objects (web downloads, API responses)
requests.Response objects (direct HTTP processing)
xtxt_from_url() helper function for direct URL processing