
pdf2ocr
A CLI tool to apply OCR on PDF files and export to multiple formats.
π Features
- π Extracts text from scanned PDFs using Tesseract OCR with advanced image preprocessing
- π Outputs DOCX, HTML, EPUB and searchable PDF files with preserved paragraph structure
- π Converts DOCX to EPUB via Calibre, including metadata
- π Displays progress bars and detailed summary logs
- π Supports layout-preserving mode for high-fidelity PDF OCR
- πΌοΈ Advanced image enhancement for improved OCR accuracy on distorted documents
- β‘ Intelligent preprocessing with noise reduction, contrast optimization, and text sharpening
πΌοΈ Advanced Image Processing
pdf2ocr
includes sophisticated image preprocessing to maximize OCR accuracy, especially for documents with distortions, poor contrast, or noise:
π§ Automatic Enhancements Applied:
- Grayscale Conversion - Optimizes images for text recognition
- Auto Contrast - Automatically adjusts contrast for better text visibility
- Noise Reduction - Removes artifacts while preserving text edges
- Adaptive Histogram Equalization (CLAHE) - Enhances local contrast for varying lighting conditions
- Text Sharpening - Improves character definition and clarity
- Unsharp Masking - Fine-tunes text edges for optimal recognition
π Benefits:
- β
Better accuracy on scanned documents with poor quality
- β
Improved recognition of faded or low-contrast text
- β
Enhanced performance on documents with noise or artifacts
- β
Automatic fallbacks ensure processing never fails
- β
Works with all output formats (PDF, DOCX, HTML, EPUB)
- β
Configurable quality with
--dpi
option (72-1200 range)
βοΈ Quality Control:
The --dpi
parameter controls the resolution of PDF to image conversion:
- Low DPI (72-150): Faster processing, smaller memory usage, suitable for clean documents
- Medium DPI (200-400): Balanced quality and performance (default: 400)
- High DPI (500-1200): Maximum quality for challenging documents, slower processing
π‘ Note: All image enhancements are applied automatically - no configuration needed!
π Quick Install & Usage
Install globally
Install pdf2ocr
and use it as a command-line tool:
pip install pdf2ocr
π Usage Examples
Generate multiple output formats with logging:
pdf2ocr ./pdfs --docx --pdf --epub --html --dest-dir ./output --logfile pdf2ocr.log
Generate layout-preserving OCR PDFs only:
pdf2ocr ./pdfs --pdf --preserve-layout --dest-dir ./output --logfile pdf2ocr.log
Process multiple files in parallel with 8 workers:
pdf2ocr ./pdfs --pdf --html --epub --workers 8 --logfile pdf2ocr.log
Enable batch processing for large PDFs to reduce memory usage:
pdf2ocr ./pdfs --pdf --batch-size 5 --logfile pdf2ocr.log
High-quality OCR with custom DPI for challenging documents:
pdf2ocr ./pdfs --pdf --dpi 600 --logfile pdf2ocr.log
Fast processing for clean documents with lower DPI:
pdf2ocr ./pdfs --pdf --dpi 150 --logfile pdf2ocr.log
β οΈ When using --preserve-layout
, only PDF output is supported. Other formats will be automatically disabled.
π Language Support
pdf2ocr
is currently optimized for Portuguese π§π·π΅πΉ and uses it as the default OCR language.
You can override the language using the --lang
option. Examples:
pdf2ocr ./pdfs --pdf --lang eng
pdf2ocr ./pdfs --pdf --lang spa
pdf2ocr ./pdfs --pdf --lang fra
To check the code for all languages supported by Tesseract, run the command below:
tesseract --list-langs
π§± System Requirements and Tesseract language models
Ubuntu / Debian (APT)
Install Tesseract OCR and the most common language models:
sudo apt update && sudo apt install tesseract-ocr \
tesseract-ocr-por tesseract-ocr-eng tesseract-ocr-spa \
tesseract-ocr-fra tesseract-ocr-ita
For optimal image processing performance, also install:
sudo apt install python3-scipy python3-skimage
Or, to install all available language models:
sudo apt install tesseract-ocr-all
Fedora / Red Hat / CentOS / AlmaLinux / Rocky Linux (DNF or YUM)
OCR requirements:
sudo dnf install tesseract poppler-utils calibre
sudo yum install tesseract poppler-utils calibre
To install additional OCR language models:
sudo dnf install tesseract-langpack-por tesseract-langpack-eng \
tesseract-langpack-spa tesseract-langpack-fra tesseract-langpack-ita
There is no equivalent to tesseract-ocr-all
on Red Hat-based systems β install only the languages you need.
macOS (Homebrew)
brew install tesseract poppler
brew install --cask calibre
π‘ Tip for macOS/Homebrew users:
π Important: If ebook-convert
is not available after installing Calibre, add it to your PATH:
export PATH="$PATH:/Applications/calibre.app/Contents/MacOS"
To make it permanent:
echo 'export PATH="$PATH:/Applications/calibre.app/Contents/MacOS"' >> ~/.zshrc
source ~/.zshrc
Check Calibre installation:
ebook-convert --version
π Python Setup (for development)
To use in a virtual environment:
python3 -m venv venv_pdf2ocr
source venv_pdf2ocr/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
π¦ Dependencies
The tool includes advanced image processing capabilities with the following key dependencies:
- Core OCR:
pytesseract
, pdf2image
, pillow
- Document Generation:
python-docx
, reportlab
, pypdf
- Advanced Image Processing:
numpy
, scipy
, scikit-image
- Progress & UI:
tqdm
π‘ Note: Advanced image processing dependencies (scipy
, scikit-image
) are optional - the tool will automatically fall back to basic processing if they're not available.
βοΈ Command Line Options
pdf2ocr -h
source_folder
: Folder containing the input PDF files.
--dest-dir
: Destination folder for output files (default: same as input).
--docx
: Generate DOCX files with preserved paragraph structure.
--pdf
: Generate OCR-processed PDF files.
--epub
: Generate EPUB files (requires --docx
; uses Calibre).
--html
: Generate HTML files.
--preserve-layout
: Preserve the visual layout of original documents (PDF only).
--lang
: Set the OCR language code (default: por). Use tesseract --list-langs to check installed options.
--quiet
: Run silently without progress output.
--summary
: Display only final conversion summary.
--logfile
: Path to save detailed log output (UTF-8 encoded).
--workers
: Number of parallel workers for processing (default: 2).
--batch-size
: Number of pages to process in each batch (disabled by default). Use this to optimize memory usage for large PDFs.
--dpi
: DPI for PDF to image conversion (default: 400, range: 72-1200). Higher values improve OCR quality but increase processing time and memory usage.
--version
: show program's version number and exit
π οΈ Makefile Commands
make venv | Create and set up a virtual environment (venv_pdf2ocr ) |
make install | Install pdf2ocr globally (or into active virtualenv) |
make run | Run pdf2ocr with example parameters (PDF, DOCX, EPUB, HTML) |
make test | Run automated tests with pytest |
make lint | Run flake8 to check code quality |
make format | Auto-format code using black and isort |
make clean | Remove Python cache, logs, build files and other generated artifacts |
π License
MIT