
Research
/Security News
Critical Vulnerability in NestJS Devtools: Localhost RCE via Sandbox Escape
A flawed sandbox in @nestjs/devtools-integration lets attackers run code on your machine via CSRF, leading to full Remote Code Execution (RCE).
ocr-document-converter
Advanced tools
Enterprise-grade OCR and document conversion tool with dual OCR engines
Transform any document into searchable, editable text with enterprise-grade OCR technology
Designed and Built by Beau Lewis
Enterprise-Grade OCR โข Multi-Language โข AI-Powered โข Cross-Platform โข Professional GUI
A powerful, enterprise-ready OCR (Optical Character Recognition) document converter with advanced image processing, multi-language support, and intelligent text extraction. Features Tesseract and EasyOCR engines, batch processing, and professional deployment options.
๐ Quick Start โข โจ Features โข ๐ Formats โข ๐ ๏ธ Installation โข โ๏ธ Configuration โข ๐ Usage โข ๐ Project Structure โข ๐ค Contributing
OCR Document Converter is a professional-grade, enterprise-ready OCR application that extracts text from images and documents using advanced AI-powered engines. Built with dual OCR backends (Tesseract & EasyOCR), intelligent preprocessing, and multi-language support for maximum accuracy.
Clone this repository:
git clone https://github.com/Beaulewis1977/quick_ocr_doc_converter.git
cd quick_ocr_doc_converter
Run the automated setup:
python setup_ocr_environment.py
Launch the application:
python universal_document_converter_ocr.py
Or use one of the launchers:
run_ocr_converter.bat
or โก Quick Launch OCR.bat
python launch_ocr.py
python cli.py input.pdf -o output.txt -t txt --ocr
Install Python dependencies:
pip install -r requirements.txt
Install Tesseract OCR:
brew install tesseract
sudo apt-get install tesseract-ocr
Install additional language packs (optional):
# Example for German and French
sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra
Format | Extension | Description | OCR Quality |
---|---|---|---|
JPEG | .jpg , .jpeg | Standard photo format | โญโญโญโญ |
PNG | .png | Lossless image format | โญโญโญโญโญ |
TIFF | .tiff , .tif | High-quality document format | โญโญโญโญโญ |
BMP | .bmp | Windows bitmap format | โญโญโญโญ |
GIF | .gif | Animated/static images | โญโญโญ |
WebP | .webp | Modern web format | โญโญโญโญ |
.pdf | Document format (image-based) | โญโญโญโญโญ |
.txt
) - Clean, formatted text.rtf
) - Formatted text with styling.docx
) - Professional documents.pdf
) - Searchable PDF with OCR layer.json
) - Structured data with metadata.csv
) - Tabular data extraction# tesseract_config.json
{
"engine": "tesseract",
"language": "eng+fra+deu", # Multiple languages
"oem": 3, # OCR Engine Mode (0-3)
"psm": 6, # Page Segmentation Mode (0-13)
"dpi": 300, # Target DPI for processing
"preprocessing": {
"denoise": true,
"contrast_enhance": true,
"rotation_correction": true
}
}
# easyocr_config.json
{
"engine": "easyocr",
"languages": ["en", "fr", "de"],
"gpu": false, # Use GPU acceleration
"batch_size": 1,
"workers": 0, # Number of worker threads
"confidence_threshold": 0.5
}
# gui_settings.json
{
"theme": "modern", # UI theme
"auto_preview": true, # Show preview automatically
"batch_size": 10, # Max files per batch
"output_directory": "./output",
"cache_duration": 24, # Hours to keep cache
"language_detection": true,
"progress_notifications": true
}
# processing_config.json
{
"max_threads": 4, # Parallel processing threads
"memory_limit": "2GB", # Maximum memory usage
"timeout": 300, # Processing timeout (seconds)
"retry_attempts": 3, # Retry failed operations
"temp_directory": "./temp",
"log_level": "INFO" # DEBUG, INFO, WARNING, ERROR
}
# Install additional Tesseract language packs
sudo apt-get install tesseract-ocr-[LANG]
# Common language codes:
# eng (English), fra (French), deu (German), spa (Spanish)
# chi_sim (Chinese Simplified), jpn (Japanese), ara (Arabic)
# rus (Russian), kor (Korean), hin (Hindi), por (Portuguese)
# language_config.json
{
"auto_detect": true,
"fallback_language": "eng",
"confidence_threshold": 0.8,
"supported_languages": [
"eng", "fra", "deu", "spa", "ita", "por",
"rus", "chi_sim", "jpn", "kor", "ara", "hin"
]
}
Launch the application:
python universal_document_converter_ocr.py
Basic OCR Process:
Batch Processing:
The OCR Document Converter includes a powerful CLI for automation and integration.
# Single file OCR
python cli.py document.jpg -o result.txt -t txt --ocr
# Convert without OCR
python cli.py document.pdf -o document.md -t md
# Batch processing
python cli.py *.jpg -o converted/ -t txt --ocr
# Specify OCR language
python cli.py scan.png -o text.txt --ocr --language fra
# For VFP9/VB6 users - simple command line execution
python cli.py input.md -o output.rtf -t rtf --quiet
# Full command with all options
python ocr_engine/ocr_engine.py \
--input document.pdf \
--output result.docx \
--engine easyocr \
--language en,fr,de \
--confidence 0.7 \
--preprocessing \
--format docx \
--dpi 300
Argument | Description | Example |
---|---|---|
--input | Input file/pattern | document.jpg , "*.png" |
--output | Output file | result.txt |
--output-dir | Output directory | ./results/ |
--engine | OCR engine | tesseract , easyocr , auto |
--language | Language codes | eng , eng+fra , en,fr,de |
--confidence | Confidence threshold | 0.5 to 1.0 |
--format | Output format | txt , docx , pdf , json |
--dpi | Target DPI | 150 , 300 , 600 |
--preprocessing | Enable preprocessing | Flag (no value) |
--batch-size | Batch processing size | 5 , 10 , 20 |
--threads | Number of threads | 1 , 4 , 8 |
from ocr_engine import OCREngine
# Initialize OCR engine
ocr = OCREngine(engine='tesseract', language='eng')
# Process single file
result = ocr.extract_text('document.jpg')
print(result.text)
# Save to file
ocr.save_result(result, 'output.txt', format='txt')
from ocr_engine import OCREngine, OCRConfig
# Custom configuration
config = OCRConfig(
engine='easyocr',
languages=['en', 'fr'],
confidence_threshold=0.8,
preprocessing=True,
dpi=300
)
# Initialize with config
ocr = OCREngine(config=config)
# Batch processing
files = ['doc1.jpg', 'doc2.png', 'doc3.pdf']
results = ocr.process_batch(files)
for file, result in results.items():
print(f"{file}: {result.confidence:.2f}")
ocr.save_result(result, f"{file}.txt")
from ocr_engine import OCREngine, OCRError
try:
ocr = OCREngine()
result = ocr.extract_text('document.jpg')
if result.confidence < 0.5:
print("Warning: Low confidence OCR result")
except OCRError as e:
print(f"OCR Error: {e}")
except FileNotFoundError:
print("Input file not found")
except Exception as e:
print(f"Unexpected error: {e}")
ocr_document_converter/
โโโ ๐ ocr_engine/ # Core OCR engine modules
โ โโโ __init__.py # Package initialization
โ โโโ ocr_engine.py # Main OCR engine class
โ โโโ ocr_engine_minimal.py # Lightweight OCR implementation
โ โโโ image_processor.py # Image preprocessing utilities
โ โโโ format_detector.py # File format detection
โ โโโ ocr_integration.py # Integration layer
โ
โโโ ๐ gui/ # GUI components
โ โโโ universal_document_converter_ocr.py # Main GUI application
โ โโโ universal_document_converter_enhanced.py # Enhanced GUI features
โ โโโ ocr_gui_integration.py # GUI-OCR integration
โ
โโโ ๐ tests/ # Test suite
โ โโโ test_ocr_integration.py # Integration tests
โ โโโ validate_ocr_integration.py # Validation scripts
โ โโโ test_data/ # Sample test files
โ โโโ sample_document.jpg
โ โโโ multi_language.png
โ โโโ low_quality.pdf
โ
โโโ ๐ config/ # Configuration files
โ โโโ tesseract_config.json # Tesseract settings
โ โโโ easyocr_config.json # EasyOCR settings
โ โโโ gui_settings.json # GUI preferences
โ โโโ language_config.json # Language settings
โ
โโโ ๐ output/ # Default output directory
โโโ ๐ temp/ # Temporary processing files
โโโ ๐ cache/ # OCR result cache
โโโ ๐ logs/ # Application logs
โ
โโโ ๐ requirements.txt # Python dependencies
โโโ ๐ setup_ocr_environment.py # Automated setup script
โโโ ๐ README.md # This comprehensive guide
โโโ ๐ OCR_README.md # Technical OCR documentation
โโโ ๐ OCR_INTEGRATION_COMPLETE.md # Integration completion notes
โโโ ๐ .gitignore # Git ignore rules
โโโ ๐ LICENSE # MIT License
File | Purpose | Key Features |
---|---|---|
ocr_engine/ocr_engine.py | Main OCR processing | Dual engine support, batch processing |
universal_document_converter_ocr.py | GUI application | Drag-drop, settings panel, progress tracking |
setup_ocr_environment.py | Automated installer | Dependencies, Tesseract, language packs |
test_ocr_integration.py | Comprehensive tests | Unit tests, integration tests, benchmarks |
validate_ocr_integration.py | Validation suite | System validation, performance tests |
requirements.txt | Dependencies | All Python packages with versions |
# Run all tests
python test_ocr_integration.py
# Run validation suite
python validate_ocr_integration.py
# Run specific test categories
python test_ocr_integration.py --category unit
python test_ocr_integration.py --category integration
python test_ocr_integration.py --category performance
Test Category | Files Tested | Success Rate | Avg. Processing Time |
---|---|---|---|
English Text | 100+ | 98.5% | 2.3s per page |
Multi-Language | 50+ | 95.2% | 3.1s per page |
Low Quality | 30+ | 87.8% | 4.2s per page |
Batch Processing | 500+ | 97.1% | 1.8s per page |
File: Universal-Document-Converter-v3.1.0-Windows-Complete.zip
(59 KB)
Contains EVERYTHING including:
cli.py
)# Download from GitHub Releases
https://github.com/Beaulewis1977/quick_ocr_doc_converter/releases/latest/download/Universal-Document-Converter-v2.1.0-Windows-Complete.zip
File: UniversalConverter32.dll.zip
(12 KB)
For users who ONLY need VFP9/VB6 integration:
# Download DLL package only
https://github.com/Beaulewis1977/quick_ocr_doc_converter/releases/latest/download/UniversalConverter32.dll.zip
install.bat
as Administratorrun_ocr_converter.bat
# Clone and setup in one command
git clone https://github.com/Beaulewis1977/quick_ocr_document_converter.git
cd quick_ocr_document_converter
python setup_ocr_environment.py
# Create virtual environment (recommended)
python -m venv ocr_env
source ocr_env/bin/activate # Linux/Mac
# or
ocr_env\Scripts\activate # Windows
# Install Python dependencies
pip install -r requirements.txt
Windows:
# Download and install from:
# https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH: C:\Program Files\Tesseract-OCR
macOS:
# Using Homebrew
brew install tesseract
# Install language packs
brew install tesseract-lang
Linux (Ubuntu/Debian):
# Install Tesseract
sudo apt-get update
sudo apt-get install tesseract-ocr
# Install language packs
sudo apt-get install tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu
Linux (CentOS/RHEL):
# Install Tesseract
sudo yum install epel-release
sudo yum install tesseract tesseract-langpack-eng
# Install PyTorch (CPU version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# For GPU support (optional)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Dockerfile
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
tesseract-ocr \
tesseract-ocr-eng \
tesseract-ocr-fra \
tesseract-ocr-deu \
libgl1-mesa-glx \
libglib2.0-0
# Copy application
COPY . /app
WORKDIR /app
# Install Python dependencies
RUN pip install -r requirements.txt
# Run application
CMD ["python", "universal_document_converter_ocr.py"]
# Build and run Docker container
docker build -t ocr-converter .
docker run -p 8080:8080 -v $(pwd)/output:/app/output ocr-converter
# Error: TesseractNotFoundError
# Solution: Add Tesseract to PATH
export PATH=$PATH:/usr/local/bin/tesseract # Linux/Mac
# or add C:\Program Files\Tesseract-OCR to Windows PATH
# Try different preprocessing options
config = {
"preprocessing": {
"denoise": True,
"contrast_enhance": True,
"rotation_correction": True,
"dpi_optimization": True
}
}
# Reduce batch size and enable memory optimization
config = {
"batch_size": 1,
"memory_limit": "1GB",
"enable_gc": True
}
# Specify languages explicitly
config = {
"language": "eng+fra+deu", # Multiple languages
"auto_detect": False
}
# Enable debug logging
export OCR_DEBUG=1
python universal_document_converter_ocr.py --debug
# Check log files
tail -f logs/ocr_debug.log
logs/ocr_application.log
python validate_ocr_integration.py
tests/test_data/
git checkout -b feature/amazing-feature
python test_ocr_integration.py
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature
# Clone your fork
git clone https://github.com/YOUR_USERNAME/quick_ocr_document_converter.git
cd quick_ocr_document_converter
# Create development environment
python -m venv dev_env
source dev_env/bin/activate
# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Run tests
python -m pytest tests/
# Run linting
flake8 ocr_engine/
black ocr_engine/
This project is licensed under the MIT License - see the LICENSE file for details.
Building and maintaining OCR Document Converter takes time and resources. While the tool is completely free, your voluntary support helps ensure continued development and improvements.
If this tool has saved you time or added value to your work, consider showing your appreciation:
Venmo: @BeauinTulsa
Ko-fi: https://ko-fi.com/beaulewis
Together, we're making document conversion accessible to everyone. Thank you! ๐ช
Made with โค๏ธ for the OCR community
โญ Star this repository if it helped you! โญ
create_executable.py
python universal_document_converter.py
Input Formats (6) | Output Formats (5) |
---|---|
DOCX - Microsoft Word Documents | Markdown - GitHub-flavored markdown |
PDF - Portable Document Format | TXT - Plain text with formatting |
TXT - Plain text files | HTML - Clean, semantic HTML |
HTML - Web pages and documents | RTF - Rich Text Format |
RTF - Rich Text Format | EPUB - Electronic Publication (eBooks) |
EPUB - Electronic Publication (eBooks) |
Total Conversion Combinations: 30 (6 ร 5)
FAQs
Enterprise-grade OCR and document conversion tool with dual OCR engines
We found that ocr-document-converter demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago.ย It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
/Security News
A flawed sandbox in @nestjs/devtools-integration lets attackers run code on your machine via CSRF, leading to full Remote Code Execution (RCE).
Product
Customize license detection with Socketโs new license overlays: gain control, reduce noise, and handle edge cases with precision.
Product
Socket now supports Rust and Cargo, offering package search for all users and experimental SBOM generation for enterprise projects.