
Security News
Attackers Are Hunting High-Impact Node.js Maintainers in a Coordinated Social Engineering Campaign
Multiple high-impact npm maintainers confirm they have been targeted in the same social engineering campaign that compromised Axios.
preocr
Advanced tools
A fast, layout-aware OCR decision engine for document processing pipelines. Detects whether files truly require OCR before expensive processing, reducing unnecessary OCR calls while preserving extraction reliability.
Open-source Python library for OCR detection and document extraction. Detect if PDFs need OCR before expensive processing—save 50–70% on OCR costs.
2–10× faster than alternatives • 100% accuracy on benchmark • CPU-only, no GPU required
🌐 preocr.io • Installation • Quick Start • API Reference • Examples • Performance
| Metric | Result |
|---|---|
| Accuracy | 100% (TP=1, FP=0, TN=9, FN=0) |
| Latency | ~2.7s mean, ~1.9s median (≤1MB PDFs) |
| Office docs | ~7ms |
| Focus | Zero false positives. Zero missed scans. |
PreOCR is an open-source Python OCR detection library that determines whether documents need OCR before you run expensive processing. It analyzes PDFs, Office documents (DOCX, PPTX, XLSX), images, and text files to detect if they're already machine-readable—helping you skip OCR for 50–70% of documents and cut costs.
Use PreOCR to filter documents before Tesseract, AWS Textract, Google Vision, Azure Document Intelligence, or MinerU. Works offline, CPU-only, with 100% accuracy on validation benchmarks.
| Feature | PreOCR 🏆 | Unstructured.io | Docugami |
|---|---|---|---|
| Speed | < 1 second | 5-10 seconds | 10-20 seconds |
| Cost Optimization | ✅ Skip OCR 50-70% | ❌ No | ❌ No |
| Page-Level Processing | ✅ Yes | ❌ No | ❌ No |
| Type Safety | ✅ Pydantic | ⚠️ Basic | ⚠️ Basic |
| Open Source | ✅ Yes | ✅ Partial | ❌ Commercial |
pip install preocr
from preocr import needs_ocr
result = needs_ocr("document.pdf")
if result["needs_ocr"]:
print("File needs OCR processing")
# Run your OCR engine here (MinerU, Tesseract, etc.)
else:
print("File is already machine-readable")
# Extract text directly
from preocr import extract_native_data
# Extract structured data from PDF
result = extract_native_data("invoice.pdf")
# Access elements, tables, forms
for element in result.elements:
print(f"{element.element_type}: {element.text}")
# Export to Markdown for LLM consumption (returns complete + pagewise)
res = extract_native_data("document.pdf", output_format="markdown")
full_md = res["complete"]
page_1_md = res["pagewise"][1]
from preocr import BatchProcessor
processor = BatchProcessor(max_workers=8)
results = processor.process_directory("documents/")
results.print_summary()
needs_ocr)skip_opencv_image_guard (default 50%) tunable per domainocr_complexity_score drives Tesseract/Paddle/Vision LLM selection when OCR neededplan_ocr_for_document / intent_refinement)intent_refinement — refines needs_ocr with domain-specific scoringSee docs/OCR_DECISION_MODEL.md for the full specification.
prepare_for_ocr)needs_ocr hints (suggest_preprocessing) when steps="auto"quality (full) or fast (skip denoise/rescale; deskew severe-only)config.auto_fixreturn_meta=True → applied_steps, skipped_steps, auto_detectedRequires pip install preocr[layout-refinement].
extract_native_data){"complete": str, "pagewise": Dict[int, str]} (v1.8.1)pip install preocr
For improved accuracy on edge cases:
pip install preocr[layout-refinement]
libmagic is required for file type detection:
sudo apt-get install libmagic1sudo yum install file-devel or sudo dnf install file-develbrew install libmagicpython-magic-bin packagefrom preocr import needs_ocr
result = needs_ocr("document.pdf")
print(f"Needs OCR: {result['needs_ocr']}")
print(f"Confidence: {result['confidence']:.2f}")
print(f"Reason: {result['reason']}")
result = needs_ocr("document.pdf", layout_aware=True)
# Debug misclassifications via raw signals
print(result["signals"]) # text_length, image_coverage, font_count, etc.
print(result["decision"]) # needs_ocr, confidence, reason_code
print(result["hints"]) # suggested_engine, suggest_preprocessing (when needs_ocr=True)
from preocr import plan_ocr_for_document # or intent_refinement
result = plan_ocr_for_document("hospital_discharge.pdf")
print(f"Needs OCR (any page): {result['needs_ocr_any']}")
print(f"Summary: {result['summary_reason']}")
for page in result["pages"]:
score = page.get("debug", {}).get("score", 0)
print(f" Page {page['page_number']}: needs_ocr={page['needs_ocr']} "
f"type={page['decision_type']} score={score:.2f}")
result = needs_ocr("document.pdf", layout_aware=True)
if result.get("layout"):
layout = result["layout"]
print(f"Layout Type: {layout['layout_type']}")
print(f"Text Coverage: {layout['text_coverage']}%")
print(f"Image Coverage: {layout['image_coverage']}%")
result = needs_ocr("mixed_document.pdf", page_level=True)
if result["reason_code"] == "PDF_MIXED":
print(f"Mixed PDF: {result['pages_needing_ocr']} pages need OCR")
for page in result["pages"]:
if page["needs_ocr"]:
print(f" Page {page['page_number']}: {page['reason']}")
from preocr import extract_native_data
# Extract as Pydantic model
result = extract_native_data("document.pdf")
# Access elements
for element in result.elements:
print(f"{element.element_type}: {element.text[:50]}...")
print(f" Confidence: {element.confidence:.2%}")
print(f" Bounding box: {element.bbox}")
# Access tables
for table in result.tables:
print(f"Table: {table.rows} rows × {table.columns} columns")
for cell in table.cells:
print(f" Cell [{cell.row}, {cell.col}]: {cell.text}")
# JSON output
json_data = extract_native_data("document.pdf", output_format="json")
# Markdown output (LLM-ready): returns {"complete": str, "pagewise": {1: str, 2: str, ...}}
res = extract_native_data("document.pdf", output_format="markdown")
full_md = res["complete"]
page_1_md = res["pagewise"][1]
# Clean markdown (content only, no metadata)
res = extract_native_data(
"document.pdf",
output_format="markdown",
markdown_clean=True
)
# Extract only pages 1-3
result = extract_native_data("document.pdf", pages=[1, 2, 3])
from preocr import BatchProcessor
# Configure processor
processor = BatchProcessor(
max_workers=8,
use_cache=True,
layout_aware=True,
page_level=True,
extensions=["pdf", "docx"],
)
# Process directory
results = processor.process_directory("documents/", progress=True)
# Get statistics
stats = results.get_statistics()
print(f"Processed: {stats['processed']} files")
print(f"Needs OCR: {stats['needs_ocr']} ({stats['needs_ocr']/stats['processed']*100:.1f}%)")
prepare_for_ocr)Apply detection-aware preprocessing before OCR. Use steps="auto" to let needs_ocr hints drive which steps run:
from preocr import needs_ocr, prepare_for_ocr
# Option A: steps="auto" (uses needs_ocr hints automatically)
result, meta = prepare_for_ocr("scan.pdf", steps="auto", return_meta=True)
print(meta["applied_steps"]) # e.g. ['otsu', 'deskew']
# Option B: Wire hints manually
ocr_result = needs_ocr("scan.pdf", layout_aware=True)
if ocr_result["needs_ocr"]:
hints = ocr_result["hints"]
preprocess = hints.get("suggest_preprocessing", [])
preprocessed = prepare_for_ocr("scan.pdf", steps=preprocess)
# Option C: Explicit steps
preprocessed = prepare_for_ocr(img_array, steps=["denoise", "otsu"], mode="fast")
# Option D: No preprocessing
preprocessed = prepare_for_ocr(img, steps=None) # unchanged
Steps: denoise → deskew → otsu → rescale. Modes: quality (full) or fast (skip denoise/rescale). Requires pip install preocr[layout-refinement].
from preocr import needs_ocr, prepare_for_ocr, extract_native_data
def process_document(file_path):
ocr_check = needs_ocr(file_path, layout_aware=True)
if ocr_check["needs_ocr"]:
# Preprocess then run OCR (steps from hints, or use steps="auto")
preprocessed = prepare_for_ocr(file_path, steps="auto")
hints = ocr_check["hints"]
engine = hints.get("suggested_engine", "tesseract") # tesseract | paddle | vision_llm
# Run OCR on preprocessed image(s)
return {"source": "ocr", "engine": engine, "text": "..."}
else:
result = extract_native_data(file_path)
return {"source": "native", "text": result.text}
PreOCR supports 20+ file formats for OCR detection and extraction:
| Format | OCR Detection | Extraction | Notes |
|---|---|---|---|
| ✅ Full | ✅ Full | Page-level analysis, layout-aware | |
| DOCX/DOC | ✅ Yes | ✅ Yes | Tables, metadata |
| PPTX/PPT | ✅ Yes | ✅ Yes | Slides, text |
| XLSX/XLS | ✅ Yes | ✅ Yes | Cells, tables |
| Images | ✅ Yes | ⚠️ Limited | PNG, JPG, TIFF, etc. |
| Text | ✅ Yes | ✅ Yes | TXT, CSV, HTML |
| Structured | ✅ Yes | ✅ Yes | JSON, XML |
from preocr import needs_ocr, Config
config = Config(
min_text_length=75,
min_office_text_length=150,
layout_refinement_threshold=0.85,
)
result = needs_ocr("document.pdf", config=config)
Core:
min_text_length: Minimum text length (default: 50)min_office_text_length: Minimum office text length (default: 100)layout_refinement_threshold: OpenCV trigger threshold (default: 0.9)Confidence Band:
confidence_exit_threshold: Skip OpenCV when confidence ≥ this (default: 0.90)confidence_light_refinement_min: Light refinement when confidence in [this, exit) (default: 0.50)skip_opencv_image_guard: In 0.75–0.90 band, run OpenCV only if image_coverage > this % (default: 50)Page-Level:
variance_page_escalation_std: Run full page-level when std(page_scores) > this (default: 0.18)Skip Heuristics:
skip_opencv_if_file_size_mb: Skip OpenCV when file size ≥ N MB (default: None)skip_opencv_if_page_count: Skip OpenCV when page count ≥ N (default: None)skip_opencv_max_image_coverage: Never skip when image_coverage > this (default: None)Bias Rules:
digital_bias_text_coverage_min: Force no-OCR when text_coverage ≥ this and image_coverage low (default: 65)table_bias_text_density_min: For mixed layout, treat as digital when text_density ≥ this (default: 1.5)PreOCR provides structured reason codes for programmatic handling:
No OCR Needed:
TEXT_FILE - Plain text fileOFFICE_WITH_TEXT - Office document with sufficient textPDF_DIGITAL - Digital PDF with extractable textSTRUCTURED_DATA - JSON/XML filesOCR Needed:
IMAGE_FILE - Image filePDF_SCANNED - Scanned PDFPDF_MIXED - Mixed digital and scanned pagesOFFICE_NO_TEXT - Office document with insufficient textExample:
result = needs_ocr("document.pdf")
if result["reason_code"] == "PDF_MIXED":
# Handle mixed PDF
process_mixed_pdf(result)
| Scenario | Time | Accuracy |
|---|---|---|
| Fast Path (Heuristics) | < 150ms | ~99% |
| OpenCV Refinement | 150-300ms | 92-96% |
| Typical (single file) | < 1 second | 94-97% |
Typical: most PDFs finish in under 1 second. Heuristics-only files: 120–180ms avg. Large or mixed documents may take 1–3s with OpenCV.
PreOCR Batch Benchmark: 192 PDFs, 1.5 files/sec, median 1134ms
Average Processing Time by File Type
Latency Summary (Mean, Median, P95)
Confusion Matrix (TP:1, FP:0, TN:9, FN:0)
PreOCR uses a hybrid adaptive pipeline with early exits, confidence bands, and optional OpenCV refinement.
File Input
↓
File Type Detection (mime, extension)
↓
Text Extraction (PDF/Office/Text/Image probe)
↓
┌─────────────────────────────────────────────────────────────────┐
│ PDF Early Exits (before layout/OpenCV) │
├─────────────────────────────────────────────────────────────────┤
│ 1. Hard Digital Guard: text_length ≥ threshold → NO OCR, return │
│ 2. Hard Scan Shortcut: image>85%, text<10, font_count==0 → OCR │
└─────────────────────────────────────────────────────────────────┘
↓ (if no early exit)
Layout Analysis (optional, if layout_aware=True)
↓
Collect Signals (text_length, image_coverage, font_count, text quality, etc.)
↓
Decision Engine (rule-based heuristics + OCR_SCORE)
↓
Confidence Band → OpenCV Refinement (PDFs only)
├─ ≥ 0.90: Immediate exit (skip OpenCV)
├─ 0.75–0.90: Skip OpenCV unless image_coverage > skip_opencv_image_guard (default 50%)
├─ 0.50–0.75: Light refinement (sample 2–3 pages)
└─ < 0.50: Full OpenCV refinement
↓
Return Result (signals, decision, hints)
| Exit | Condition | Action |
|---|---|---|
| Hard Digital | text_length ≥ hard_digital_text_threshold | NO OCR, return immediately |
| Hard Scan | image_coverage > 85%, text_length < 10, font_count == 0 | Needs OCR, skip layout/OpenCV |
The font_count == 0 guard prevents digital PDFs with background raster images from being misclassified as scans.
| Confidence | Action |
|---|---|
| ≥ 0.90 | Skip OpenCV entirely |
| 0.75–0.90 | Skip OpenCV unless image_coverage > skip_opencv_image_guard (default 50%) |
| 0.50–0.75 | Light refinement (2–3 pages sampled) |
| < 0.50 | Full OpenCV refinement |
When page_level=True, full page-level analysis runs only when std(page_scores) > 0.18. For uniform documents (all digital or all scanned), doc-level decision is reused for all pages—faster for large PDFs.
PDFs with a text layer but broken/invisible text (garbage) are detected via:
non_printable_ratio > 5%unicode_noise_ratio > 8%Such files are treated as needing OCR to avoid false negatives.
result = needs_ocr("document.pdf", layout_aware=True)
# Flat keys (backward compatible)
result["needs_ocr"] # True/False
result["confidence"] # 0.0–1.0
result["reason_code"] # e.g. "PDF_DIGITAL", "PDF_SCANNED"
# Structured (for debugging)
result["signals"] # Raw: text_length, image_coverage, font_count, non_printable_ratio, etc.
result["decision"] # {needs_ocr, confidence, reason_code, reason}
result["hints"] # {suggested_engine, suggest_preprocessing, ocr_complexity_score}
When needs_ocr=True, hints provides:
tesseract (< 0.3) | paddle (0.3–0.7) | vision_llm (> 0.7)["deskew", "otsu", "denoise"] — use with prepare_for_ocr(path, steps="auto")needs_ocr(file_path, page_level=False, layout_aware=False, config=None)Determine if a file needs OCR processing.
Parameters:
file_path (str or Path): Path to filepage_level (bool): Page-level analysis for PDFs (default: False)layout_aware (bool): Layout analysis for PDFs (default: False)config (Config): Custom configuration (default: None)Returns: Dictionary with:
needs_ocr, confidence, reason_code, reason, file_type, category{needs_ocr, confidence, reason_code, reason}{suggested_engine, suggest_preprocessing, ocr_complexity_score} when needs_ocr=Trueextract_native_data(file_path, include_tables=True, include_forms=True, include_metadata=True, include_structure=True, include_images=True, include_bbox=True, pages=None, output_format="pydantic", config=None)Extract structured data from machine-readable documents.
Parameters:
file_path (str or Path): Path to fileinclude_tables (bool): Extract tables (default: True)include_forms (bool): Extract form fields (default: True)include_metadata (bool): Include metadata (default: True)include_structure (bool): Detect sections (default: True)include_images (bool): Detect images (default: True)include_bbox (bool): Include bounding boxes (default: True)pages (list): Page numbers to extract (default: None = all)output_format (str): "pydantic", "json", or "markdown" (default: "pydantic")config (Config): Configuration (default: None)Returns:
ExtractionResult (Pydantic), Dict (JSON), or Dict with "complete" (str) and "pagewise" (Dict[int, str]) when output_format="markdown".
BatchProcessor(max_workers=None, use_cache=True, layout_aware=False, page_level=True, extensions=None, config=None)Batch processor for multiple files with parallel processing.
Parameters:
max_workers (int): Parallel workers (default: CPU count)use_cache (bool): Enable caching (default: True)layout_aware (bool): Layout analysis (default: False)page_level (bool): Page-level analysis (default: True)extensions (list): File extensions to process (default: None)config (Config): Configuration (default: None)Methods:
process_directory(directory, progress=True) -> BatchResultsprepare_for_ocr(source, steps=None, mode="quality", return_meta=False, pages=None, dpi=300, config=None)Prepare image(s) for OCR using detection-aware preprocessing.
Parameters:
source (str, Path, or ndarray): File path or numpy arraysteps (None | "auto" | list | dict): None = no preprocessing; "auto" = use needs_ocr hints; list/dict = explicit stepsmode (str): "quality" (full) or "fast" (skip denoise, rescale; deskew severe-only)return_meta (bool): Return (img, meta) with applied_steps, skipped_steps, auto_detectedpages (list): For PDFs, 1-indexed page numbers (default: None = all)dpi (int): Target DPI for rescale step (default: 300)config (PreprocessConfig): Optional; auto_fix=True auto-adds denoise when otsu requestedReturns: Processed numpy array or list of arrays. With return_meta=True: (img, meta).
Requires: pip install preocr[layout-refinement] (OpenCV, PyMuPDF for PDFs)
| Feature | PreOCR 🏆 | Unstructured.io | Docugami |
|---|---|---|---|
| Speed | < 1 second | 5-10 seconds | 10-20 seconds |
| Cost Optimization | ✅ Skip OCR 50-70% | ❌ No | ❌ No |
| Page-Level Processing | ✅ Yes | ❌ No | ❌ No |
| Type Safety | ✅ Pydantic | ⚠️ Basic | ⚠️ Basic |
| Confidence Scores | ✅ Per-element | ❌ No | ✅ Yes |
| Open Source | ✅ Yes | ✅ Partial | ❌ Commercial |
| CPU-Only | ✅ Yes | ✅ Yes | ⚠️ May need GPU |
Overall Score: PreOCR 91.4/100 🏆
✅ Choose PreOCR when:
PreOCR focuses on OCR routing—it doesn't perform extraction by default. Use it as a pre-filter: call needs_ocr() first, then route to your OCR engine or to extract_native_data() for digital documents. The API is simple: needs_ocr(path), extract_native_data(path), BatchProcessor.
1. File type detection fails
libmagic: sudo apt-get install libmagic1 (Linux) or brew install libmagic (macOS)2. PDF text extraction returns empty
pdfplumber and PyMuPDF3. OpenCV layout analysis not working
pip install preocr[layout-refinement]python -c "import cv2; print(cv2.__version__)"4. Low confidence scores
needs_ocr(file_path, layout_aware=True)Does PreOCR perform OCR?
No. PreOCR is an OCR detection library—it analyzes files to determine if OCR is needed. It does not run OCR itself. Use it to decide whether to call Tesseract, Textract, or another OCR engine.
How accurate is PreOCR for PDF OCR detection?
PreOCR achieves 92–95% accuracy with the hybrid pipeline. Validation on benchmark datasets reached 100% accuracy (10/10 PDFs correct).
Can I use PreOCR with AWS Textract, Google Vision, or Azure Document Intelligence?
Yes. PreOCR is ideal for filtering documents before sending them to cloud OCR APIs. Skip OCR for digital PDFs to reduce API costs.
Does PreOCR work offline?
Yes. PreOCR is CPU-only and runs fully offline—no API keys or internet required.
How do I customize OCR detection thresholds?
Use the Config class or pass threshold parameters to BatchProcessor. See Configuration.
Is there an HTTP/REST API?
PreOCR is a Python library. For HTTP APIs, wrap it in FastAPI or Flask—see preocr.io for hosted options.
# Clone repository
git clone https://github.com/yuvaraj3855/preocr.git
cd preocr
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run benchmarks (add PDFs to datasets/ for testing)
python examples/test_preprocess_flow.py # Preprocess flow (needs layout-refinement)
python scripts/benchmark_preprocess_accuracy.py datasets # Preprocess + accuracy
python scripts/benchmark_accuracy.py datasets -g scripts/ground_truth_data_source_formats.json --layout-aware --page-level
python scripts/benchmark_batch_full.py datasets -v # Full dataset, PDF-wise log, diagram
python scripts/benchmark_planner.py datasets
# Run linting
ruff check preocr/
black --check preocr/
See CHANGELOG.md for complete version history.
v1.8.0 - Formatter Simplification (Latest)
v1.7.0 - Preprocess Module
steps="auto" uses needs_ocr hints; quality / fast modesv1.6.0 - Signal/Decision Separation & Confidence Band
font_count == 0 required for hard scan shortcut (avoids false positives on digital PDFs with background images)v1.1.0 - Invoice Intelligence & Advanced Extraction
v1.0.0 - Structured Data Extraction
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Apache License 2.0 - see LICENSE for details.
pip install preocrPreOCR – Python OCR detection library. Skip OCR for digital PDFs. Save time and money.
Website · GitHub · PyPI · Report Issue
FAQs
A fast, layout-aware OCR decision engine for document processing pipelines. Detects whether files truly require OCR before expensive processing, reducing unnecessary OCR calls while preserving extraction reliability.
We found that preocr demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Security News
Multiple high-impact npm maintainers confirm they have been targeted in the same social engineering campaign that compromised Axios.

Security News
Axios compromise traced to social engineering, showing how attacks on maintainers can bypass controls and expose the broader software supply chain.

Security News
Node.js has paused its bug bounty program after funding ended, removing payouts for vulnerability reports but keeping its security process unchanged.