
The docling-OCR-OnnxTR repository provides a plugin that integrates the OnnxTR OCR engine into the Docling framework, enhancing document processing capabilities with efficient and accurate text recognition.
Key Features:
- Seamless Integration: Easily incorporate OnnxTR's OCR functionalities into your Docling workflows for improved document parsing and analysis.
- Optimized Performance: Leverages OnnxTR's lightweight architecture to deliver faster inference times and reduced resource consumption compared to traditional OCR engines.
- Flexible Deployment: Supports various hardware configurations, including CPU, GPU, and OpenVINO, allowing you to choose the best setup for your needs.
Installation:
To install the plugin, use one of the following commands based on your hardware.
For GPU support, please take a look at ONNX Runtime.
- Prerequisites: CUDA and cuDNN need to be installed beforehand (see the Version table).
pip install "docling-ocr-onnxtr[cpu]"
pip install "docling-ocr-onnxtr[gpu]"
pip install "docling-ocr-onnxtr[openvino]"
pip install "docling-ocr-onnxtr[cpu-headless]"
pip install "docling-ocr-onnxtr[gpu-headless]"
pip install "docling-ocr-onnxtr[openvino-headless]"
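The six install targets above follow one naming pattern: a hardware backend (cpu, gpu, or openvino) with an optional -headless suffix. As a small illustration, the sketch below builds the matching command string. The function name `pip_command` and its parameters are hypothetical and not part of the plugin.

```python
# Hypothetical helper, not part of docling-ocr-onnxtr: it only assembles
# the extras names that appear in the install commands above.
BACKENDS = ("cpu", "gpu", "openvino")


def pip_command(backend: str = "cpu", headless: bool = False) -> str:
    """Return the pip command for one of the extras listed above."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {BACKENDS}")
    extra = f"{backend}-headless" if headless else backend
    return f'pip install "docling-ocr-onnxtr[{extra}]"'


if __name__ == "__main__":
    print(pip_command("gpu", headless=True))
```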
By integrating OnnxTR with Docling, users can achieve more efficient and accurate OCR results, enhancing the overall document processing experience.
Usage
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import (
    ConversionResult,
    DocumentConverter,
    InputFormat,
    PdfFormatOption,
)
from docling_ocr_onnxtr import OnnxtrOcrOptions


def main():
    # Source document to convert
    source = "https://arxiv.org/pdf/2408.09869v4"

    ocr_options = OnnxtrOcrOptions(
        # Detection model, given by architecture name
        det_arch="db_mobilenet_v3_large",
        # Recognition model, here a Hugging Face Hub model ID
        reco_arch="Felix92/onnxtr-parseq-multilingual-v1",
        auto_correct_orientation=False,
    )

    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options,
    )
    # External plugins must be enabled explicitly
    pipeline_options.allow_external_plugins = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            ),
        },
    )

    conversion_result: ConversionResult = converter.convert(source=source)
    doc = conversion_result.document
    md = doc.export_to_markdown()
    print(md)


if __name__ == "__main__":
    main()
It is also possible to load the models from local files instead of using the Hugging Face Hub or downloading them from the repo:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import (
    ConversionResult,
    DocumentConverter,
    InputFormat,
    PdfFormatOption,
)
from docling_ocr_onnxtr import OnnxtrOcrOptions
from onnxtr.models import db_mobilenet_v3_large, parseq


def main():
    # Source document to convert
    source = "https://arxiv.org/pdf/2408.09869v4"

    # Build the models from local ONNX files; adjust the paths to your setup
    det_model = db_mobilenet_v3_large("/home/felix/.cache/onnxtr/models/db_mobilenet_v3_large-1866973f.onnx")
    reco_model = parseq("/home/felix/.cache/onnxtr/models/parseq-00b40714.onnx")

    ocr_options = OnnxtrOcrOptions(
        # Pass the model instances instead of architecture names
        det_arch=det_model,
        reco_arch=reco_model,
        auto_correct_orientation=False,
    )

    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options,
    )
    # External plugins must be enabled explicitly
    pipeline_options.allow_external_plugins = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            ),
        },
    )

    conversion_result: ConversionResult = converter.convert(source=source)
    doc = conversion_result.document
    md = doc.export_to_markdown()
    print(md)


if __name__ == "__main__":
    main()
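When loading from local files, it can help to confirm that each ONNX file exists before building the models, so that a wrong path fails early with a clear result. The following is a minimal sketch using only the standard library. The helper name `local_model_path` is hypothetical and not part of OnnxTR or the plugin.

```python
from pathlib import Path
from typing import Optional


def local_model_path(path: str) -> Optional[Path]:
    """Return the path if it points to an existing .onnx file, else None."""
    candidate = Path(path).expanduser()
    if candidate.is_file() and candidate.suffix == ".onnx":
        return candidate
    return None


if __name__ == "__main__":
    # Example path taken from the snippet above; it will differ on your machine.
    found = local_model_path("/home/felix/.cache/onnxtr/models/parseq-00b40714.onnx")
    print("local model found" if found else "no local model, use an architecture name instead")
```

If the helper returns a path, it can be passed as a string to the model factory as in the example above. Otherwise, one option is to fall back to passing an architecture name to OnnxtrOcrOptions.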
Configuration
The configuration of the OCR engine is done via the OnnxtrOcrOptions class. The following options are available:
lang: List of languages to use for OCR. Default is ["en", "fr"].
confidence_score: Word confidence threshold for the recognition model. Default is 0.5.
objectness_score: Detection model objectness score threshold. Default is 0.3.
det_arch: Detection model architecture. Default is "fast_base".
reco_arch: Recognition model architecture. Default is "crnn_vgg16_bn".
reco_bs: Batch size for the recognition model. Default is 512.
auto_correct_orientation: Whether to auto-correct the orientation of the pages. Default is False.
preserve_aspect_ratio: Whether to preserve the aspect ratio of the images. Default is True.
symmetric_pad: Whether to use symmetric padding. Default is True.
paragraph_break: Paragraph break threshold. Default is 0.035.
load_in_8_bit: Whether to load the model in 8-bit. Default is False. (Not yet supported for models loaded from Hugging Face.)
providers: List of execution providers to use for ONNX Runtime. Default is None, which means auto-select.
session_options: Session options for ONNX Runtime. Default is None, which means the default OnnxTR session options are used.
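As an illustration of the providers option, the sketch below orders ONNX Runtime execution providers by preference and keeps only those that are available. The provider names are the standard ONNX Runtime identifiers. The helper `choose_providers` and the preference order are assumptions made for this example, and passing a plain list of names to OnnxtrOcrOptions is inferred from the option description above.

```python
from typing import List, Sequence

# Assumed preference order for illustration: GPU first, then OpenVINO, then CPU.
PREFERRED = (
    "CUDAExecutionProvider",
    "OpenVINOExecutionProvider",
    "CPUExecutionProvider",
)


def choose_providers(
    available: Sequence[str], preferred: Sequence[str] = PREFERRED
) -> List[str]:
    """Return the preferred providers that are available, in preference order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise ValueError(f"none of {list(preferred)} found in {list(available)}")
    return chosen


# Possible usage (requires onnxruntime and the plugin to be installed):
#   import onnxruntime as ort
#   providers = choose_providers(ort.get_available_providers())
#   ocr_options = OnnxtrOcrOptions(providers=providers)
```

Leaving providers at None keeps the auto-select behaviour described above, so an explicit list is only needed when the default choice is not the one you want.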
Available models can be found on Hugging Face.
Further information:
Please take a look at OnnxTR.
Contributing
Contributions are welcome!
Before opening a pull request, please ensure that your code passes the tests and adheres to the project's coding standards.
You can run the tests and checks using:
make style
make quality
make test
License
Distributed under the Apache 2.0 License. See LICENSE for more information.