deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models itself, but it lets you build pipelines using well-established libraries for object detection, OCR and selected NLP tasks, and it provides an integrated framework for fine-tuning, evaluating and running models. For more specific text processing tasks, use one of the many other great NLP libraries.
deepdoctection focuses on applications and is made for those who want to solve real-world problems related to document extraction from PDFs or scans in various image formats.
Check the demo of a document layout analysis pipeline with OCR on 🤗 Hugging Face Spaces.
deepdoctection provides model wrappers of supported libraries for various tasks to be integrated into pipelines. Its core function does not depend on any specific deep learning library. Selected models for the following tasks are currently supported:
On top of that, deepdoctection provides methods for pre-processing model inputs, such as cropping or resizing, and for post-processing results, such as validating duplicate outputs, relating words to detected layout segments or ordering words into contiguous text. The output is in JSON format, which you can customize further yourself.
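To make two of the post-processing steps mentioned above more concrete (relating words to layout segments and ordering words into contiguous text), here is a minimal, library-independent sketch. It is not deepdoctection's implementation: the `Box` type, both function names, the coverage threshold and the naive reading order are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> float:
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


def assign_words_to_segments(words, segments, min_coverage=0.5):
    """Map each segment name to the words that lie mostly inside it.

    words: list of (text, Box); segments: dict of name -> Box.
    A word goes to the segment covering the largest share of its area,
    provided that share reaches min_coverage (an assumed threshold).
    """
    assigned = {name: [] for name in segments}
    for text, box in words:
        best_name, best_cov = None, 0.0
        for name, seg in segments.items():
            cov = box.intersection(seg) / box.area() if box.area() else 0.0
            if cov > best_cov:
                best_name, best_cov = name, cov
        if best_name is not None and best_cov >= min_coverage:
            assigned[best_name].append((text, box))
    return assigned


def words_to_text(words):
    """Naive reading order: sort top-to-bottom, then left-to-right."""
    ordered = sorted(words, key=lambda w: (w[1].y0, w[1].x0))
    return " ".join(text for text, _ in ordered)
```

Sorting by the top-left corner only works for cleanly aligned lines; real pages usually need line grouping with some vertical tolerance.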
Have a look at the introduction notebook in the notebook repo for an easy start.
Check the release notes for recent updates.
deepdoctection or its support libraries provide pre-trained models that are, in most cases, available at the Hugging Face Model Hub or that will be downloaded automatically once requested. For instance, you can find pre-trained object detection models from the Tensorpack or Detectron2 framework for coarse layout analysis, table cell detection and table recognition.
Training is a substantial part of getting pipelines ready for a specific domain, be it document layout analysis, document classification or NER. deepdoctection provides training scripts for models that are based on trainers developed from the library that hosts the model code. Moreover, deepdoctection hosts code for some well-established datasets like Publaynet, which makes it easy to experiment. It also contains mappings from widely used data formats like COCO, and it has a dataset framework (akin to datasets) so that setting up training on a custom dataset becomes very easy. This notebook shows you how to do this.
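As a rough illustration of what a mapping from the COCO format involves, the sketch below groups COCO annotations by image and converts each `[x, y, width, height]` box into corner coordinates. The function name `coco_to_records` and the output record layout are assumptions made for this example; they are not deepdoctection's dataset API.

```python
from collections import defaultdict


def coco_to_records(coco: dict) -> list[dict]:
    """Turn a COCO-style dict into one record per image.

    Expects the standard COCO keys "images", "annotations" and "categories".
    The output layout (file_name plus a list of labelled boxes) is a
    hypothetical format chosen for this sketch.
    """
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}

    anns_by_image = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO stores boxes as [x, y, width, height]
        anns_by_image[ann["image_id"]].append(
            {"label": id_to_name[ann["category_id"]], "box": (x, y, x + w, y + h)}
        )

    return [
        {"file_name": img["file_name"], "annotations": anns_by_image.get(img["id"], [])}
        for img in coco["images"]
    ]
```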
deepdoctection comes equipped with a framework that allows you to evaluate the predictions of one or more models in a pipeline against some ground truth. Check again here how it is done.
Once a pipeline is set up, it takes only a few lines of code to instantiate it, and a for loop then processes all pages through the pipeline.
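To give a feeling for what evaluating predictions against ground truth means for layout detection, here is a generic sketch that matches predicted boxes to ground-truth boxes by intersection over union (IoU) and reports precision and recall. It does not use or reproduce deepdoctection's evaluation framework; the function names, the greedy matching and the 0.5 threshold are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, threshold=0.5):
    """Greedily match each prediction to its best unmatched ground-truth box."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)  # each ground-truth box is matched at most once
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Benchmark metrics such as mean average precision additionally take confidence scores and several IoU thresholds into account.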
import deepdoctection as dd
from IPython.display import HTML
from matplotlib import pyplot as plt

analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer, similar to the Hugging Face space demo

df = analyzer.analyze(path="/path/to/your/doc.pdf")  # set up the pipeline
df.reset_state()  # trigger some initialization

doc = iter(df)
page = next(doc)  # run the pipeline on the first page

image = page.viz()  # visualization of the page
plt.figure(figsize=(25, 17))
plt.axis('off')
plt.imshow(image)

HTML(page.tables[0].html)  # HTML of the first table (rendered when run in a notebook)
print(page.text)  # text of the page
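The snippet above pulls a single page with `next(doc)`. A for loop over all pages could look like the sketch below. It relies only on the `text` attribute used above and assumes that `df` can be iterated directly, as the `iter(df)` call suggests. The helper name `collect_page_text` is hypothetical, and the stand-in pages exist only so that the sketch runs without deepdoctection installed.

```python
from types import SimpleNamespace


def collect_page_text(pages):
    """Return the text of every page, in the order the pipeline yields them."""
    return [page.text for page in pages]


# Stand-in pages; with deepdoctection you would pass the dataflow instead,
# e.g. collect_page_text(df) after df.reset_state().
fake_pages = [SimpleNamespace(text="first page"), SimpleNamespace(text="second page")]
texts = collect_page_text(fake_pages)
```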
Extensive documentation is available, containing tutorials, design concepts and the API. We want to present things as comprehensively and understandably as possible. However, we are aware that there are still many areas where significant improvements can be made in terms of clarity, grammar and correctness. We welcome every hint and comment that improves the quality of the documentation.
Everything in the overview listed below the deepdoctection layer is a necessary requirement and has to be installed separately.
Linux or macOS. (Windows is not supported but there is a Dockerfile available)
Python >= 3.9
PyTorch >= 1.13 or 2.11 <= Tensorflow < 2.16. (For lower Tensorflow versions the code will only run on a GPU). In general, if you want to train or fine-tune models, a GPU is required.
With respect to the Deep Learning framework, you must decide between Tensorflow and PyTorch.
The Tesseract OCR engine is used through a Python wrapper. The core engine has to be installed separately.
For release v.0.34.0 and below, deepdoctection uses Python wrappers for Poppler to convert PDF documents into images. For release v.0.35.0, this dependency will be optional.
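Before installing, it can help to check the Python version and see which Deep Learning framework is already present. The following is a small standard-library sketch. The helper names are hypothetical, and it only tests whether the `torch` or `tensorflow` packages can be found, not whether their versions fall into the ranges given above.

```python
import sys
from importlib.util import find_spec


def python_ok(version_info=sys.version_info, minimum=(3, 9)):
    """True if the interpreter meets the minimum Python version (3.9 per the list above)."""
    return tuple(version_info[:2]) >= minimum


def available_frameworks(candidates=("torch", "tensorflow")):
    """Return the candidate packages that are importable in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    print("Python version ok:", python_ok())
    print("Frameworks found:", available_frameworks() or "none")
```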
The following overview shows the availability of the models in conjunction with the DL framework.
| Task | PyTorch | Torchscript | Tensorflow |
|---|---|---|---|
| Layout detection via Detectron2/Tensorpack | ✅ | ✅ (CPU only) | ✅ (GPU only) |
| Table recognition via Detectron2/Tensorpack | ✅ | ✅ (CPU only) | ✅ (GPU only) |
| Table transformer via Transformers | ✅ | ❌ | ❌ |
| DocTr | ✅ | ❌ | ✅ |
| LayoutLM (v1, v2, v3, XLM) via Transformers | ✅ | ❌ | ❌ |
We recommend using a virtual environment. You can install the package via pip or from source.
If you want to get started with a minimal setting (e.g. running the deepdoctection analyzer with default configuration or trying the 'Get started notebook'), install deepdoctection with
pip install deepdoctection
If you want to use the Tensorflow framework, please install Tensorpack separately. Detectron2 will not be installed, and layout models and table recognition models will run with Torchscript on a CPU.
The following installation will give you ALL models available within the Deep Learning framework as well as all models that are independent of Tensorflow/PyTorch. Please note that the dependencies are very complex. We try hard to keep the requirements up to date, though.
For Tensorflow, run
pip install deepdoctection[tf]
For PyTorch, first install Detectron2 separately, as it is not distributed via PyPI. Check the instructions here. Then run
pip install deepdoctection[pt]
This will install deepdoctection with all dependencies listed above the deepdoctection layer. Use this setup if you want to get started or want to explore all features.
If you want more control over your installation and are looking for fewer dependencies, install deepdoctection with the basic setup only.
pip install deepdoctection
This will ignore all model libraries (layers above the deepdoctection layer in the diagram), and you will be responsible for installing them yourself. Note that you will not be able to run any pipeline with this setup.
For further information, please consult the full installation instructions.
Download the repository or clone via
git clone https://github.com/deepdoctection/deepdoctection.git
To get started with Tensorflow, run:
cd deepdoctection
pip install ".[tf]"
Installing the full PyTorch setup from source will also install Detectron2 for you:
cd deepdoctection
pip install ".[source-pt]"
Starting from release v.0.27.0, pre-existing Docker images can be downloaded from the Docker Hub.
docker pull deepdoctection/deepdoctection:<release_tag>
To start the container, you can use the Docker compose file ./docker/pytorch-gpu/docker-compose.yaml.
In the .env file provided, specify the host directory where deepdoctection's cache should be stored. This directory will be mounted. Additionally, specify a working directory to mount files to be processed into the container.
Running docker compose up -d will start the container.
We thank all libraries that provide high quality code and pre-trained models. Without them, it would have been impossible to develop this framework.
We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this repo and try to address them as quickly as possible. Bug fixes or enhancements will be deployed in a new release every 10 to 12 weeks.
...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.
Distributed under the Apache 2.0 License. Check LICENSE for additional information.