
Version v.0.43 includes a significant redesign of the Analyzer's default configuration.
deepdoctection is a Python library that orchestrates layout analysis and extraction for scanned and PDF documents, for example as input to RAG pipelines. It also provides a framework for training, evaluating, and running inference with Document AI models.
Check the demo of a document layout analysis pipeline with OCR on 🤗 Hugging Face spaces.
Have a look at the introduction notebook for an easy start.
Check the release notes for recent updates.
import deepdoctection as dd
from IPython.display import HTML
from matplotlib import pyplot as plt

analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer, similar to the Hugging Face Space demo
df = analyzer.analyze(path="/path/to/your/doc.pdf")  # set up the pipeline
df.reset_state()  # trigger some initialization
doc = iter(df)
page = next(doc)  # first page of the document
image = page.viz(show_figures=True, show_residual_layouts=True)
plt.figure(figsize=(25, 17))
plt.axis('off')
plt.imshow(image)
HTML(page.tables[0].html)
print(page.text)
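The snippet above inspects only the first page. As a minimal sketch, the same calls (`analyze`, `reset_state`, `page.text`, `page.tables`, `.html`) can be wrapped in a helper that walks every page. The helper name `collect_pages` and the dictionary layout are illustrative assumptions, not part of deepdoctection.

```python
from typing import Any, Dict, List


def collect_pages(analyzer: Any, path: str) -> List[Dict[str, Any]]:
    """Run the analyzer on `path` and gather text and table HTML per page.

    `analyzer` is assumed to behave like the object returned by
    dd.get_dd_analyzer() in the example above.
    """
    df = analyzer.analyze(path=path)  # set up the pipeline for this document
    df.reset_state()                  # initialize before iterating
    pages = []
    for page in df:
        pages.append(
            {
                "text": page.text,
                "tables_html": [table.html for table in page.tables],
            }
        )
    return pages
```

With deepdoctection installed, `collect_pages(dd.get_dd_analyzer(), "/path/to/your/doc.pdf")` should yield one dictionary per page. Collecting plain values rather than page objects keeps the result easy to serialize.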
| Task | PyTorch | TorchScript | TensorFlow |
|---|---|---|---|
| Layout detection via Detectron2/Tensorpack | ✅ | ✅ (CPU only) | ✅ (GPU only) |
| Table recognition via Detectron2/Tensorpack | ✅ | ✅ (CPU only) | ✅ (GPU only) |
| Table transformer via Transformers | ✅ | ❌ | ❌ |
| Deformable-Detr | ✅ | ❌ | ❌ |
| DocTr | ✅ | ❌ | ✅ |
| LayoutLM (v1, v2, v3, XLM) via Transformers | ✅ | ❌ | ❌ |
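To make the table easier to act on, the following sketch encodes its PyTorch and TensorFlow columns as a lookup and reports which tasks are covered by the frameworks found in the current environment. The function names and the mapping of framework names to the import names `torch` and `tensorflow` are assumptions made for illustration. The CPU-only and GPU-only restrictions from the table are not modeled.

```python
from importlib.util import find_spec

# Support matrix transcribed from the table above (TorchScript column omitted).
SUPPORT = {
    "Layout detection via Detectron2/Tensorpack": {"PyTorch", "TensorFlow"},
    "Table recognition via Detectron2/Tensorpack": {"PyTorch", "TensorFlow"},
    "Table transformer via Transformers": {"PyTorch"},
    "Deformable-Detr": {"PyTorch"},
    "DocTr": {"PyTorch", "TensorFlow"},
    "LayoutLM (v1, v2, v3, XLM) via Transformers": {"PyTorch"},
}

# Assumed import names used to detect each framework.
FRAMEWORK_MODULES = {"PyTorch": "torch", "TensorFlow": "tensorflow"}


def installed_frameworks():
    """Return the names of the frameworks importable in this environment."""
    return {
        name
        for name, module in FRAMEWORK_MODULES.items()
        if find_spec(module) is not None
    }


def available_tasks(frameworks):
    """List the tasks supported by at least one of the given frameworks."""
    frameworks = set(frameworks)
    return [task for task, supported in SUPPORT.items() if supported & frameworks]


if __name__ == "__main__":
    for task in available_tasks(installed_frameworks()):
        print(task)
```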
We recommend using a virtual environment.
For a minimal setup that is sufficient to parse documents with the default settings, install the following:
PyTorch
pip install transformers
pip install python-doctr
pip install deepdoctection
TensorFlow
pip install tensorpack
pip install python-doctr
pip install deepdoctection
Both setups are sufficient to run the introduction notebook.
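Before opening the notebook, it can help to confirm that the packages of the chosen minimal setup are importable. The sketch below is one way to do that; the function name `missing_packages` and the `is_installed` hook are hypothetical, and the pip-to-import-name mapping (for example `python-doctr` being imported as `doctr`) is stated as an assumption in the code.

```python
from importlib.util import find_spec

# pip package name -> import name, for the two minimal setups listed above.
# The import names are assumptions about how each package is exposed.
MINIMAL_SETUPS = {
    "PyTorch": {
        "transformers": "transformers",
        "python-doctr": "doctr",
        "deepdoctection": "deepdoctection",
    },
    "TensorFlow": {
        "tensorpack": "tensorpack",
        "python-doctr": "doctr",
        "deepdoctection": "deepdoctection",
    },
}


def _importable(module):
    """True if `module` can be located without importing it."""
    return find_spec(module) is not None


def missing_packages(setup, is_installed=_importable):
    """Return the pip names from `setup` whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in MINIMAL_SETUPS[setup].items()
        if not is_installed(module)
    ]


if __name__ == "__main__":
    for setup in MINIMAL_SETUPS:
        print(setup, "missing:", missing_packages(setup))
```

An empty list for a setup suggests that its packages are in place. The check only locates modules and does not verify versions.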
The following installation gives you all models available for the chosen deep learning framework, as well as all models that are independent of TensorFlow/PyTorch.
PyTorch
First install Detectron2 separately, as it is not distributed via PyPI. Check the instructions here or try:
pip install detectron2@git+https://github.com/deepdoctection/detectron2.git
Then install deepdoctection with all its dependencies:
pip install deepdoctection[pt]
TensorFlow
pip install deepdoctection[tf]
For further information, please consult the full installation instructions.
Download the repository or clone via
git clone https://github.com/deepdoctection/deepdoctection.git
PyTorch
cd deepdoctection
pip install ".[pt]" # or "pip install -e .[pt]"
TensorFlow
cd deepdoctection
pip install ".[tf]" # or "pip install -e .[tf]"
Pre-built Docker images can be downloaded from Docker Hub.
docker pull deepdoctection/deepdoctection:<release_tag>
Use the Docker Compose file ./docker/pytorch-gpu/docker-compose.yaml. In the .env file provided, specify the host directory where deepdoctection's cache should be stored. Additionally, specify a working directory to mount files to be processed into the container.
Running
docker compose up -d
will start the container. Note that no endpoint is exposed.
We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible to develop this framework.
...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.
Distributed under the Apache 2.0 License. Check LICENSE for additional information.