DocLayout-YOLO: Advancing Document Layout Analysis with Mesh-candidate Bestfit and Global-to-local perception
Official PyTorch implementation of DocLayout-YOLO.
Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He
Abstract
We introduce DocLayout-YOLO, which not only enhances accuracy but also preserves the speed advantage through document-tailored optimizations in both pre-training and model design. For robust document pre-training, we innovatively frame document synthesis as a 2D bin-packing problem and introduce Mesh-candidate Bestfit, which enables the generation of large-scale, diverse document datasets. The model, pre-trained on the resulting DocSynth300K dataset, significantly improves fine-tuning performance across a variety of document types. For model enhancement, we propose a Global-to-local Controllable Receptive Module which emulates the human visual process from global to local perspectives and features a controllable module for feature extraction and integration. Experimental results on extensive downstream datasets show that the proposed DocLayout-YOLO excels in both speed and accuracy.
Quick Start
1. Environment Setup
To set up your environment, follow these steps:
conda create -n doclayout_yolo python=3.10
conda activate doclayout_yolo
pip install -e .
Note: If you only need the package for inference, you can simply install it via pip:
pip install doclayout-yolo
2. Prediction
You can perform predictions using either a script or the SDK:
- Script
Run the following command to make a prediction using the script:
python demo.py --model path/to/model --image-path path/to/image
- SDK
Here is an example of how to use the SDK for prediction:
import cv2
from doclayout_yolo import YOLOv10

# Load the fine-tuned model
model = YOLOv10("path/to/provided/model")

# Run layout detection on a single image
det_res = model.predict(
    "path/to/image",
    imgsz=1024,      # prediction image size
    conf=0.2,        # confidence threshold
    device="cuda:0"  # device to use (e.g. "cuda:0" or "cpu")
)

# Draw the detections and save the annotated image
annotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)
cv2.imwrite("result.jpg", annotated_frame)
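Beyond saving an annotated image, you can also read the raw detections from the returned results. A minimal sketch, assuming the standard ultralytics-style Results interface that DocLayout-YOLO inherits (the attribute names below come from ultralytics and may differ across versions):

# Minimal sketch: inspecting raw detections from det_res above.
# Assumes the ultralytics-style Results interface (boxes.xyxy, boxes.conf,
# boxes.cls, names).
result = det_res[0]
for box, conf, cls in zip(result.boxes.xyxy, result.boxes.conf, result.boxes.cls):
    x1, y1, x2, y2 = box.tolist()
    label = result.names[int(cls)]
    print(f"{label}: conf={float(conf):.2f} bbox=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")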
We provide a model fine-tuned on DocStructBench for prediction, which is capable of handling various document types. The model can be downloaded from here, and example images can be found under assets/example.
You can also use predict_single.py for prediction with custom inference settings. For batch processing, please refer to PDF-Extract-Kit.
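If you only need a lightweight loop over a folder of images rather than the full PDF-Extract-Kit pipeline, a minimal batch sketch using the same SDK (the model path, input folder, and output naming are placeholders):

# Minimal sketch: batch prediction over a folder of images using the SDK.
# The model path and image folder are placeholders.
from pathlib import Path

import cv2
from doclayout_yolo import YOLOv10

model = YOLOv10("path/to/provided/model")
for image_path in sorted(Path("path/to/images").glob("*.jpg")):
    det_res = model.predict(str(image_path), imgsz=1024, conf=0.2, device="cuda:0")
    annotated = det_res[0].plot(pil=True, line_width=5, font_size=20)
    cv2.imwrite(f"{image_path.stem}_result.jpg", annotated)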
Training and Evaluation on Public DLA Datasets
Data Preparation
- Specify the data root path: find your ultralytics config file (for Linux users, at $HOME/.config/Ultralytics/settings.yaml) and change datasets_dir to the project root path. A scripted sketch of this step is given after the file tree below.
- Download the prepared YOLO-format D4LA and DocLayNet data from below and put it under ./layout_data. The file structure is as follows:
./layout_data
├── D4LA
│ ├── images
│ ├── labels
│ ├── test.txt
│ └── train.txt
└── doclaynet
├── images
├── labels
├── val.txt
└── train.txt
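For reference, here is a scripted version of the datasets_dir step above. A minimal sketch, assuming PyYAML is installed and the settings file sits at the default Linux location (adjust the path on other platforms, or simply edit the file by hand):

# Minimal sketch: point ultralytics' datasets_dir at the project root.
# Assumes PyYAML and the default Linux settings location.
from pathlib import Path

import yaml

settings_path = Path.home() / ".config" / "Ultralytics" / "settings.yaml"
settings = yaml.safe_load(settings_path.read_text())
settings["datasets_dir"] = str(Path.cwd())  # project root containing ./layout_data
settings_path.write_text(yaml.safe_dump(settings))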
Training and Evaluation
Training is conducted on 8 GPUs with a global batch size of 64 (8 images per device); detailed settings and checkpoints are as follows:
The DocSynth300K pretrained model can be downloaded from here. During evaluation, change checkpoint.pt to the path of the model you want to evaluate.
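If you prefer to drive fine-tuning and evaluation from the SDK, a minimal sketch follows, assuming the ultralytics-style train()/val() API that doclayout_yolo inherits. The checkpoint filename, dataset config name, epoch count, and image size are placeholders; the batch size and device list match the 8-GPU, global-batch-64 setting described above:

# Minimal sketch: fine-tuning and evaluation via the SDK.
# Assumes the ultralytics-style train()/val() API; the checkpoint name,
# dataset config, epochs, and imgsz are placeholders, not official settings.
from doclayout_yolo import YOLOv10

model = YOLOv10("path/to/docsynth300k_pretrained.pt")  # hypothetical filename
model.train(
    data="doclaynet.yaml",     # hypothetical dataset config
    epochs=100,                # placeholder
    imgsz=1120,                # placeholder
    batch=64,                  # global batch size of 64
    device="0,1,2,3,4,5,6,7",  # 8 GPUs, 8 images per device
)

# Evaluate a trained checkpoint (replace checkpoint.pt with your model path)
metrics = YOLOv10("checkpoint.pt").val(data="doclaynet.yaml", batch=64)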
Acknowledgement
The codebase is built on ultralytics and YOLO-v10.
Thanks for their great work!