# DocLayout-YOLO: Advancing Document Layout Analysis with Mesh-candidate Bestfit and Global-to-local perception
Official PyTorch implementation of **DocLayout-YOLO**.
Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He
<details>
<summary>
<font size="+1">Abstract</font>
</summary>
We introduce DocLayout-YOLO, which not only enhances accuracy but also preserves the speed advantage through optimization from pre-training and model perspectives in a document-tailored manner. In terms of robust document pre-training, we innovatively regard document synthesis as a 2D bin packing problem and introduce Mesh-candidate Bestfit, which enables the generation of large-scale, diverse document datasets. The model, pre-trained on the resulting DocSynth300K dataset, significantly enhances fine-tuning performance across a variety of document types. In terms of model enhancement for document understanding, we propose a Global-to-local Controllable Receptive Module which emulates the human visual process from global to local perspectives and features a controllable module for feature extraction and integration. Experimental results on extensive downstream datasets show that the proposed DocLayout-YOLO excels at both speed and accuracy.
</details>
<p align="center">
<img src="assets/comp.png" width=52%>
<img src="assets/radar.png" width=44%> <br>
</p>
## Quick Start
### 1. Environment Setup
To set up your environment, follow these steps:
```bash
conda create -n doclayout_yolo python=3.10
conda activate doclayout_yolo
pip install -e .
```
**Note:** If you only need the package for inference, you can simply install it via pip:
```bash
pip install doclayout-yolo
```
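Either way, a quick import serves as a sanity check that the installation succeeded:
```python
# Quick sanity check: the main entry point should import without errors.
from doclayout_yolo import YOLOv10

print(YOLOv10)
```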
### 2. Prediction
You can perform predictions using either a script or the SDK:
- **Script**
Run the following command to make a prediction using the script:
```bash
python demo.py --model path/to/model --image-path path/to/image
```
- **SDK**
Here is an example of how to use the SDK for prediction:
```python
import cv2
from doclayout_yolo import YOLOv10
# Load the pre-trained model
model = YOLOv10("path/to/provided/model")
# Perform prediction
det_res = model.predict(
"path/to/image", # Image to predict
imgsz=1024, # Prediction image size
conf=0.2, # Confidence threshold
device="cuda:0" # Device to use (e.g., 'cuda:0' or 'cpu')
)
# Annotate and save the result
annotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)
cv2.imwrite("result.jpg", annotated_frame)
```
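Beyond the annotated image, you can read the raw detections from the returned results. This is a minimal sketch assuming the ultralytics-style ```Results``` API (```boxes.xyxy```, ```boxes.conf```, ```boxes.cls```) that this codebase builds on:
```python
# A minimal sketch of reading raw detections, assuming the ultralytics-style
# Results API (boxes.xyxy / boxes.conf / boxes.cls) that this codebase builds on.
boxes = det_res[0].boxes
for xyxy, conf, cls in zip(boxes.xyxy, boxes.conf, boxes.cls):
    name = det_res[0].names[int(cls)]  # class id -> layout category name
    x1, y1, x2, y2 = (float(v) for v in xyxy)
    print(f"{name}: conf={float(conf):.2f} box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```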
We provide a model fine-tuned on **DocStructBench** for prediction, **which is capable of handling various document types**. The model can be downloaded from [here](https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench/tree/main) and example images can be found under ```assets/example```.
<p align="center">
<img src="assets/showcase.png" width=100%> <br>
</p>
You can also use ```predict_single.py``` for prediction with custom inference settings. For batch processing, please refer to [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit/tree/main).
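For a quick folder-level loop before reaching for a full pipeline, here is a hedged sketch that annotates every image in a directory; the directory names are illustrative, not part of the package.
```python
# A hedged sketch of simple folder-level batch prediction; directory names are
# illustrative. For production-scale pipelines, see PDF-Extract-Kit above.
from pathlib import Path

import cv2
from doclayout_yolo import YOLOv10

model = YOLOv10("path/to/provided/model")
out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)

for image_path in sorted(Path("assets/example").glob("*.jpg")):
    det_res = model.predict(str(image_path), imgsz=1024, conf=0.2, device="cuda:0")
    annotated = det_res[0].plot(pil=True, line_width=5, font_size=20)
    cv2.imwrite(str(out_dir / image_path.name), annotated)
```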
## Training and Evaluation on Public DLA Datasets
### Data Preparation
1. Specify the data root path

Find your Ultralytics config file (for Linux users, at ```$HOME/.config/Ultralytics/settings.yaml```) and set ```datasets_dir``` to the project root path (a scripted sketch appears after the file tree below).
2. Download the prepared YOLO-format D4LA and DocLayNet data from the links below and put them under ```./layout_data```:
| Dataset | Download |
|:--:|:--:|
| D4LA | [link](https://huggingface.co/datasets/juliozhao/doclayout-yolo-D4LA) |
| DocLayNet | [link](https://huggingface.co/datasets/juliozhao/doclayout-yolo-DocLayNet) |
The file structure is as follows:
```bash
./layout_data
├── D4LA
│ ├── images
│ ├── labels
│ ├── test.txt
│ └── train.txt
└── doclaynet
├── images
├── labels
├── val.txt
└── train.txt
```
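If you prefer not to hand-edit the settings file, here is a minimal sketch for step 1 that sets ```datasets_dir``` programmatically. It assumes PyYAML is installed, the default Linux settings location, and that you run it from the project root.
```python
# A minimal sketch for step 1 above: point Ultralytics' datasets_dir at the
# project root. Assumes PyYAML and the default Linux settings path.
from pathlib import Path

import yaml

settings_path = Path.home() / ".config" / "Ultralytics" / "settings.yaml"
settings = yaml.safe_load(settings_path.read_text())
settings["datasets_dir"] = str(Path.cwd())  # run this from the project root
settings_path.write_text(yaml.safe_dump(settings))
```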
### Training and Evaluation
Training is conducted on 8 GPUs with a global batch size of 64 (8 images per device). Detailed settings and checkpoints are as follows:
| Dataset | Model | DocSynth300K Pretrained? | imgsz | Learning rate | Finetune | Evaluation | AP50 | mAP | Checkpoint |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| D4LA | DocLayout-YOLO | ✗ | 1600 | 0.04 | [command](assets/script.sh#L5) | [command](assets/script.sh#L11) | 81.7 | 69.8 | [checkpoint](https://huggingface.co/juliozhao/DocLayout-YOLO-D4LA-from_scratch) |
| D4LA | DocLayout-YOLO | ✓ | 1600 | 0.04 | [command](assets/script.sh#L8) | [command](assets/script.sh#L11) | 82.4 | 70.3 | [checkpoint](https://huggingface.co/juliozhao/DocLayout-YOLO-D4LA-Docsynth300K_pretrained) |
| DocLayNet | DocLayout-YOLO | ✗ | 1120 | 0.02 | [command](assets/script.sh#L14) | [command](assets/script.sh#L20) | 93.0 | 77.7 | [checkpoint](https://huggingface.co/juliozhao/DocLayout-YOLO-DocLayNet-from_scratch) |
| DocLayNet | DocLayout-YOLO | ✓ | 1120 | 0.02 | [command](assets/script.sh#L17) | [command](assets/script.sh#L20) | 93.4 | 79.7 | [checkpoint](https://huggingface.co/juliozhao/DocLayout-YOLO-DocLayNet-Docsynth300K_pretrained) |
The DocSynth300K pretrained model can be downloaded from [here](https://huggingface.co/juliozhao/DocLayout-YOLO-DocSynth300K-pretrain). During evaluation, change ```checkpoint.pt``` to the path of the model to be evaluated.
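If you prefer the SDK over the linked shell commands, the table's D4LA settings can be reproduced roughly as follows. This is a hedged sketch assuming the ultralytics-style ```train()```/```val()``` API carries over to ```YOLOv10```; the dataset YAML and checkpoint filenames are illustrative.
```python
# A hedged sketch mirroring the D4LA rows of the table above; filenames are
# illustrative, not fixed names shipped with the repo.
from doclayout_yolo import YOLOv10

model = YOLOv10("docsynth300k_pretrained.pt")  # DocSynth300K-pretrained weights
model.train(
    data="d4la.yaml",          # YOLO-format dataset definition under ./layout_data
    imgsz=1600,                # image size from the table above
    lr0=0.04,                  # initial learning rate from the table above
    batch=64,                  # global batch size (8 images per device on 8 GPUs)
    device="0,1,2,3,4,5,6,7",  # 8-GPU training as described above
)
metrics = model.val(data="d4la.yaml", imgsz=1600)  # reports AP50 / mAP
```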
## Acknowledgement
The codebase is built on [ultralytics](https://github.com/ultralytics/ultralytics) and [YOLO-v10](https://github.com/THU-MIG/yolov10).
Thanks for their great work!