<p align="center">
<img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_logo.png" alt="Deep Doctection Logo" width="60%">
</p>



------------------------------------------------------------------------------------------------------------------------
# NEW
Version `v.0.43` includes a significant redesign of the Analyzer's default configuration. Key changes include:
* More powerful models for Document Layout Analysis and OCR.
* Expanded functionality.
* Fewer dependencies.
------------------------------------------------------------------------------------------------------------------------
<p align="center">
<h1 align="center">
A Package for Document Understanding
</h1>
</p>
**deep**doctection is a Python library that orchestrates layout analysis and extraction for scanned and PDF documents, e.g. for RAG.
It also provides a framework for training, evaluating and running inference with Document AI models.
# Overview
- Document layout analysis and table recognition in PyTorch with
[**Detectron2**](https://github.com/facebookresearch/detectron2/tree/main/detectron2) and
[**Transformers**](https://github.com/huggingface/transformers),
or in Tensorflow with [**Tensorpack**](https://github.com/tensorpack),
- OCR with support for [**Tesseract**](https://github.com/tesseract-ocr/tesseract), [**DocTr**](https://github.com/mindee/doctr) and
[**AWS Textract**](https://aws.amazon.com/textract/),
- Document and token classification with the [**LayoutLM**](https://github.com/microsoft/unilm) family,
[**LiLT**](https://github.com/jpWang/LiLT) and selected
[**BERT**](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)-style models, including features like sliding windows,
- Text mining for native PDFs with [**pdfplumber**](https://github.com/jsvine/pdfplumber),
- Language detection with [**fastText**](https://github.com/facebookresearch/fastText),
- Deskewing and rotating images with [**jdeskew**](https://github.com/phamquiluan/jdeskew),
- Fine-tuning and evaluation tools,
- Lots of [tutorials](https://github.com/deepdoctection/notebooks).
Have a look at the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Analyzer_Get_Started.ipynb)
for an easy start.
Check the [**release notes**](https://github.com/deepdoctection/deepdoctection/releases) for recent updates.
----------------------------------------------------------------------------------------
# Hugging Face Space Demo
Try the demo of a document layout analysis pipeline with OCR on 🤗
[**Hugging Face Spaces**](https://huggingface.co/spaces/deepdoctection/deepdoctection) or use the Gradio client.
```
pip install gradio_client # requires Python >= 3.10
```
To process a single image:
```python
from gradio_client import Client, handle_file

if __name__ == "__main__":
    client = Client("deepdoctection/deepdoctection")
    result = client.predict(
        img=handle_file('/local_path/to/dir/file_name.jpeg'),  # accepts image files, e.g. JPEG, PNG
        pdf=None,
        max_datapoints=2,
        api_name="/analyze_image"
    )
    print(result)
```
To process a PDF document:
```python
from gradio_client import Client, handle_file

if __name__ == "__main__":
    client = Client("deepdoctection/deepdoctection")
    result = client.predict(
        img=None,
        pdf=handle_file("/local_path/to/dir/your_doc.pdf"),
        max_datapoints=2,  # increase to process up to 9 pages
        api_name="/analyze_image"
    )
    print(result)
```
--------------------------------------------------------------------------------------------------------
# Example
```python
import deepdoctection as dd
from IPython.core.display import HTML
from matplotlib import pyplot as plt

analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer, similar to the Hugging Face space demo

df = analyzer.analyze(path="/path/to/your/doc.pdf")  # set up the pipeline; nothing is processed yet
df.reset_state()  # trigger initialization

doc = iter(df)
page = next(doc)  # process the first page

image = page.viz(show_figures=True, show_residual_layouts=True)
plt.figure(figsize=(25, 17))
plt.axis('off')
plt.imshow(image)
```
<p align="center">
<img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_sample.png"
alt="sample" width="40%">
</p>
```python
HTML(page.tables[0].html)
```
<p align="center">
<img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_table.png"
alt="table" width="40%">
</p>
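`page.tables[0].html` returns the table as a plain HTML string. If you need the cells as Python lists rather than rendered HTML, a minimal stdlib sketch is enough (the sample HTML below is hypothetical, not actual **deep**doctection output):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect <td>/<th> cell text, grouped into one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

parser = TableRows()
parser.feed("<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>2</td></tr></table>")
print(parser.rows)  # [['a', 'b'], ['1', '2']]
```

For real-world use you would feed `page.tables[0].html` instead of the hard-coded string.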
```python
print(page.text)
```
<p align="center">
<img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_text.png"
alt="text" width="40%">
</p>
-----------------------------------------------------------------------------------------
# Requirements

- Linux or macOS. Windows is not supported but there is a [Dockerfile](./docker/pytorch-cpu-jupyter/Dockerfile) available.
- Python >= 3.9
- 2.2 <= PyTorch **or** 2.11 <= Tensorflow < 2.16 (for lower Tensorflow versions the code will only run on a GPU).
  Tensorflow support will be dropped from Python 3.11 onwards.
- To fine-tune models, a GPU is recommended.
| Task | PyTorch | Torchscript | Tensorflow |
|---------------------------------------------|:-------:|:-------------:|:------------:|
| Layout detection via Detectron2/Tensorpack | ✅ | ✅ (CPU only) | ✅ (GPU only) |
| Table recognition via Detectron2/Tensorpack | ✅ | ✅ (CPU only) | ✅ (GPU only) |
| Table transformer via Transformers | ✅ | ❌ | ❌ |
| Deformable-Detr | ✅ | ❌ | ❌ |
| DocTr | ✅ | ❌ | ✅ |
| LayoutLM (v1, v2, v3, XLM) via Transformers | ✅ | ❌ | ❌ |
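The version constraints above can also be checked programmatically. A minimal, pure-Python sketch (the `satisfies` helper is hypothetical, not part of **deep**doctection; only major and minor components are compared):

```python
from typing import Optional, Tuple

def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn a version string like '2.11.1' into (2, 11)."""
    return tuple(int(part) for part in version.split(".")[:2])

def satisfies(installed: str, minimum: str, below: Optional[str] = None) -> bool:
    """Check that minimum <= installed, and optionally installed < below."""
    v = version_tuple(installed)
    if v < version_tuple(minimum):
        return False
    return below is None or v < version_tuple(below)

# PyTorch constraint: 2.2 <= version
print(satisfies("2.3.1", "2.2"))            # True
# Tensorflow constraint: 2.11 <= version < 2.16
print(satisfies("2.16.0", "2.11", "2.16"))  # False
```

In practice you would pass `torch.__version__` or `tf.__version__` as the `installed` argument.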
------------------------------------------------------------------------------------------
# Installation
We recommend using a virtual environment.
## Get started installation
For a basic setup that is sufficient to parse documents with the default configuration, install the following:
**PyTorch**
```
pip install transformers
pip install python-doctr==0.9.0
pip install deepdoctection
```
**TensorFlow**
```
pip install tensorpack
pip install python-doctr==0.9.0
pip install deepdoctection
```
Both setups are sufficient to run the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Get_Started.ipynb).
### Full installation
The following installation will give you ALL models available within your chosen deep learning framework, as well as all models
that are independent of Tensorflow/PyTorch.
**PyTorch**
First install **Detectron2** separately, as it is not distributed via PyPI. Check the instructions
[here](https://detectron2.readthedocs.io/en/latest/tutorials/install.html) or try:
```
pip install detectron2@git+https://github.com/deepdoctection/detectron2.git
```
Then install **deep**doctection with all its dependencies:
```
pip install "deepdoctection[pt]"
```
**Tensorflow**
```
pip install "deepdoctection[tf]"
```
For further information, please consult the [**full installation instructions**](https://deepdoctection.readthedocs.io/en/latest/install/).
## Installation from source
Download the repository or clone via
```
git clone https://github.com/deepdoctection/deepdoctection.git
```
**PyTorch**
```
cd deepdoctection
pip install ".[pt]" # or "pip install -e .[pt]"
```
**Tensorflow**
```
cd deepdoctection
pip install ".[tf]" # or "pip install -e .[tf]"
```
## Running a Docker container from Docker Hub
Pre-built Docker images can be downloaded from [Docker Hub](https://hub.docker.com/r/deepdoctection/deepdoctection).
```
docker pull deepdoctection/deepdoctection:<release_tag>
```
Use the Docker compose file `./docker/pytorch-gpu/docker-compose.yaml`.
In the `.env` file provided, specify the host directory where **deep**doctection's cache should be stored.
Additionally, specify a working directory to mount files to be processed into the container.
```
docker compose up -d
```
will start the container. There is no endpoint exposed, though.
-----------------------------------------------------------------------------------------------
# Credits
We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible
to develop this framework.
# If you like **deep**doctection ...
...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.
# License
Distributed under the Apache 2.0 License. Check [LICENSE](https://github.com/deepdoctection/deepdoctection/blob/master/LICENSE) for additional information.