# doc page extractor
English | [中文](./README_zh-CN.md)
## Introduction
doc page extractor identifies text and its layout in images and returns the result as structured data.
## Installation
```shell
pip install doc-page-extractor
```
## Using CUDA
To run on CUDA, install a CUDA-enabled build of PyTorch first: follow the [PyTorch installation guide](https://pytorch.org/get-started/locally/) and choose the install command that matches your operating system.
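As a rough sketch, the device string can be picked at runtime instead of being hard-coded. `pick_device` is a hypothetical helper, not part of doc-page-extractor. It assumes that `"cpu"` and `"cuda:0"` are the values you want to pass as `device`, and it falls back to the CPU when PyTorch is not installed or no GPU is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda:0" when PyTorch reports a usable GPU, otherwise "cpu"."""
    # Hypothetical helper: fall back to the CPU if PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda:0" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The returned string can then be passed as the `device` argument when constructing the extractor.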
## Example
```python
from PIL import Image
from doc_page_extractor import DocExtractor

# Placeholder: replace with the folder where the AI models are downloaded and installed
model_path = "/path/to/your/models"

extractor = DocExtractor(
    model_dir_path=model_path,  # Folder where the AI models are downloaded and installed
    device="cpu",  # To use CUDA, change this to device="cuda:0"
)

with Image.open("/path/to/your/image.png") as image:
    result = extractor.extract(
        image=image,
        lang="ch",  # Language of the text in the image
    )

# Print the rect and text of every fragment in every layout
for layout in result.layouts:
    for fragment in layout.fragments:
        print(fragment.rect, fragment.text)
```
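The nested loop above can also be used to collect the recognized text into a single string. The sketch below is one possible way to do that. `layouts_to_text` is a hypothetical helper, not part of the library, and it relies only on the `layouts`, `fragments`, and `text` attributes shown in the example. The `SimpleNamespace` objects are stand-ins so the snippet runs without downloading any model. It also assumes that the order in which layouts and fragments are returned is the order you want in the output.

```python
from types import SimpleNamespace


def layouts_to_text(result) -> str:
    """Join fragment texts: one line per fragment, a blank line between layouts."""
    blocks = []
    for layout in result.layouts:
        lines = [fragment.text for fragment in layout.fragments]
        if lines:  # skip layouts that contain no text fragments
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Stand-in for the object returned by extractor.extract(...); illustrative data only.
demo = SimpleNamespace(
    layouts=[
        SimpleNamespace(fragments=[SimpleNamespace(text="Title")]),
        SimpleNamespace(fragments=[]),
        SimpleNamespace(
            fragments=[
                SimpleNamespace(text="First line"),
                SimpleNamespace(text="Second line"),
            ]
        ),
    ]
)

print(layouts_to_text(demo))
```

With a real extraction result, `layouts_to_text(result)` would be called in place of the stand-in `demo` object.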
## Acknowledgements
The code in `doc_page_extractor/onnxocr` in this repo comes from [OnnxOCR](https://github.com/jingsongliujing/OnnxOCR).
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [OnnxOCR](https://github.com/jingsongliujing/OnnxOCR)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)
## Raw data
{
"_id": null,
"home_page": "https://github.com/Moskize91/doc-page-extractor",
"name": "doc-page-extractor",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Tao Zeyu",
"author_email": "i@taozeyu.com",
"download_url": "https://files.pythonhosted.org/packages/a3/71/9eb58bb801f0ef03a71b7050aee03e13d6cae01886421f888b5ba473d5cc/doc_page_extractor-0.0.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "doc page extractor can identify text and format in images and return structured data.",
"version": "0.0.6",
"project_urls": {
"Homepage": "https://github.com/Moskize91/doc-page-extractor"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "0c6182b0b3e229aa953a7201301d14974c807193e103b7ea4d485a630fa857e9",
"md5": "1332cfae03ac30322c1a5f45446a5b42",
"sha256": "39b75c3ee3afd20b431ee4c130882a9260a22d7cb1702dbd11bd5d837cac7ed9"
},
"downloads": -1,
"filename": "doc_page_extractor-0.0.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1332cfae03ac30322c1a5f45446a5b42",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 50619,
"upload_time": "2025-03-12T02:31:21",
"upload_time_iso_8601": "2025-03-12T02:31:21.911949Z",
"url": "https://files.pythonhosted.org/packages/0c/61/82b0b3e229aa953a7201301d14974c807193e103b7ea4d485a630fa857e9/doc_page_extractor-0.0.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "a3719eb58bb801f0ef03a71b7050aee03e13d6cae01886421f888b5ba473d5cc",
"md5": "decf4b6ff1615316452ffb47a8e81cd7",
"sha256": "0fd4a9192fbec1c2eef0fd1213122d13c3536133d3cd0936814bcbef318cec70"
},
"downloads": -1,
"filename": "doc_page_extractor-0.0.6.tar.gz",
"has_sig": false,
"md5_digest": "decf4b6ff1615316452ffb47a8e81cd7",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 43183,
"upload_time": "2025-03-12T02:31:23",
"upload_time_iso_8601": "2025-03-12T02:31:23.287308Z",
"url": "https://files.pythonhosted.org/packages/a3/71/9eb58bb801f0ef03a71b7050aee03e13d6cae01886421f888b5ba473d5cc/doc_page_extractor-0.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-03-12 02:31:23",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Moskize91",
"github_project": "doc-page-extractor",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "opencv-python",
"specs": [
[
"==",
"4.11.0.86"
]
]
},
{
"name": "pyclipper",
"specs": [
[
"==",
"1.3.0.post6"
]
]
},
{
"name": "onnxruntime",
"specs": [
[
"==",
"1.21.0"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"1.26.4"
]
]
},
{
"name": "pillow",
"specs": [
[
"==",
"10.3.0"
]
]
},
{
"name": "shapely",
"specs": [
[
"==",
"2.0.6"
]
]
},
{
"name": "transformers",
"specs": [
[
"==",
"4.48.2"
]
]
},
{
"name": "doclayout_yolo",
"specs": [
[
"==",
"0.0.3"
]
]
}
],
"lcname": "doc-page-extractor"
}