doc-page-extractor

Name	doc-page-extractor
Version	0.0.6
home_page	https://github.com/Moskize91/doc-page-extractor
Summary	doc page extractor can identify text and format in images and return structured data.
upload_time	2025-03-12 02:31:23
maintainer	None
docs_url	None
author	Tao Zeyu
requires_python	None
license	None
keywords
VCS
bugtrack_url
requirements	opencv-python pyclipper onnxruntime numpy pillow shapely transformers doclayout_yolo
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# doc page extractor

English | [中文](./README_zh-CN.md)

## Introduction

doc page extractor identifies text and formatting in images and returns the result as structured data.

## Installation

```shell
pip install doc-page-extractor
```

## Using CUDA

To run on CUDA, follow the [PyTorch](https://pytorch.org/get-started/locally/) installation guide and choose the install command that matches your operating system.
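
Once PyTorch is installed, the `device` string can be chosen at runtime rather than hard-coded. The following is only a sketch: `pick_device` is a hypothetical helper that is not part of doc-page-extractor, and it assumes the `"cpu"` and `"cuda:0"` values used in the example in this README are the ones you want to pass.

```python
def pick_device() -> str:
    """Return "cuda:0" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is the CUDA backend in use
    except ImportError:
        # PyTorch is not installed, so fall back to the CPU
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string can then be passed as the `device` argument when constructing `DocExtractor`.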

## Example

```python
from PIL import Image
from doc_page_extractor import DocExtractor

# Placeholder: folder where the AI models are downloaded and installed
model_path = "/path/to/model/dir"

extractor = DocExtractor(
  model_dir_path=model_path,
  device="cpu",  # To use CUDA, change this to device="cuda:0"
)
with Image.open("/path/to/your/image.png") as image:
  result = extractor.extract(
    image=image,
    lang="ch",  # Language of the text in the image
  )

# Print the rectangle and text of every fragment in every layout
for layout in result.layouts:
  for fragment in layout.fragments:
    print(fragment.rect, fragment.text)
```
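
If only the text is needed, the nested loop from the example can be wrapped in a small helper. This is a minimal sketch, not part of the library: `page_text` is a hypothetical function, it relies only on the `layouts`, `fragments` and `text` attributes shown in the example, and it assumes `fragment.text` is a string. The stand-in objects built with `SimpleNamespace` exist only so that the snippet runs without downloading any models.

```python
from types import SimpleNamespace


def page_text(result, separator="\n"):
    """Join the non-empty fragment texts in the order the result lists them."""
    lines = []
    for layout in result.layouts:
        for fragment in layout.fragments:
            text = fragment.text.strip()
            if text:  # skip fragments with no text
                lines.append(text)
    return separator.join(lines)


# Stand-in for the object returned by extractor.extract(...)
fake_result = SimpleNamespace(layouts=[
    SimpleNamespace(fragments=[SimpleNamespace(rect=None, text="Title")]),
    SimpleNamespace(fragments=[
        SimpleNamespace(rect=None, text="First line "),
        SimpleNamespace(rect=None, text=""),
        SimpleNamespace(rect=None, text="Second line"),
    ]),
])

print(page_text(fake_result))
```

With a real extraction, `page_text(result)` would be called on the value returned by `extractor.extract(...)` instead of the stand-in.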

## Acknowledgements

The code in `doc_page_extractor/onnxocr` in this repository comes from [OnnxOCR](https://github.com/jingsongliujing/OnnxOCR).

- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [OnnxOCR](https://github.com/jingsongliujing/OnnxOCR)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)

Raw data

{
    "_id": null,
    "home_page": "https://github.com/Moskize91/doc-page-extractor",
    "name": "doc-page-extractor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "Tao Zeyu",
    "author_email": "i@taozeyu.com",
    "download_url": "https://files.pythonhosted.org/packages/a3/71/9eb58bb801f0ef03a71b7050aee03e13d6cae01886421f888b5ba473d5cc/doc_page_extractor-0.0.6.tar.gz",
    "platform": null,
    "description": "# doc page extractor\n\nEnglish | [\u4e2d\u6587](./README_zh-CN.md)\n\n## Introduction\n\ndoc page extractor can identify text and format in images and return structured data.\n\n## Installation\n\n```shell\npip install doc-page-extractor\n```\n\n## Using CUDA\n\nPlease refer to the introduction of [PyTorch](https://pytorch.org/get-started/locally/) and select the appropriate command to install according to your operating system.\n\n## Example\n\n```python\nfrom PIL import Image\nfrom doc_page_extractor import DocExtractor\n\nextractor = DocExtractor(\n  model_dir_path=model_path, # Folder address where AI model is downloaded and installed\n  device=\"cpu\", # If you want to use CUDA, please change to device=\"cuda:0\".\n)\nwith Image.open(\"/path/to/your/image.png\") as image:\n  result = extractor.extract(\n  image=image,\n  lang=\"ch\", # Language of image text\n)\nfor layout in result.layouts:\n  for fragment in layout.fragments:\n    print(fragment.rect, fragment.text)\n```\n\n## Acknowledgements\n\nThe code of `doc_page_extractor/onnxocr` in this repo comes from [OnnxOCR](https://github.com/jingsongliujing/OnnxOCR).\n\n- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)\n- [OnnxOCR](https://github.com/jingsongliujing/OnnxOCR)\n- [layoutreader](https://github.com/ppaanngggg/layoutreader)\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "doc page extractor can identify text and format in images and return structured data.",
    "version": "0.0.6",
    "project_urls": {
        "Homepage": "https://github.com/Moskize91/doc-page-extractor"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0c6182b0b3e229aa953a7201301d14974c807193e103b7ea4d485a630fa857e9",
                "md5": "1332cfae03ac30322c1a5f45446a5b42",
                "sha256": "39b75c3ee3afd20b431ee4c130882a9260a22d7cb1702dbd11bd5d837cac7ed9"
            },
            "downloads": -1,
            "filename": "doc_page_extractor-0.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1332cfae03ac30322c1a5f45446a5b42",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 50619,
            "upload_time": "2025-03-12T02:31:21",
            "upload_time_iso_8601": "2025-03-12T02:31:21.911949Z",
            "url": "https://files.pythonhosted.org/packages/0c/61/82b0b3e229aa953a7201301d14974c807193e103b7ea4d485a630fa857e9/doc_page_extractor-0.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a3719eb58bb801f0ef03a71b7050aee03e13d6cae01886421f888b5ba473d5cc",
                "md5": "decf4b6ff1615316452ffb47a8e81cd7",
                "sha256": "0fd4a9192fbec1c2eef0fd1213122d13c3536133d3cd0936814bcbef318cec70"
            },
            "downloads": -1,
            "filename": "doc_page_extractor-0.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "decf4b6ff1615316452ffb47a8e81cd7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 43183,
            "upload_time": "2025-03-12T02:31:23",
            "upload_time_iso_8601": "2025-03-12T02:31:23.287308Z",
            "url": "https://files.pythonhosted.org/packages/a3/71/9eb58bb801f0ef03a71b7050aee03e13d6cae01886421f888b5ba473d5cc/doc_page_extractor-0.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-12 02:31:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Moskize91",
    "github_project": "doc-page-extractor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "opencv-python",
            "specs": [
                [
                    "==",
                    "4.11.0.86"
                ]
            ]
        },
        {
            "name": "pyclipper",
            "specs": [
                [
                    "==",
                    "1.3.0.post6"
                ]
            ]
        },
        {
            "name": "onnxruntime",
            "specs": [
                [
                    "==",
                    "1.21.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "1.26.4"
                ]
            ]
        },
        {
            "name": "pillow",
            "specs": [
                [
                    "==",
                    "10.3.0"
                ]
            ]
        },
        {
            "name": "shapely",
            "specs": [
                [
                    "==",
                    "2.0.6"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    "==",
                    "4.48.2"
                ]
            ]
        },
        {
            "name": "doclayout_yolo",
            "specs": [
                [
                    "==",
                    "0.0.3"
                ]
            ]
        }
    ],
    "lcname": "doc-page-extractor"
}
