hocr-utils

Name	hocr-utils
Version	0.0.3
home_page	https://github.com/Mrmarx/hocr_utils
Summary	Package containing utility function for hOCR and tesseract
upload_time	2023-03-16 12:17:04
author	Antoine Dubuis
requires_python	>=3.7
keywords	hocr tesseract utility

# hocr_utils

![ci_master](https://github.com/Mrmarx/hocr_utils/workflows/CI/badge.svg?branch=master)

![pypi_package_version](https://img.shields.io/pypi/v/hocr-utils.svg?color=blue)

![pypi_python_version](https://img.shields.io/pypi/pyversions/hocr-utils.svg)

**hocr_utils** is a package to transform, plot and simplify the use of hOCR files.

# Installation

## Dependencies

hocr-utils requires:

- Python (>= 3.7)

## Optional Dependencies

The plotting functions and the PDF-to-hOCR conversion require the following additional dependencies:

- pytesseract
- pdf2image
- opencv-python
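
Before calling the plotting or conversion functions, it can be useful to check which of these optional packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_optional_dependencies` is hypothetical and not part of hocr_utils. Note that the `opencv-python` distribution is imported under the module name `cv2`.

```python
import importlib.util

# Map each pip distribution name to the module name used in `import`.
# opencv-python is imported as `cv2`.
OPTIONAL_MODULES = {
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
    "opencv-python": "cv2",
}


def missing_optional_dependencies(modules=OPTIONAL_MODULES):
    """Return the pip names of optional dependencies that cannot be imported.

    Hypothetical helper for illustration; not provided by hocr_utils.
    """
    return [
        pip_name
        for pip_name, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_optional_dependencies()
    if missing:
        print("Missing optional dependencies:", ", ".join(missing))
    else:
        print("All optional dependencies are installed.")
```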

Additionally, a Tesseract language pack needs to be installed for non-English OCR.

For example, install the French language pack on Ubuntu with:

```bash
apt-get install tesseract-ocr-fra
```
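
To confirm that a language pack is visible to Tesseract, the output of `tesseract --list-langs` can be inspected. The sketch below is an illustration only: the function names are hypothetical, and it assumes the output starts with a one-line header followed by one language code per line. Depending on the Tesseract version, the list may be printed on stdout or stderr, so both are considered.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first non-empty line is assumed to be a header such as
    # "List of available languages (2):"; the rest are language codes.
    return lines[1:]


def installed_languages():
    """Return installed Tesseract language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Use stdout if it has content, otherwise fall back to stderr.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print("French pack installed:", "fra" in langs)
```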

## User Installation

The easiest way to install hocr-utils is using `pip`:

```bash
pip install -U hocr_utils
```

# Use cases

## Transform PIL Images to hOCR

Requires the `pytesseract` dependency and the Tesseract language pack for the language being recognized.

```python
from hocr_utils import utils
from PIL import Image

image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])
```

## Transform PDF to hOCR

Requires the `pytesseract` and `pdf2image` dependencies, as well as the Tesseract language pack for the language being recognized.


```python
from hocr_utils import utils

hocr = utils.pdf_to_hocr('./data/sample.pdf')
```

## Transform hOCR to a list of dictionaries

```python
from hocr_utils import utils

# `hocr` is the hOCR output from one of the conversions above
hocr_dict = utils.hocr_to_dict(hocr)
```

This can then be converted into a pandas DataFrame using `pd.DataFrame.from_records(hocr_dict)`.

By default, there is one entry per line of the hOCR. However, the list can be grouped by `['paragraph', 'line', 'word']` using the `by` argument.
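
To illustrate what grouping by paragraph, line, or word means, here is a standard-library sketch that walks an hOCR string and emits one record per element of the chosen level. It is not the implementation of `hocr_to_dict`: the function `hocr_records`, the record keys `bbox` and `text`, and the sample markup are assumptions made for this example. It relies on the hOCR class names `ocr_par`, `ocr_line`, and `ocrx_word`, and on well-formed markup in which every opened tag is closed.

```python
from html.parser import HTMLParser

# hOCR class names for each grouping level.
LEVEL_CLASSES = {"paragraph": "ocr_par", "line": "ocr_line", "word": "ocrx_word"}


def _parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR `title` attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class _HocrCollector(HTMLParser):
    """Collect one record per element carrying the target hOCR class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.records = []
        self._current = None  # record being filled, if any
        self._depth = 0       # tag nesting depth inside the current element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1
        elif self.target_class in (attrs.get("class") or "").split():
            self._current = {"bbox": _parse_bbox(attrs.get("title") or ""), "words": []}
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            rec = self._current
            self.records.append({"bbox": rec["bbox"], "text": " ".join(rec["words"])})
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["words"].extend(data.split())


def hocr_records(hocr, by="line"):
    """Return a list of {'bbox', 'text'} dicts, one per paragraph, line, or word."""
    collector = _HocrCollector(LEVEL_CLASSES[by])
    collector.feed(hocr)
    collector.close()
    return collector.records


SAMPLE = """
<div class='ocr_page' title='bbox 0 0 200 100'>
 <p class='ocr_par' title='bbox 10 10 190 60'>
  <span class='ocr_line' title='bbox 10 10 190 30'>
   <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 95'>Hello</span>
   <span class='ocrx_word' title='bbox 70 10 130 30; x_wconf 93'>world</span>
  </span>
  <span class='ocr_line' title='bbox 10 40 190 60'>
   <span class='ocrx_word' title='bbox 10 40 80 60; x_wconf 90'>again</span>
  </span>
 </p>
</div>
"""

if __name__ == "__main__":
    for level in ("paragraph", "line", "word"):
        print(level, hocr_records(SAMPLE, by=level))
```

In this sample, the same markup yields one record at the paragraph level, two at the line level, and three at the word level.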

## Get a single page from hOCR

```python
from hocr_utils import utils

# Extract a single page from the hOCR produced above
hocr_1 = utils.get_page(hocr, 1)
```
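
In hOCR, each page is an element with the class `ocr_page`, so selecting a page amounts to picking one of those elements. The sketch below shows the idea with the standard library only. It is not how `get_page` is implemented: `get_page_sketch` is a hypothetical name, the sample markup is made up, the input is assumed to be well-formed XML, and the index here is 0-based, which may differ from the convention used by `get_page`.

```python
import xml.etree.ElementTree as ET


def split_pages(hocr):
    """Return all elements with the hOCR class `ocr_page`, in document order."""
    root = ET.fromstring(hocr)
    return [el for el in root.iter() if "ocr_page" in el.get("class", "").split()]


def get_page_sketch(hocr, index):
    """Return the markup of the page at the given 0-based index (illustrative only)."""
    pages = split_pages(hocr)
    # tostring() also serializes trailing whitespace after the element, so strip it.
    return ET.tostring(pages[index], encoding="unicode").strip()


SAMPLE = """<html><body>
<div class="ocr_page" id="page_1" title="bbox 0 0 200 100"><span class="ocrx_word" title="bbox 10 10 60 30">First</span></div>
<div class="ocr_page" id="page_2" title="bbox 0 0 200 100"><span class="ocrx_word" title="bbox 10 10 70 30">Second</span></div>
</body></html>"""

if __name__ == "__main__":
    print(len(split_pages(SAMPLE)), "pages found")
    print(get_page_sketch(SAMPLE, 1))
```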

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Mrmarx/hocr_utils",
    "name": "hocr-utils",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "hocr tesseract utility",
    "author": "Antoine Dubuis",
    "author_email": "antoine.dubuis@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/4d/2b/e6a0989b4b9451185593e4ec011b5654571572e20a65efa8b23e4b80ae04/hocr_utils-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Package containing utility function for hOCR and tesseract",
    "version": "0.0.3",
    "split_keywords": [
        "hocr",
        "tesseract",
        "utility"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dc108627bb7bc0379b18f70591b182d902c819fccfd3724158a1eb6429da9d1d",
                "md5": "93ca97f488fd0d78c2c5a8496dcbab58",
                "sha256": "27420555a2efe72fa0d2468df121461b072f167d19fcfe88d955aaa30ba3ffb9"
            },
            "downloads": -1,
            "filename": "hocr_utils-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "93ca97f488fd0d78c2c5a8496dcbab58",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 7668,
            "upload_time": "2023-03-16T12:17:02",
            "upload_time_iso_8601": "2023-03-16T12:17:02.687644Z",
            "url": "https://files.pythonhosted.org/packages/dc/10/8627bb7bc0379b18f70591b182d902c819fccfd3724158a1eb6429da9d1d/hocr_utils-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4d2be6a0989b4b9451185593e4ec011b5654571572e20a65efa8b23e4b80ae04",
                "md5": "9525d26ba5d97ba8720302e0cfdc0cd1",
                "sha256": "d3874d8a19d318402004107deb3d2c80fed3197aba44fe2cde7f14275ba4c695"
            },
            "downloads": -1,
            "filename": "hocr_utils-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "9525d26ba5d97ba8720302e0cfdc0cd1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 7510,
            "upload_time": "2023-03-16T12:17:04",
            "upload_time_iso_8601": "2023-03-16T12:17:04.548474Z",
            "url": "https://files.pythonhosted.org/packages/4d/2b/e6a0989b4b9451185593e4ec011b5654571572e20a65efa8b23e4b80ae04/hocr_utils-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-03-16 12:17:04",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "Mrmarx",
    "github_project": "hocr_utils",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "hocr-utils"
}
