# hocr_utils
![ci_master](https://github.com/Mrmarx/hocr_utils/workflows/CI/badge.svg?branch=master)
![pypi_package_version](https://img.shields.io/pypi/v/hocr-utils.svg?color=blue)
![pypi_python_version](https://img.shields.io/pypi/pyversions/hocr-utils.svg)
**hocr_utils** is a package to transform, plot and simplify the use of hOCR files.
# Installation
## Dependencies
hocr-utils requires:
- Python (>= 3.7)
## Optional Dependencies
The functions that plot hOCR output or transform PDF files into hOCR require the following additional dependencies:
- pytesseract
- pdf2image
- opencv-python
Additionally, a Tesseract language pack needs to be installed for non-English OCR.
Example: install the French language pack on Ubuntu with:
```bash
apt-get install tesseract-ocr-fra
```
## User Installation
The easiest way to install hocr_utils is using `pip`:
```bash
pip install -U hocr_utils
```
# Use cases
## Transform PIL Images to hOCR
Requires the `pytesseract` dependency and the Tesseract language pack for the requested language.
```python
from hocr_utils import utils
from PIL import Image
image = Image.open('./data/sample.png')
hocr = utils.images_to_hocr([image])
```
## Transform PDF to hOCR
Requires the `pytesseract` and `pdf2image` dependencies as well as the Tesseract language pack for the requested language.
```python
from hocr_utils import utils
hocr = utils.pdf_to_hocr('./data/sample.pdf')
```
## Transform hOCR to a list of dictionaries
```python
from hocr_utils import utils
hocr_dict = utils.hocr_to_dict(hocr)
```
This can then be transformed into a pandas DataFrame using `pd.DataFrame.from_records(hocr_dict)`.
By default there will be one entry per line in the hOCR. However, it is possible to group the list by `['paragraph', 'line', 'word']` using the `by` argument.
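As a minimal sketch of the DataFrame conversion, the snippet below uses hand-written records in place of the real output of `utils.hocr_to_dict`. The key names (`page`, `line`, `text`) are assumptions chosen for illustration; the keys actually produced by hocr_utils may differ.

```python
import pandas as pd

# Hypothetical records standing in for the output of utils.hocr_to_dict(hocr);
# the real key names produced by hocr_utils may differ.
hocr_dict = [
    {"page": 1, "line": 1, "text": "Hello world"},
    {"page": 1, "line": 2, "text": "Second line"},
]

# Each dictionary becomes one row; the dictionary keys become the columns.
df = pd.DataFrame.from_records(hocr_dict)
print(df)
```

Once the data is in a DataFrame, the usual pandas tools (filtering, grouping, exporting to CSV) can be applied to the OCR results.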
## Get a single page from hOCR
```python
from hocr_utils import utils
hocr_1 = utils.get_page(hocr, 1)
```
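Before selecting a page, it can help to know how many pages an hOCR document contains. The hOCR format marks each page with an element carrying the class `ocr_page`, so a rough count is possible with the standard library alone. This sketch does not use hocr_utils, and it assumes the hOCR is available as a string; the `PageCounter` class and the `sample` markup are illustrative only.

```python
from html.parser import HTMLParser


class PageCounter(HTMLParser):
    """Counts elements whose class attribute includes 'ocr_page'."""

    def __init__(self):
        super().__init__()
        self.pages = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_page" in classes:
            self.pages += 1


# Illustrative two-page hOCR fragment, not real Tesseract output.
sample = (
    "<div class='ocr_page' id='page_1'></div>"
    "<div class='ocr_page' id='page_2'></div>"
)

counter = PageCounter()
counter.feed(sample)
print(counter.pages)
```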
Raw data
{
"_id": null,
"home_page": "https://github.com/Mrmarx/hocr_utils",
"name": "hocr-utils",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "hocr tesseract utility",
"author": "Antoine Dubuis",
"author_email": "antoine.dubuis@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/4d/2b/e6a0989b4b9451185593e4ec011b5654571572e20a65efa8b23e4b80ae04/hocr_utils-0.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Package containing utility function for hOCR and tesseract",
"version": "0.0.3",
"split_keywords": [
"hocr",
"tesseract",
"utility"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "dc108627bb7bc0379b18f70591b182d902c819fccfd3724158a1eb6429da9d1d",
"md5": "93ca97f488fd0d78c2c5a8496dcbab58",
"sha256": "27420555a2efe72fa0d2468df121461b072f167d19fcfe88d955aaa30ba3ffb9"
},
"downloads": -1,
"filename": "hocr_utils-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "93ca97f488fd0d78c2c5a8496dcbab58",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 7668,
"upload_time": "2023-03-16T12:17:02",
"upload_time_iso_8601": "2023-03-16T12:17:02.687644Z",
"url": "https://files.pythonhosted.org/packages/dc/10/8627bb7bc0379b18f70591b182d902c819fccfd3724158a1eb6429da9d1d/hocr_utils-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4d2be6a0989b4b9451185593e4ec011b5654571572e20a65efa8b23e4b80ae04",
"md5": "9525d26ba5d97ba8720302e0cfdc0cd1",
"sha256": "d3874d8a19d318402004107deb3d2c80fed3197aba44fe2cde7f14275ba4c695"
},
"downloads": -1,
"filename": "hocr_utils-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "9525d26ba5d97ba8720302e0cfdc0cd1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 7510,
"upload_time": "2023-03-16T12:17:04",
"upload_time_iso_8601": "2023-03-16T12:17:04.548474Z",
"url": "https://files.pythonhosted.org/packages/4d/2b/e6a0989b4b9451185593e4ec011b5654571572e20a65efa8b23e4b80ae04/hocr_utils-0.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-03-16 12:17:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "Mrmarx",
"github_project": "hocr_utils",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "hocr-utils"
}