Name | pero-ocr |
Version | 0.7.0 |
home_page | |
Summary | Toolkit for advanced OCR of poor quality documents |
upload_time | 2024-02-21 14:52:51 |
maintainer | |
docs_url | None |
author | |
requires_python | >=3.9 |
license | BSD 3-Clause License Copyright (c) 2019, DCGM All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
keywords | ocr, layout analysis, handwriting recognition |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# pero-ocr
The package provides a full OCR pipeline including text paragraph detection, text line detection, text transcription, and text refinement using a language model.
The package can be used as a command-line application or as a Python package that provides a document-processing class and a class representing document page content.
## Please cite
If you use pero-ocr, please cite:
* O Kodym, M Hradiš: Page Layout Analysis System for Unconstrained Historic Documents. ICDAR, 2021.
* M Kišš, K Beneš, M Hradiš: AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. ICDAR, 2021.
* J Kohút, M Hradiš: TS-Net: OCR Trained to Switch Between Text Transcription Styles. ICDAR, 2021.
## Running stuff
Scripts (as well as tests) assume that it is possible to import ``pero_ocr`` and its components.
For the current shell session, this can be achieved by setting ``PYTHONPATH`` up:
```
export PYTHONPATH=/path/to/the/repo:$PYTHONPATH
```
As a more permanent solution, a very simplistic `setup.py` is prepared:
```
python setup.py develop
```
Beware that `setup.py` is not guaranteed to install everything required; for example, setting up CUDA is up to you.
pero-ocr can later be removed from your Python distribution by running:
```
python setup.py develop --uninstall
```
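To confirm that either approach made the package visible, a quick check from Python can help. This is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of pero-ocr.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located on the import path."""
    # find_spec() returns None when a top-level module cannot be found;
    # it does not execute the module itself.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    status = "found" if is_importable("pero_ocr") else "NOT found"
    print(f"pero_ocr: {status}")
```

If the module is reported as not found, re-check `PYTHONPATH` in the same shell session that runs the scripts.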
## Available models
General layout analysis (printed and handwritten) with European printed OCR specialized for Czech newspapers can be [downloaded here](https://nextcloud.fit.vutbr.cz/s/NtAbHTNkZFpapdJ). The OCR engine is suitable for most European printed documents. It is specialized for low-quality Czech newspapers digitized from microfilms, but it provides very good results for almost all types of printed documents in most languages. If you are interested in processing printed Fraktur fonts, handwritten documents, or medieval manuscripts, feel free to contact the authors. The newest OCR engines are available at [pero-ocr.fit.vutbr.cz](https://pero-ocr.fit.vutbr.cz). OCR engines are also available through an API running at [pero-ocr.fit.vutbr.cz/api](https://pero-ocr.fit.vutbr.cz/api); see the [GitHub repository](https://github.com/DCGM/pero-ocr-api).
## Command line application
The command-line application is `./user_scripts/parse_folder.py`. It processes images in a directory using an OCR engine. It can render detected lines into an image and write document content in Page XML and ALTO XML formats. Additionally, it can crop all text lines as rectangular regions of normalized size and save them into separate image files.
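When the script is driven from another Python program, the command can be assembled as an argument list and passed to `subprocess`. The sketch below is illustrative only: the three flags are the ones used in the container example in this README, while the helper names and all paths are hypothetical placeholders.

```python
import subprocess
import sys
from pathlib import Path


def build_parse_folder_cmd(repo_dir, config, input_dir, output_dir):
    """Build the argument list for user_scripts/parse_folder.py."""
    script = Path(repo_dir) / "user_scripts" / "parse_folder.py"
    return [
        sys.executable, str(script),          # run with the current interpreter
        "--config", str(config),              # OCR engine configuration file
        "--input-image-path", str(input_dir), # directory with page images
        "--output-xml-path", str(output_dir), # directory for Page XML output
    ]


def run_parse_folder(repo_dir, config, input_dir, output_dir):
    """Run the script and raise CalledProcessError if it exits with an error."""
    cmd = build_parse_folder_cmd(repo_dir, config, input_dir, output_dir)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths; only prints the command instead of running it.
    print(" ".join(build_parse_folder_cmd(".", "engine/config.ini", "input", "output")))
```

Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.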
## Running command line application in container
A Docker container can be built from the source code to run scripts and programs based on pero-ocr. The following example runs the `parse_folder.py` script to generate Page XML files for the images in an input directory:
```shell
docker run --rm --tty --interactive \
    --volume path/to/input/dir:/input \
    --volume path/to/output/dir:/output \
    --volume path/to/ocr/engine:/engine \
    --gpus all \
    pero-ocr /usr/bin/python3 user_scripts/parse_folder.py \
        --config /engine/config.ini \
        --input-image-path /input \
        --output-xml-path /output
```
Be sure to use container-internal paths for the data passed in the command. Because of container isolation, all input and output data locations have to be passed to the container via the `--volume` argument. See the [docker run command reference](https://docs.docker.com/engine/reference/run/) for more information.
The container can be built like this:
```shell
docker build -f Dockerfile -t pero-ocr .
```
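Once the container has written Page XML files to the output directory, the recognized text can be read back with the standard library. The sketch below assumes the usual PAGE layout, in which each `TextLine` element carries its text in a direct `TextEquiv/Unicode` child; the function name `read_line_texts` is hypothetical, and the exact schema version emitted by pero-ocr is not assumed.

```python
import sys
import xml.etree.ElementTree as ET


def read_line_texts(xml_source):
    """Return the text of every TextLine in a Page XML document, in file order."""
    root = ET.parse(xml_source).getroot()
    texts = []
    # "{*}" matches any namespace, so the PAGE schema version does not matter.
    for line in root.findall(".//{*}TextLine"):
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        if unicode_el is None or unicode_el.text is None:
            texts.append("")  # line without a transcription
        else:
            texts.append(unicode_el.text)
    return texts


if __name__ == "__main__":
    # Usage: python read_pagexml.py path/to/output/dir/page.xml [...]
    for path in sys.argv[1:]:
        for text in read_line_texts(path):
            print(text)
```

Lines without a transcription are kept as empty strings so that the result stays aligned with the order of `TextLine` elements in the file.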
## Integration of the pero-ocr python module
This example shows how to use the OCR pipeline provided by the pero-ocr package directly, and thus how to integrate pero-ocr into other applications. The `PageLayout` class represents the content of a single document page; it can be loaded from Page XML and exported to Page XML and ALTO XML formats. The OCR pipeline is represented by the `PageParser` class.
```python
import os
import configparser
import cv2
import numpy as np
from pero_ocr.document_ocr.layout import PageLayout
from pero_ocr.document_ocr.page_parser import PageParser
# Read config file.
config_path = "./config_file.ini"
config = configparser.ConfigParser()
config.read(config_path)
# Init the OCR pipeline.
# You have to specify config_path to be able to use relative paths
# inside the config file.
page_parser = PageParser(config, config_path=os.path.dirname(config_path))
# Read the document page image.
input_image_path = "page_image.jpg"
image = cv2.imread(input_image_path, 1)
# Init empty page content.
# This object will be updated by the ocr pipeline. id can be any string and it is used to identify the page.
page_layout = PageLayout(id=input_image_path,
                         page_size=(image.shape[0], image.shape[1]))
# Process the image by the OCR pipeline
page_layout = page_parser.process_page(image, page_layout)
page_layout.to_pagexml('output_page.xml') # Save results as Page XML.
page_layout.to_altoxml('output_ALTO.xml') # Save results as ALTO XML.
# Render detected text regions and text lines into the image and
# save it into a file.
rendered_image = page_layout.render_to_image(image)
cv2.imwrite('page_image_render.jpg', rendered_image)
# Save each cropped text line in a separate .jpg file.
for region in page_layout.regions:
    for line in region.lines:
        cv2.imwrite(f'file_id-{line.id}.jpg', line.crop.astype(np.uint8))
```
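The same nested loop over `regions` and `lines` can be used to gather the recognized text. The following is a minimal sketch: `regions`, `lines`, and `id` appear in the example above, whereas the `transcription` attribute on a line is an assumption that this README does not confirm. The stand-in objects in the demo are hypothetical and only make the snippet runnable without an OCR engine.

```python
from types import SimpleNamespace


def collect_transcriptions(page_layout):
    """Return (line_id, text) pairs for every line, region by region."""
    pairs = []
    for region in page_layout.regions:
        for line in region.lines:
            # ASSUMPTION: the recognized text is stored in `line.transcription`;
            # a missing or None value is reported as an empty string.
            text = getattr(line, "transcription", None) or ""
            pairs.append((line.id, text))
    return pairs


if __name__ == "__main__":
    # Stand-in objects that mimic the attributes used above.
    demo_layout = SimpleNamespace(regions=[
        SimpleNamespace(lines=[
            SimpleNamespace(id="r1-l1", transcription="first line"),
            SimpleNamespace(id="r1-l2", transcription=None),
        ]),
    ])
    for line_id, text in collect_transcriptions(demo_layout):
        print(f"{line_id}\t{text}")
```

With a real `PageLayout`, the function would be called after `process_page` has filled in the page content.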
## Contributing
Ongoing work is expected to happen on the `develop` branch, so if you plan to contribute, it is best to check it out when cloning:
```
git clone -b develop git@github.com:DCGM/pero-ocr.git pero-ocr
```
### Testing
Currently, only unit tests are provided, and they cover only some of the code. Simply run your preferred test runner, e.g.:
```
~/pero-ocr $ green
```
#### Simple regression testing
Regression testing can be done with `test/processing_test.sh`. The script calls the containerized `parse_folder.py` to process input images and Page XML files, and then calls a user-supplied comparison script to compare the outputs with example outputs supplied by the user. The `pero-ocr` container has to be built in advance to run the test; see the 'Running command line application in container' section. The script can be called like this:
```shell
sh test/processing_test.sh \
    --input-images path/to/input/image/directory \
    --input-xmls path/to/input/page-xml/directory \
    --output-dir path/to/output/dir \
    --configuration path/to/ocr/engine/config.ini \
    --example path/to/example/output/data \
    --test-utility path/to/test/script \
    --test-output path/to/testscript/output/dir \
    --gpu-ids gpu ids for docker container
```
The first four arguments are mandatory. `--gpu-ids` defaults to 'all', which passes all GPUs to the container. The test utility, example outputs, and test output folder need to be set only if the results should be compared. The test utility is expected to be the path to the `eval_ocr_pipeline_xml.py` script from the `pero` repository. Be sure to set PYTHONPATH correctly and install the dependencies of the `pero` repository for the utility to work. Another script can be used if it takes the same arguments. Otherwise, the output data can of course be compared manually after processing.
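For a quick manual comparison, a byte-for-byte check of the output directory against the example directory may be enough as a first pass. The sketch below uses only the standard library and is not a substitute for `eval_ocr_pipeline_xml.py`: it only reports whether files are identical, which is stricter than an OCR accuracy evaluation. The function name and result keys are hypothetical.

```python
import filecmp
from pathlib import Path


def compare_output_dirs(output_dir, example_dir):
    """Compare each file in `example_dir` with the same-named file in `output_dir`."""
    names = sorted(p.name for p in Path(example_dir).iterdir() if p.is_file())
    # shallow=False forces a content comparison instead of trusting os.stat() data.
    match, mismatch, errors = filecmp.cmpfiles(
        output_dir, example_dir, names, shallow=False
    )
    return {
        "match": match,                   # identical content
        "mismatch": mismatch,             # present in both, content differs
        "missing_or_unreadable": errors,  # could not be compared
    }


if __name__ == "__main__":
    import sys
    # Usage: python compare_dirs.py path/to/output/dir path/to/example/output/data
    result = compare_output_dirs(sys.argv[1], sys.argv[2])
    for key, files in result.items():
        print(f"{key}: {files}")
```

Any file listed under `mismatch` still needs a closer look, since small differences in the XML do not necessarily mean the OCR result got worse.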
Raw data
{
"_id": null,
"home_page": "",
"name": "pero-ocr",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "Karel Bene\u0161 <ibenes@fit.vutbr.cz>",
"keywords": "OCR,Layout analysis,handwriting recognition",
"author": "",
"author_email": "Michal Hradi\u0161 <hradis@fit.vutbr.cz>",
"download_url": "https://files.pythonhosted.org/packages/f1/46/3a0aeb8356db2f9bd35c382d3ab11b1281e285a6bc9183401c49d7163b7a/pero-ocr-0.7.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "BSD 3-Clause License Copyright (c) 2019, DCGM All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ",
"summary": "Toolkit for advanced OCR of poor quality documents",
"version": "0.7.0",
"project_urls": {
"homepage": "https://pero.fit.vutbr.cz/",
"repository": "https://github.com/DCGM/pero-ocr"
},
"split_keywords": [
"ocr",
"layout analysis",
"handwriting recognition"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1ad253e86ccdc36cf6549022d67ccb7f577ac22f6f634f65c763b318ba13bd58",
"md5": "39dd4e7587b3b5c09a0dd375cb6964ee",
"sha256": "f503f0f0578928893e92dc20abb2d7826f0141f1881f6973a12f0d4721ad1abb"
},
"downloads": -1,
"filename": "pero_ocr-0.7.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "39dd4e7587b3b5c09a0dd375cb6964ee",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 95438,
"upload_time": "2024-02-21T14:52:49",
"upload_time_iso_8601": "2024-02-21T14:52:49.512806Z",
"url": "https://files.pythonhosted.org/packages/1a/d2/53e86ccdc36cf6549022d67ccb7f577ac22f6f634f65c763b318ba13bd58/pero_ocr-0.7.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f1463a0aeb8356db2f9bd35c382d3ab11b1281e285a6bc9183401c49d7163b7a",
"md5": "a8c2b9a8a1ff4315221c5d1674f00530",
"sha256": "f4cdd44c03a02bac437cade6a7f60365ad7b33b4a97fd0ff4641bb8a122f043d"
},
"downloads": -1,
"filename": "pero-ocr-0.7.0.tar.gz",
"has_sig": false,
"md5_digest": "a8c2b9a8a1ff4315221c5d1674f00530",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 93577,
"upload_time": "2024-02-21T14:52:51",
"upload_time_iso_8601": "2024-02-21T14:52:51.864533Z",
"url": "https://files.pythonhosted.org/packages/f1/46/3a0aeb8356db2f9bd35c382d3ab11b1281e285a6bc9183401c49d7163b7a/pero-ocr-0.7.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-21 14:52:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "DCGM",
"github_project": "pero-ocr",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pero-ocr"
}