textract2page

Name	textract2page
Version	0.1
home_page	https://github.com/slub/textract2page
Summary	Convert AWS Textract JSON to PRImA PAGE XML
upload_time	2023-05-04 12:42:14
maintainer
docs_url	None
author	Arne Rümmler
requires_python	>=3.7
license	Apache Software License
keywords	ocr mets page-xml aws
requirements	No requirements were recorded.

# textract2page

> Convert AWS Textract JSON to PRImA PAGE XML


## Introduction

This software converts OCR results from
[AWS Textract response](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html)
files to [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) files.

## Installation

In a Python [virtualenv](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments):

    pip install textract2page

## Usage

The package provides a file-based conversion function, available through both a CLI and a Python API.
The function takes the Textract JSON file and the original image file that was used
as input for the OCR. (The image is required because Textract stores coordinates as
`float` ratios of the image dimensions, whereas PAGE uses `int` pixel coordinates.)
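To illustrate why the image is needed, here is a minimal sketch of scaling ratio coordinates to pixel coordinates. It is not taken from the textract2page source: the helper names `polygon_to_pixels` and `to_page_points` are hypothetical, and the rounding and clamping behaviour is an assumption. The input mirrors the `Polygon` entries of a Textract block (a list of `X`/`Y` ratios), and the output uses the `x,y x,y ...` form of a PAGE `Coords/@points` attribute.

```python
from typing import Dict, List, Tuple


def polygon_to_pixels(
    polygon: List[Dict[str, float]], width: int, height: int
) -> List[Tuple[int, int]]:
    """Scale ratio points ({"X": ..., "Y": ...}) to integer pixel coordinates."""
    points = []
    for point in polygon:
        # Assumption of this sketch: round to the nearest pixel and clamp
        # to the image bounds; the actual converter may behave differently.
        x = min(max(int(round(point["X"] * width)), 0), width - 1)
        y = min(max(int(round(point["Y"] * height)), 0), height - 1)
        points.append((x, y))
    return points


def to_page_points(points: List[Tuple[int, int]]) -> str:
    """Serialize pixel points as a space-separated list of "x,y" pairs."""
    return " ".join(f"{x},{y}" for x, y in points)


# A rectangle given as ratios of a 2000 x 3000 pixel image.
polygon = [
    {"X": 0.1, "Y": 0.2},
    {"X": 0.5, "Y": 0.2},
    {"X": 0.5, "Y": 0.25},
    {"X": 0.1, "Y": 0.25},
]
pixels = polygon_to_pixels(polygon, width=2000, height=3000)
print(to_page_points(pixels))  # 200,600 1000,600 1000,750 200,750
```

The same ratios map to different pixel values for a different image size, which is why the JSON file alone is not sufficient.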

### Python API

To convert a Textract file `example.json` for an image file `example.jpg` to a PAGE file `example.xml`:

```python
from textract2page import convert_file

convert_file("example.json", "example.jpg", "example.xml")
```
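For several files at once, the call above could be wrapped in a loop. The following is only a sketch: `plan_conversions` and `convert_all` are hypothetical helpers, and it assumes that each JSON file sits next to an image with the same base name and a `.jpg` suffix. Only `convert_file`, with the three positional arguments shown above, comes from the package.

```python
from pathlib import Path
from typing import List, Tuple


def plan_conversions(
    directory: str, image_suffix: str = ".jpg"
) -> List[Tuple[Path, Path, Path]]:
    """List (json, image, xml) path triples for every JSON file with a matching image."""
    jobs = []
    for json_path in sorted(Path(directory).glob("*.json")):
        # Assumption: the image shares the JSON file's base name.
        image_path = json_path.with_suffix(image_suffix)
        if image_path.exists():
            jobs.append((json_path, image_path, json_path.with_suffix(".xml")))
    return jobs


def convert_all(directory: str) -> None:
    """Convert every planned triple with textract2page."""
    # Imported here so that planning works without the package installed.
    from textract2page import convert_file

    for json_path, image_path, xml_path in plan_conversions(directory):
        convert_file(str(json_path), str(image_path), str(xml_path))
```

Calling `convert_all("scans")` would then write one PAGE file next to each JSON file in a directory named `scans`; JSON files without a matching image are skipped.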

### CLI

Analogously, on the command line, writing the result either to standard output or to a file given with `-O`:

    textract2page example.json example.jpg > example.xml
    textract2page -O example.xml example.json example.jpg

You can get a list of options with `--help` or `-h`.

## Testing

Requires installation and a local copy of the repository.

To run the regression tests with `pytest`, do:

    make deps-test
    make test-api

To run the regression tests via the command line, do:

    # optionally:
    sudo apt-get install xmlstarlet
    make test-cli

(If `xmlstarlet` is available, then the CLI test will
also validate the result tree. Otherwise, this just
checks the command completes without error.)

Raw data

{
    "_id": null,
    "home_page": "https://github.com/slub/textract2page",
    "name": "textract2page",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "OCR,METS,PAGE-XML,AWS",
    "author": "Arne R\u00fcmmler",
    "author_email": "arne.ruemmler@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/b2/1e/c362e9c0718de5266cd26196f97980056b2cea3877db065ae476c39152c4/textract2page-0.1.tar.gz",
    "platform": null,
    "description": "# textract2page\n\n> Convert AWS Textract JSON to PRImA PAGE XML\n\n\n## Introduction\n\nThis software converts OCR results from\n[Amazon AWS Textract Response](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html)\nfiles to [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) files.\n\n## Installation\n\nIn a Python [virtualenv](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments):\n\n    pip install textract2page\n\n## Usage\n\nThe package contains a file-based conversion function provided as CLI and Python API.\nThe function takes the Textract JSON file and the original image file which was used\nas input for the OCR. (That is necessary because Textract stores coordinates in\n`float` ratios, whereas PAGE uses `int` in pixel indices.)\n\n### Python API\n\nTo convert a Textract file `example.json` for an image file `example.jpg` to a PAGE `example.xml`:\n\n```python\nfrom textract2page import convert_file\n\nconvert_file(\"example.json\", \"example.jpg\", \"example.xml\")\n```\n\n### CLI\n\nAnalogously, on the command line interface:\n\n    textract2page example.json example.jpg > example.xml\n    textract2page -O example.xml example.json example.jpg\n\nYou can get a list of options with `--help` or `-h`\n\n## Testing\n\nRequires installation and a local copy of the repository.\n\nTo run regression tests with `pytest`, do\n\n    make deps-test\n    make test-api\n\nTo run regression test via command line, do\n\n    # optionally:\n    sudo apt-get install xmlstarlet\n    make test-cli\n\n(If `xmlstarlet` is available, then the CLI test will\nalso validate the result tree. Otherwise, this just\nchecks the command completes without error.)\n",
    "bugtrack_url": null,
    "license": "Apache Software License",
    "summary": "Convert AWS Textract JSON to PRImA PAGE XML",
    "version": "0.1",
    "project_urls": {
        "Homepage": "https://github.com/slub/textract2page"
    },
    "split_keywords": [
        "ocr",
        "mets",
        "page-xml",
        "aws"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "60a457cde2464769467e3902504e0775c76af35daf0d6e98faf416684523cd4f",
                "md5": "2e3fc3c2151681095e7058e682076a7b",
                "sha256": "ea7f3e191cb564a9ffa75f995486f03bb8f8d5540b0d22068df3a5910b088b89"
            },
            "downloads": -1,
            "filename": "textract2page-0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2e3fc3c2151681095e7058e682076a7b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 9875,
            "upload_time": "2023-05-04T12:42:11",
            "upload_time_iso_8601": "2023-05-04T12:42:11.745285Z",
            "url": "https://files.pythonhosted.org/packages/60/a4/57cde2464769467e3902504e0775c76af35daf0d6e98faf416684523cd4f/textract2page-0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b21ec362e9c0718de5266cd26196f97980056b2cea3877db065ae476c39152c4",
                "md5": "68dd6c5aed7498efb07f17a874463e20",
                "sha256": "60c3ddcd304e55d50d2e899d7c853c21461789e3a42bbce6b1af2c76b8ddc7ca"
            },
            "downloads": -1,
            "filename": "textract2page-0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "68dd6c5aed7498efb07f17a874463e20",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 9105,
            "upload_time": "2023-05-04T12:42:14",
            "upload_time_iso_8601": "2023-05-04T12:42:14.031116Z",
            "url": "https://files.pythonhosted.org/packages/b2/1e/c362e9c0718de5266cd26196f97980056b2cea3877db065ae476c39152c4/textract2page-0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-04 12:42:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "slub",
    "github_project": "textract2page",
    "github_not_found": true,
    "lcname": "textract2page"
}
