amazon-textract-pipeline-pagedimensions


Name: amazon-textract-pipeline-pagedimensions
Version: 0.0.9
Home page: https://github.com/aws-samples/amazon-textract-textractor/tree/master/tpipelinepagedimensions
Summary: Amazon Textract Pipeline Component to add page dimensions to page block types
Upload time: 2023-10-20 20:16:00
Author: Amazon Rekognition Textract Demoes
Requires Python: >=3.6
License: Apache License Version 2.0
Keywords: amazon-textract-textractor amazon textract textractor pipeline page dimensions
Requirements: No requirements were recorded.

# Textract-Pipeline-PageDimensions

Provides functions that add the page dimensions (`doc_width` and `doc_height`) to the PAGE blocks of the Textract JSON schema. The values are stored under the custom attribute, for example:

```
{'PageDimension': {'doc_width': 1549.0, 'doc_height': 370.0} }
```
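
As a rough illustration of that shape, the sketch below parses the custom attribute into a small dataclass. The `PageDimension` class and its `from_custom` helper are hypothetical names introduced here for illustration; they are not part of this package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageDimension:
    """Hypothetical container for the values stored under 'PageDimension'."""
    doc_width: float
    doc_height: float

    @classmethod
    def from_custom(cls, custom: Optional[dict]) -> Optional["PageDimension"]:
        # Return None when the attribute has not been added to the page.
        if not custom or "PageDimension" not in custom:
            return None
        dim = custom["PageDimension"]
        return cls(float(dim["doc_width"]), float(dim["doc_height"]))


# The example value from the README above.
custom = {'PageDimension': {'doc_width': 1549.0, 'doc_height': 370.0}}
dimension = PageDimension.from_custom(custom)
print(dimension)
```

Returning `None` for a missing attribute keeps the caller in charge of deciding what to do with pages that have not been processed yet.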

# Install

```bash
> python -m pip install amazon-textract-pipeline-pagedimensions
```

Make sure your environment is set up with AWS credentials, either through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

# Samples

## Add Page dimensions for a local file

This sample uses amazon-textract-caller together with amazon-textract-pipeline-pagedimensions.

```bash
python -m pip install amazon-textract-caller
```

```python
from textractpagedimensions.t_pagedimensions import add_page_dimensions
from textractcaller.t_call import call_textract
from trp.trp2 import TDocument, TDocumentSchema

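# Path to the local image; the same file is passed to Textract and to add_page_dimensions.
input_file = '<path to some image file>'
j = call_textract(input_document=input_file)
t_document: TDocument = TDocumentSchema().load(j)
# Adds {'PageDimension': {...}} to the custom attribute of each PAGE block.
add_page_dimensions(t_document=t_document, input_document=input_file)
print(t_document.pages[0].custom['PageDimension'])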
# output will be something like this:
# {
#     'doc_width': 1544,
#     'doc_height': 1065
# }
```
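
One likely use for these values is converting Textract geometry to pixel coordinates, since Textract reports each `BoundingBox` as ratios of the page width and height. The following is a minimal sketch under that assumption; `bbox_to_pixels` is a hypothetical helper, not a function of this package, and the bounding box values are made up for illustration.

```python
from typing import Dict, Tuple


def bbox_to_pixels(bounding_box: Dict[str, float],
                   page_dimension: Dict[str, float]) -> Tuple[int, int, int, int]:
    """Convert a ratio-based bounding box to (left, top, right, bottom) pixels."""
    width = page_dimension["doc_width"]
    height = page_dimension["doc_height"]
    # Left/Width scale with the page width, Top/Height with the page height.
    left = bounding_box["Left"] * width
    top = bounding_box["Top"] * height
    right = left + bounding_box["Width"] * width
    bottom = top + bounding_box["Height"] * height
    return round(left), round(top), round(right), round(bottom)


# Page size taken from the sample output above; the box itself is invented.
page_dimension = {"doc_width": 1544, "doc_height": 1065}
bounding_box = {"Left": 0.25, "Top": 0.2, "Width": 0.5, "Height": 0.4}
print(bbox_to_pixels(bounding_box, page_dimension))
```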

## Using the Amazon Textract Helper command line tool with PageDimensions

Together with the Amazon Textract Helper and the Amazon Textract Response Parser, we can build a pipeline that adds the PageDimension and Orientation of each page.
The following is a short demonstration of the information that is added to the Textract JSON.

```bash
> python -m pip install amazon-textract-helper amazon-textract-response-parser amazon-textract-pipeline-pagedimensions
> amazon-textract --input-document "s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf" | amazon-textract-pipeline-pagedimensions --input-document "s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf"  | amazon-textract-pipeline --components add_page_orientation | jq '.Blocks[] | select(.BlockType=="PAGE") | .Custom'

{
  "PageDimension": {
    "doc_width": 1549,
    "doc_height": 370
  },
  "Orientation": 0
}
{
  "PageDimension": {
    "doc_width": 1079,
    "doc_height": 505
  },
  "Orientation": 0
}
```
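
Where `jq` is not available, the same filter can be approximated in Python. This is only a sketch: `page_custom_attributes` is a hypothetical helper, and the sample JSON is a trimmed, hand-written stand-in for the pipeline output shown above. Unlike the `jq` expression, it skips PAGE blocks that have no `Custom` attribute instead of printing `null`.

```python
import json
from typing import Iterator


def page_custom_attributes(textract_json: dict) -> Iterator[dict]:
    """Yield the 'Custom' attribute of every PAGE block that has one."""
    for block in textract_json.get("Blocks", []):
        if block.get("BlockType") == "PAGE" and "Custom" in block:
            yield block["Custom"]


# Trimmed stand-in for the JSON written by the pipeline; real output has many more fields.
sample_output = json.loads("""
{"Blocks": [
  {"BlockType": "PAGE", "Custom": {"PageDimension": {"doc_width": 1549, "doc_height": 370}, "Orientation": 0}},
  {"BlockType": "LINE", "Text": "ignored"},
  {"BlockType": "PAGE", "Custom": {"PageDimension": {"doc_width": 1079, "doc_height": 505}, "Orientation": 0}}
]}
""")

for custom in page_custom_attributes(sample_output):
    print(json.dumps(custom))
```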



            
