# Textract-Pipeline-PageDimensions
Provides functions that add the page dimensions, doc_width and doc_height, to the PAGE blocks of the Textract JSON. The values are stored under the custom attribute of each PAGE block, for example:
```
{'PageDimension': {'doc_width': 1549.0, 'doc_height': 370.0} }
```
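To make the shape of the added attribute concrete, here is a minimal sketch in plain Python that stores a PageDimension entry on every PAGE block of a Textract-style dictionary. It is not the library's implementation: the function name `attach_page_dimensions`, the `sizes` parameter, and the two-block sample response are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def attach_page_dimensions(textract_json: Dict, sizes: List[Tuple[float, float]]) -> Dict:
    """Store (width, height) under Custom.PageDimension on each PAGE block, in order."""
    page_blocks = [b for b in textract_json.get("Blocks", []) if b.get("BlockType") == "PAGE"]
    if len(page_blocks) != len(sizes):
        raise ValueError("expected one (width, height) pair per PAGE block")
    for block, (width, height) in zip(page_blocks, sizes):
        # Create the Custom dictionary if the block does not have one yet.
        custom = block.setdefault("Custom", {})
        custom["PageDimension"] = {"doc_width": width, "doc_height": height}
    return textract_json


# Hypothetical response with one PAGE block and one LINE block.
response = {"Blocks": [{"BlockType": "PAGE", "Id": "p1"}, {"BlockType": "LINE", "Id": "l1"}]}
attach_page_dimensions(response, [(1549.0, 370.0)])
print(response["Blocks"][0]["Custom"])
# {'PageDimension': {'doc_width': 1549.0, 'doc_height': 370.0}}
```

Only PAGE blocks are touched; other block types such as LINE keep their original content.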
# Install
```bash
> python -m pip install amazon-textract-pipeline-pagedimensions
```
Make sure your environment is set up with AWS credentials, either through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
# Samples
## Add Page dimensions for a local file
This sample uses amazon-textract-caller and amazon-textract-pipeline-pagedimensions.
```bash
python -m pip install amazon-textract-caller
```
```python
from textractpagedimensions.t_pagedimensions import add_page_dimensions
from textractcaller.t_call import call_textract
from trp.trp2 import TDocument, TDocumentSchema
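# Path to the local image file; used for the Textract call and for reading the dimensions
input_file = '<path to some image file>'
j = call_textract(input_document=input_file)
t_document: TDocument = TDocumentSchema().load(j)
# Adds custom['PageDimension'] to the PAGE blocks of t_document
add_page_dimensions(t_document=t_document, input_document=input_file)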
print(t_document.pages[0].custom['PageDimension'])
# output will be something like this:
# {
# 'doc_width': 1544,
# 'doc_height': 1065
# }
```
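One common reason to want the page dimensions is that Textract reports bounding boxes as ratios of the page width and height, so the pixel position of a block can only be recovered when the page size is known. The following is a minimal sketch of that conversion, assuming a PageDimension dictionary like the one printed above. The helper name `bounding_box_to_pixels` and the sample box values are hypothetical and not part of this package.

```python
from typing import Dict


def bounding_box_to_pixels(bounding_box: Dict[str, float],
                           page_dimension: Dict[str, float]) -> Dict[str, int]:
    """Scale a ratio-based Textract BoundingBox to pixel values for one page."""
    width = page_dimension["doc_width"]
    height = page_dimension["doc_height"]
    return {
        # Left and Width are ratios of the page width
        "left": round(bounding_box["Left"] * width),
        "width": round(bounding_box["Width"] * width),
        # Top and Height are ratios of the page height
        "top": round(bounding_box["Top"] * height),
        "height": round(bounding_box["Height"] * height),
    }


# Hypothetical values: the page size from the sample output and a made-up box.
page_dimension = {"doc_width": 1544, "doc_height": 1065}
box = {"Left": 0.25, "Top": 0.125, "Width": 0.5, "Height": 0.25}
pixels = bounding_box_to_pixels(box, page_dimension)
print(pixels)
# {'left': 386, 'width': 772, 'top': 133, 'height': 266}
```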
## Using the Amazon Textract Helper command line tool with PageDimensions
Together with the Amazon Textract Helper and the Amazon Textract Response Parser, we can build a pipeline that adds PageDimension and Orientation information for each page.
The following is a short demonstration of the information that is added to the Textract JSON.
```bash
> python -m pip install amazon-textract-helper amazon-textract-response-parser amazon-textract-pipeline-pagedimensions
> amazon-textract --input-document "s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf" | amazon-textract-pipeline-pagedimensions --input-document "s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf" | amazon-textract-pipeline --components add_page_orientation | jq '.Blocks[] | select(.BlockType=="PAGE") | .Custom'
{
"PageDimension": {
"doc_width": 1549,
"doc_height": 370
},
"Orientation": 0
}
{
"PageDimension": {
"doc_width": 1079,
"doc_height": 505
},
"Orientation": 0
}
```
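If jq is not available, the final filter step of the pipeline can be approximated in Python. The sketch below mirrors `.Blocks[] | select(.BlockType=="PAGE") | .Custom` on an already parsed Textract JSON document. The function name `page_custom_attributes` is hypothetical, and the inline sample is a trimmed stand-in for the real pipeline output.

```python
import json
from typing import Dict, Iterator, Optional


def page_custom_attributes(textract_json: Dict) -> Iterator[Optional[Dict]]:
    """Yield the Custom attribute of every PAGE block, in document order."""
    for block in textract_json.get("Blocks", []):
        if block.get("BlockType") == "PAGE":
            # Like jq's .Custom, this yields None when the attribute is missing.
            yield block.get("Custom")


# Trimmed sample shaped like the output shown above.
sample = json.loads("""
{"Blocks": [
  {"BlockType": "PAGE", "Custom": {"PageDimension": {"doc_width": 1549, "doc_height": 370}, "Orientation": 0}},
  {"BlockType": "LINE", "Text": "some text"},
  {"BlockType": "PAGE", "Custom": {"PageDimension": {"doc_width": 1079, "doc_height": 505}, "Orientation": 0}}
]}
""")

for custom in page_custom_attributes(sample):
    print(json.dumps(custom))
```

To use this at the end of the shell pipeline instead of jq, the sample could be replaced with `json.load(sys.stdin)`.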
Raw data
{
"_id": null,
"home_page": "https://github.com/aws-samples/amazon-textract-textractor/tree/master/tpipelinepagedimensions",
"name": "amazon-textract-pipeline-pagedimensions",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "amazon-textract-textractor amazon textract textractor pipeline page dimensions",
"author": "Amazon Rekognition Textract Demoes",
"author_email": "rekognition-textract-demos@amazon.com",
"download_url": "https://files.pythonhosted.org/packages/c4/c1/73efaf519831daca742cf181458fc9097542f037636f7a2b3112c53fe61a/amazon-textract-pipeline-pagedimensions-0.0.9.tar.gz",
"platform": null,
"description": "# Textract-Pipeline-PageDimensions\n\nProvides functions to add page dimensions with doc_width and doc_height to the Textract JSON schema for the PAGE blocks under the custom attribute in the form of:\n\ne. g.\n\n```\n{'PageDimension': {'doc_width': 1549.0, 'doc_height': 370.0} }\n```\n\n# Install\n\n```bash\n> python -m pip install amazon-textract-pipeline-pagedimensions\n```\n\nMake sure your environment is setup with AWS credentials through configuration files or environment variables or an attached role. (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)\n\n# Samples\n\n## Add Page dimensions for a local file\n\nsample uses amazon-textract-caller amazon-textract-pipeline-pagedimensions\n\n```bash\npython -m pip install amazon-textract-caller\n```\n\n```python\nfrom textractpagedimensions.t_pagedimensions import add_page_dimensions\nfrom textractcaller.t_call import call_textract\nfrom trp.trp2 import TDocument, TDocumentSchema\n\nj = call_textract(input_document='<path to some image file>')\nt_document: TDocument = TDocumentSchema().load(j)\nadd_page_dimensions(t_document=t_document, input_document=input_file)\nprint(t_document.pages[0].custom['PageDimension']) \n# output will be something like this:\n# {\n# 'doc_width': 1544,\n# 'doc_height': 1065\n# }\n```\n\n## Using the Amazon Textact Helper command line tool with PageDimensions\n\nTogether with the Amazon Textract Helper and Amazon Textract Response Parser, we can build a pipeline that includes information about PageDimension and Orientation of pages\nas a short demonstration on the information that is added to the Textract JSON.\n\n```bash\n> python -m pip install amazon-textract-helper amazon-textract-response-parser amazon-textract-pipeline-pagedimensions\n> amazon-textract --input-document \"s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf\" | amazon-textract-pipeline-pagedimensions --input-document 
\"s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf\" | amazon-textract-pipeline --components add_page_orientation | jq '.Blocks[] | select(.BlockType==\"PAGE\") | .Custom'\n\n{\n \"PageDimension\": {\n \"doc_width\": 1549,\n \"doc_height\": 370\n },\n \"Orientation\": 0\n}\n{\n \"PageDimension\": {\n \"doc_width\": 1079,\n \"doc_height\": 505\n },\n \"Orientation\": 0\n}\n```\n\n\n",
"bugtrack_url": null,
"license": "Apache License Version 2.0",
"summary": "Amazon Textract Pipeline Component to add page dimensions to page block types",
"version": "0.0.9",
"project_urls": {
"Homepage": "https://github.com/aws-samples/amazon-textract-textractor/tree/master/tpipelinepagedimensions"
},
"split_keywords": [
"amazon-textract-textractor",
"amazon",
"textract",
"textractor",
"pipeline",
"page",
"dimensions"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "85e84e12c544ccc841ac5669d47a30f837ddaafe2477afd60baa029c9de2afbc",
"md5": "e568009b4ef8ff2f9b602abcd53777e1",
"sha256": "d8f4d40c0e14f24664077677af79f40c3858e2344f7f6cf38e0bb8961bdadb5e"
},
"downloads": -1,
"filename": "amazon_textract_pipeline_pagedimensions-0.0.9-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "e568009b4ef8ff2f9b602abcd53777e1",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.6",
"size": 9316,
"upload_time": "2023-10-20T20:15:59",
"upload_time_iso_8601": "2023-10-20T20:15:59.232032Z",
"url": "https://files.pythonhosted.org/packages/85/e8/4e12c544ccc841ac5669d47a30f837ddaafe2477afd60baa029c9de2afbc/amazon_textract_pipeline_pagedimensions-0.0.9-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c4c173efaf519831daca742cf181458fc9097542f037636f7a2b3112c53fe61a",
"md5": "07a75fff0fce031b73ef1925492331ac",
"sha256": "efafbaf97d11a2c25ac2a69362a0ff7d98883ff5341f9349ad5021619e4ec4f2"
},
"downloads": -1,
"filename": "amazon-textract-pipeline-pagedimensions-0.0.9.tar.gz",
"has_sig": false,
"md5_digest": "07a75fff0fce031b73ef1925492331ac",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 8778,
"upload_time": "2023-10-20T20:16:00",
"upload_time_iso_8601": "2023-10-20T20:16:00.916259Z",
"url": "https://files.pythonhosted.org/packages/c4/c1/73efaf519831daca742cf181458fc9097542f037636f7a2b3112c53fe61a/amazon-textract-pipeline-pagedimensions-0.0.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-20 20:16:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "aws-samples",
"github_project": "amazon-textract-textractor",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "amazon-textract-pipeline-pagedimensions"
}