amazon-textract-caller


Name: amazon-textract-caller
Version: 0.2.4
Home page: https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller
Summary: Amazon Textract Caller tools
Upload time: 2024-06-20 22:14:43
Author: Amazon Rekognition Textract Demoes
Requires Python: >=3.6
License: Apache License Version 2.0
Keywords: amazon-textract-textractor amazon textract textractor helper caller
Requirements: amazon-textract-caller, Pillow, tabulate, XlsxWriter, editdistance

# Textract-Caller

amazon-textract-caller provides a collection of ready-to-use functions and sample implementations that speed up evaluation and development for any project using Amazon Textract.

It makes it easy to call Amazon Textract regardless of file type and location.
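
To illustrate what "regardless of file type and location" involves, the sketch below classifies an input string as an S3 object or a local file and flags formats that can hold more than one page. The helper `describe_input`, the `InputInfo` type and the suffix set are assumptions made for this illustration; they are not part of amazon-textract-caller and do not describe its internal logic.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats that can contain several pages (an assumption for this sketch).
MULTI_PAGE_SUFFIXES = {".pdf", ".tif", ".tiff"}


@dataclass
class InputInfo:
    location: str           # "s3" or "local"
    bucket: str             # empty for local files
    key_or_path: str        # S3 object key or local path
    may_be_multi_page: bool


def describe_input(input_document: str) -> InputInfo:
    """Classify an input string by location and by file suffix."""
    parsed = urlparse(input_document)
    if parsed.scheme == "s3":
        key = parsed.path.lstrip("/")
        suffix = PurePosixPath(key).suffix.lower()
        return InputInfo("s3", parsed.netloc, key, suffix in MULTI_PAGE_SUFFIXES)
    # Anything that is not an s3:// URI is treated as a local path.
    suffix = PurePosixPath(input_document).suffix.lower()
    return InputInfo("local", "", input_document, suffix in MULTI_PAGE_SUFFIXES)


print(describe_input("s3://some-bucket/some-document.pdf"))
print(describe_input("/folder/local-filesystem-file.png"))
```

The distinction matters because Textract's asynchronous operations read their input from S3, and multi-page documents are processed through those asynchronous operations.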

## Install

```bash
python -m pip install amazon-textract-caller
```

## Functions

```python
from textractcaller import call_textract
def call_textract(input_document: Union[str, bytes],
                  features: Optional[List[Textract_Features]] = None,
                  queries_config: Optional[QueriesConfig] = None,
                  output_config: Optional[OutputConfig] = None,
                  adapters_config: Optional[AdaptersConfig] = None,
                  kms_key_id: str = "",
                  job_tag: str = "",
                  notification_channel: Optional[NotificationChannel] = None,
                  client_request_token: str = "",
                  return_job_id: bool = False,
                  force_async_api: bool = False,
                  call_mode: Textract_Call_Mode = Textract_Call_Mode.DEFAULT,
                  boto3_textract_client=None,
                  job_done_polling_interval=1) -> dict:
```

`get_full_json` is also useful for receiving the JSON response of an asynchronous job (`start_document_text_detection` or `start_document_analysis`):

```python
from textractcaller import get_full_json
def get_full_json(job_id: str = None,
                  textract_api: Textract_API = Textract_API.DETECT,
                  boto3_textract_client=None)->dict:
```
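
Asynchronous results are returned in pages linked by a `NextToken`, which is why a helper that assembles the full JSON is convenient. The following is a minimal sketch of that pagination loop for a text-detection job, assuming a client object that exposes boto3's `get_document_text_detection`. The function name `collect_all_blocks` is hypothetical, and the library's own implementation may differ.

```python
from typing import Any, Dict, List, Optional


def collect_all_blocks(textract_client: Any, job_id: str) -> Dict[str, Any]:
    """Fetch every result page of a finished text-detection job and merge the blocks."""
    blocks: List[Dict[str, Any]] = []
    metadata: Dict[str, Any] = {}
    next_token: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract_client.get_document_text_detection(**kwargs)
        blocks.extend(page.get("Blocks", []))
        metadata = page.get("DocumentMetadata", metadata)
        # A missing NextToken means this was the last page.
        next_token = page.get("NextToken")
        if not next_token:
            break
    return {"DocumentMetadata": metadata, "Blocks": blocks}
```

With a real client this could be called as `collect_all_blocks(boto3.client("textract"), job_id)`. For a job started with `start_document_analysis`, the corresponding boto3 call is `get_document_analysis`.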

`get_full_json_from_output_config` is useful when receiving the JSON from the OutputConfig location:

```python
from textractcaller import get_full_json_from_output_config
def get_full_json_from_output_config(output_config: OutputConfig = None,
                                     job_id: str = None,
                                     s3_client = None)->dict:
```
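
As a rough picture of what reading from an OutputConfig location involves, the sketch below lists the objects stored for a job and merges their `Blocks`. It assumes that results are stored as numerically named objects under `<s3_prefix>/<job_id>/` (for example `output/<job_id>/1`) and that other objects under that prefix can be ignored. That layout, and the function name `read_output_config_results`, are assumptions of this sketch rather than documented behavior of the library.

```python
import json
from typing import Any, Dict, List


def read_output_config_results(s3_client: Any, bucket: str, prefix: str,
                               job_id: str) -> Dict[str, Any]:
    """Merge the Blocks of all numbered result objects stored for one job."""
    job_prefix = "/".join(p for p in (prefix.strip("/"), job_id) if p) + "/"
    keys: List[str] = []
    kwargs: Dict[str, Any] = {"Bucket": bucket, "Prefix": job_prefix}
    while True:
        listing = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in listing.get("Contents", []))
        if not listing.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = listing["NextContinuationToken"]

    # Keep only numerically named objects and order them 1, 2, ..., 10
    # (a plain string sort would put "10" before "2").
    numbered = sorted(
        (k for k in keys if k.rsplit("/", 1)[-1].isdigit()),
        key=lambda k: int(k.rsplit("/", 1)[-1]),
    )
    blocks: List[Dict[str, Any]] = []
    for key in numbered:
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        blocks.extend(json.loads(body).get("Blocks", []))
    return {"Blocks": blocks}
```

The `s3_client` argument is expected to behave like `boto3.client("s3")`; only `list_objects_v2` and `get_object` are used.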

## Samples

### Calling with a file from the local filesystem using only detect_text

```python
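from textractcaller import call_textract

# No features are requested, so only text detection is performed.
textract_json = call_textract(input_document="/folder/local-filesystem-file.png")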
```

### Calling with a file from the local filesystem using only detect_text and parsing with the Textract Response Parser

(needs the trp dependency, installed through `python -m pip install amazon-textract-response-parser`)

```python
from trp import Document
from textractcaller import call_textract

textract_json = call_textract(input_document="/folder/local-filesystem-file.png")
d = Document(textract_json)
```

### Calling with Queries for a multi-page document and extracting the Answers

This sample also uses amazon-textract-response-parser:

```bash
python -m pip install amazon-textract-caller amazon-textract-response-parser
```

```python
import textractcaller as tc
import trp.trp2 as t2
import boto3

textract = boto3.client('textract', region_name="us-east-2")
q1 = tc.Query(text="What is the employee SSN?", alias="SSN", pages=["1"])
q2 = tc.Query(text="What is YTD gross pay?", alias="GROSS_PAY", pages=["2"])
textract_json = tc.call_textract(
    input_document="s3://amazon-textract-public-content/blogs/2-pager.pdf",
    queries_config=tc.QueriesConfig(queries=[q1, q2]),
    features=[tc.Textract_Features.QUERIES],
    force_async_api=True,
    boto3_textract_client=textract)
t_doc: t2.TDocument = t2.TDocumentSchema().load(textract_json)  # type: ignore
for page in t_doc.pages:
    query_answers = t_doc.get_query_answers(page=page)
    for x in query_answers:
        print(f"{x[1]},{x[2]}")
```
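
For readers who want to see where the answers live in the raw response, here is a minimal sketch that walks the JSON directly instead of using the response parser. It relies on the Textract block structure in which a `QUERY` block carries the question and alias and links to `QUERY_RESULT` blocks through an `ANSWER` relationship. The function name `query_answers_from_blocks` and the sample values are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def query_answers_from_blocks(textract_json: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (query text, alias, answer text) for each answered QUERY block."""
    blocks = textract_json.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    results: List[Tuple[str, str, str]] = []
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "ANSWER":
                continue
            for answer_id in rel.get("Ids", []):
                answer = by_id.get(answer_id)
                if answer and answer.get("BlockType") == "QUERY_RESULT":
                    results.append((query.get("Text", ""),
                                    query.get("Alias", ""),
                                    answer.get("Text", "")))
    return results


# Hand-written sample with invented values, not real Textract output.
sample = {"Blocks": [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is YTD gross pay?", "Alias": "GROSS_PAY"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "1000.00"},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the employee SSN?", "Alias": "SSN"}},
]}
for text, alias, answer in query_answers_from_blocks(sample):
    print(f"{alias},{answer}")
```

Queries without an `ANSWER` relationship, such as `q2` in the sample, are skipped.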

### Calling with Custom Queries for a multi-page document using an adapter

This sample also uses amazon-textract-response-parser:

```bash
python -m pip install amazon-textract-caller amazon-textract-response-parser
```

```python
import textractcaller as tc
import trp.trp2 as t2
import boto3

textract = boto3.client('textract', region_name="us-east-2")
q1 = tc.Query(text="What is the employee SSN?", alias="SSN", pages=["1"])
q2 = tc.Query(text="What is YTD gross pay?", alias="GROSS_PAY", pages=["2"])
adapter1 = tc.Adapter(adapter_id="2e9bf1c4aa31", version="1", pages=["1"])
textract_json = tc.call_textract(
    input_document="s3://amazon-textract-public-content/blogs/2-pager.pdf",
    queries_config=tc.QueriesConfig(queries=[q1, q2]),
    adapters_config=tc.AdaptersConfig(adapters=[adapter1]),
    features=[tc.Textract_Features.QUERIES],
    force_async_api=True,
    boto3_textract_client=textract)
t_doc: t2.TDocument = t2.TDocumentSchema().load(textract_json)  # type: ignore
for page in t_doc.pages:
    query_answers = t_doc.get_query_answers(page=page)
    for x in query_answers:
        print(f"{x[1]},{x[2]}")
```


### Calling with a file from the local filesystem with the TABLES feature

```python
from textractcaller import call_textract, Textract_Features
features = [Textract_Features.TABLES]
response = call_textract(
    input_document="/folder/local-filesystem-file.png", features=features)
```
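
Once a response with table data is available, the cells can be rebuilt into rows. The sketch below does this on the raw JSON, relying on the Textract block structure in which `TABLE` blocks reference `CELL` blocks, and `CELL` blocks reference `WORD` blocks, through `CHILD` relationships, with 1-based `RowIndex` and `ColumnIndex` values on each cell. The function names and the sample data are invented for illustration, and merged cells are not handled.

```python
from typing import Any, Dict, List


def _child_ids(block: Dict[str, Any]) -> List[str]:
    """Collect the Ids of all CHILD relationships of a block."""
    return [i for rel in block.get("Relationships", [])
            if rel.get("Type") == "CHILD"
            for i in rel.get("Ids", [])]


def tables_from_blocks(textract_json: Dict[str, Any]) -> List[List[List[str]]]:
    """Return each table as a list of rows, each row a list of cell texts."""
    blocks = textract_json.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    tables: List[List[List[str]]] = []
    for table in (b for b in blocks if b.get("BlockType") == "TABLE"):
        cells = [by_id[i] for i in _child_ids(table)
                 if by_id.get(i, {}).get("BlockType") == "CELL"]
        if not cells:
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [by_id[i].get("Text", "") for i in _child_ids(cell)
                     if by_id.get(i, {}).get("BlockType") == "WORD"]
            # RowIndex and ColumnIndex start at 1.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        tables.append(grid)
    return tables


# Hand-written sample with invented values, not real Textract output.
sample_table = {"Blocks": [
    {"Id": "t", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
    {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Green"},
    {"Id": "w4", "BlockType": "WORD", "Text": "tea"},
]}
for row in tables_from_blocks(sample_table)[0]:
    print(row)
```

In practice the `response` returned by `call_textract` with `Textract_Features.TABLES` would take the place of `sample_table`.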

### Call with an image located on S3 and force the asynchronous API

```python
from textractcaller import call_textract
response = call_textract(input_document="s3://some-bucket/w2-example.png", force_async_api=True)
```

### Call with OutputConfig, Customer-Managed-Key

```python
from textractcaller import call_textract, OutputConfig

# Results are written to the given bucket and prefix, encrypted with the given KMS key.
output_config = OutputConfig(s3_bucket="somebucket-encrypted", s3_prefix="output/")
response = call_textract(input_document="s3://someprefix/somefile.png",
                         force_async_api=True,
                         output_config=output_config,
                         kms_key_id="arn:aws:kms:us-east-1:12345678901:key/some-key-id-ref-erence",
                         return_job_id=False,
                         job_tag="sometag",
                         client_request_token="sometoken")
```

### Call with PDF located on S3 and force return of JobId instead of JSON response

```python
from textractcaller import call_textract
response = call_textract(input_document="s3://some-bucket/some-document.pdf", return_job_id=True)
job_id = response['JobId']
```
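
When only the JobId is returned, the caller decides when to fetch the result. A minimal polling sketch, assuming a client that exposes boto3's `get_document_text_detection`, could look like the following. The function name `wait_for_job`, the `max_attempts` limit and the injectable `sleep` parameter are choices made for this illustration and are not part of amazon-textract-caller.

```python
import time
from typing import Any, Callable


def wait_for_job(textract_client: Any, job_id: str,
                 polling_interval: float = 1.0,
                 max_attempts: int = 600,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll a text-detection job until it leaves IN_PROGRESS and return its final status."""
    for _ in range(max_attempts):
        response = textract_client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            # Typically SUCCEEDED, FAILED or PARTIAL_SUCCESS.
            return status
        sleep(polling_interval)
    raise TimeoutError(f"Job {job_id} still in progress after {max_attempts} attempts")
```

With a real client this might be used as `wait_for_job(boto3.client("textract"), job_id)`, followed by `get_full_json(job_id)` once the status is `SUCCEEDED`. Passing `sleep` as a parameter keeps the function easy to test without real delays.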

            
