# Textractor-Textract-Helper
amazon-textract-helper provides a collection of ready-to-use functions and sample implementations to speed up evaluation and development for any project using Amazon Textract.
It installs a command line tool called ```amazon-textract```.
# Install
```bash
> python -m pip install amazon-textract-helper
```
Make sure your environment is set up with AWS credentials through configuration files, environment variables, or an attached role. (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)
# Test
```bash
> amazon-textract --help
usage: amazon-textract [-h] (--input-document INPUT_DOCUMENT | --example | --stdin) [--features {FORMS,TABLES} [{FORMS,TABLES} ...]]
[--pretty-print {WORDS,LINES,FORMS,TABLES} [{WORDS,LINES,FORMS,TABLES} ...]]
[--pretty-print-table-format {csv,plain,simple,github,grid,fancy_grid,pipe,orgtbl,jira,presto,pretty,psql,rst,mediawiki,moinmoin,youtrack,html,unsafehtml,latex,latex_raw,latex_booktabs,latex_longtable,textile,tsv}]
[--overlay {WORD,LINE,FORM,KEY,VALUE,TABLE,CELL} [{WORD,LINE,FORM,KEY,VALUE,TABLE,CELL} ...]]
[--pop-up-overlay-output] [--overlay-output-folder OVERLAY_OUTPUT_FOLDER] [--version] [--no-stdout] [-v | -vv]
optional arguments:
-h, --help show this help message and exit
--input-document INPUT_DOCUMENT
s3 object (s3://) or file from local filesystem
--example using the example document to call Textract
--stdin receive JSON from stdin
--features {FORMS,TABLES} [{FORMS,TABLES} ...]
features to call Textract with. Will trigger call to AnalyzeDocument instead of DetectDocumentText
--pretty-print {WORDS,LINES,FORMS,TABLES} [{WORDS,LINES,FORMS,TABLES} ...]
--pretty-print-table-format {csv,plain,simple,github,grid,fancy_grid,pipe,orgtbl,jira,presto,pretty,psql,rst,mediawiki,moinmoin,youtrack,html,unsafehtml,latex,latex_raw,latex_booktabs,latex_longtable,textile,tsv}
which format to output the pretty print information to. Only effects FORMS and TABLES
--overlay {WORD,LINE,FORM,KEY,VALUE,TABLE,CELL} [{WORD,LINE,FORM,KEY,VALUE,TABLE,CELL} ...]
defines what bounding boxes to draw on the output
--pop-up-overlay-output
shows image with overlay
--overlay-text shows image with WORD or LINE text overlay. When both WORD and LINE overlay are specified, WORD text will be overlayed
--overlay-confidence shows image with confidence overlay
--overlay-output-folder OVERLAY_OUTPUT_FOLDER
output with bounding boxes to folder
--version print version information
--no-stdout no output to stdout
-v >=INFO level logging output to stderr
-vv >=DEBUG level logging output to stderr
```
# Sample Commands
## Easy Start
```bash
> amazon-textract --example
```
This runs the example document using the DetectDocumentText API.
Output is printed to stdout and looks similar to this (abbreviated):
```json
{"DocumentMetadata": {"Pages": 1}, "Blocks": [{"BlockType": "PAGE", "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}, "Polygon": [{"X": 9.33321120033382e-17, "Y": 0.0}, {"X": 1.0, "Y": 1.6069064689339292e-16}, {"X": 1.0, "Y": 1.0}],
"HTTPHeaders": {"x-amzn-requestid": "12345678-1234-1234-1234-123456789012", "content-type": "application/x-amz-json-1.1", "content-length": "48177", "date": "Thu, 01 Apr 2021 21:50:29 GMT"}, "RetryAttempts": 0}}
```
If you see output like this, the tool is working.
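The JSON printed to stdout can be post-processed with any JSON tool. As a minimal sketch, assuming the output follows the response shape shown above (a top-level `Blocks` list whose items carry a `BlockType`), the following Python counts the blocks per type. The sample document and the function name `count_block_types` are made up for illustration; they are not part of amazon-textract-helper.

```python
import json
from collections import Counter

# Hand-made stand-in for the JSON printed by `amazon-textract --example`.
# A real response contains many more blocks and fields.
sample_output = json.dumps({
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Employment Application"},
        {"BlockType": "WORD", "Text": "Employment"},
        {"BlockType": "WORD", "Text": "Application"},
    ],
})


def count_block_types(textract_json: str) -> Counter:
    """Count how many blocks of each BlockType a Textract response contains."""
    response = json.loads(textract_json)
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


counts = count_block_types(sample_output)
print(dict(counts))
```

With DetectDocumentText you would typically expect only PAGE, LINE and WORD blocks; other block types appear when features such as FORMS or TABLES are requested.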
## Call with document on S3
```bash
> amazon-textract --input-document "s3://somebucket/someprefix/someobjectname.png"
```
The output is similar to Easy Start.
## Call with document on local file system
```bash
> amazon-textract --input-document "./somepath/somefilename.png"
```
The output is similar to Easy Start.
The remaining samples use the ```--example``` parameter to keep them simple and easy to reproduce. S3 and local files work the same way: use ```--input-document <location>``` instead of ```--example```.
## Call with STDIN
```bash
# first create JSON
amazon-textract --example > example.json
# now use the stored JSON with the amazon-textract command
cat example.json | amazon-textract --stdin --pretty-print LINES
```
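The same stored JSON can also be piped into your own script. The sketch below is an assumption about how such a script might look, not the tool's implementation: it reads a Textract response from a text stream and returns the text of every LINE block. The name `read_lines` is hypothetical, and an in-memory stream stands in for stdin so the example runs without AWS access.

```python
import io
import json
from typing import List, TextIO


def read_lines(stream: TextIO) -> List[str]:
    """Return the text of every LINE block in a Textract JSON document read from a stream."""
    response = json.load(stream)
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Stand-in for piped input; a real script would pass sys.stdin instead.
fake_stdin = io.StringIO(json.dumps({
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Full Name: Jane Doe"},
        {"BlockType": "WORD", "Text": "Full"},
        {"BlockType": "LINE", "Text": "Phone Number: 555-0100"},
    ]
}))

lines = read_lines(fake_stdin)
print("\n".join(lines))
```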
## Call with FORMS and TABLES
```bash
> amazon-textract --example --features FORMS TABLES
```
This will call the [AnalyzeDocument API](https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html).
The output will look similar to "Easy Start" but will also include FORMS and TABLES information.
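In an AnalyzeDocument response, form fields are represented as `KEY_VALUE_SET` blocks: a KEY block points to its VALUE block through a `VALUE` relationship, and both point to their WORD blocks through `CHILD` relationships. The following is a minimal sketch of resolving these links by hand, using a tiny hand-made block list; the helper names `text_of` and `key_values` are hypothetical and this is not how amazon-textract-helper is necessarily implemented.

```python
from typing import Dict, List


def text_of(block: dict, blocks_by_id: Dict[str, dict]) -> str:
    """Join the text of a block's WORD children into one string."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_values(blocks: List[dict]) -> Dict[str, str]:
    """Map each form key text to its value text."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    result: Dict[str, str] = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    text_of(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        result[text_of(block, blocks_by_id)] = value_text
    return result


# Hand-made sample: one key ("Full Name:") linked to one value ("Jane Doe").
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Full"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Doe"},
]

forms = key_values(sample_blocks)
print(forms)
```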
## Pretty print the output
Pretty print outputs nicely formatted information for words, lines, forms or tables.
For example, to print the tables identified by Amazon Textract to stdout, use:
```bash
> amazon-textract --example --features TABLES --pretty-print TABLES
```
Output will look like this:
```text
|------------|-----------|---------------------|-----------------|-----------------------|
| | | Previous Employment | History | |
| Start Date | End Date | Employer Name | Position Held | Reason for leaving |
| 1/15/2009 | 6/30/2011 | Any Company | Assistant Baker | Family relocated |
| 7/1/2011 | 8/10/2013 | Best Corp. | Baker | Better opportunity |
| 8/15/2013 | present | Example Corp. | Head Baker | N/A, current employer |
```
To pretty print both FORMS and TABLES:
```bash
> amazon-textract --example --features FORMS TABLES --pretty-print FORMS TABLES
```
This will output:
```text
Phone Number:: 555-0100
Home Address:: 123 Any Street, Any Town, USA
Full Name:: Jane Doe
Mailing Address:: same as home address
|------------|-----------|---------------------|-----------------|-----------------------|
| | | Previous Employment | History | |
| Start Date | End Date | Employer Name | Position Held | Reason for leaving |
| 1/15/2009 | 6/30/2011 | Any Company | Assistant Baker | Family relocated |
| 7/1/2011 | 8/10/2013 | Best Corp. | Baker | Better opportunity |
| 8/15/2013 | present | Example Corp. | Head Baker | N/A, current employer |
```
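To illustrate where the rows and columns come from: in an AnalyzeDocument response, a TABLE block lists its CELL blocks as `CHILD` relationships, and each CELL carries a 1-based `RowIndex` and `ColumnIndex` plus its own WORD children. The sketch below rebuilds a grid of cell texts from a small hand-made block list. The function name `table_rows` is hypothetical, and merged cells are ignored to keep the example short.

```python
from typing import Dict, List


def table_rows(table: dict, blocks_by_id: Dict[str, dict]) -> List[List[str]]:
    """Rebuild a table as a list of rows from its CELL blocks."""
    cells = [
        blocks_by_id[cell_id]
        for rel in table.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for cell_id in rel["Ids"]
        if blocks_by_id[cell_id]["BlockType"] == "CELL"
    ]
    if not cells:
        return []
    n_rows = max(cell["RowIndex"] for cell in cells)
    n_cols = max(cell["ColumnIndex"] for cell in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [
            blocks_by_id[word_id]["Text"]
            for rel in cell.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for word_id in rel["Ids"]
            if blocks_by_id[word_id]["BlockType"] == "WORD"
        ]
        # RowIndex and ColumnIndex are 1-based in the response.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return grid


def word(block_id: str, text: str) -> dict:
    return {"Id": block_id, "BlockType": "WORD", "Text": text}


def cell(block_id: str, row: int, col: int, word_ids: List[str]) -> dict:
    block = {"Id": block_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col}
    if word_ids:
        block["Relationships"] = [{"Type": "CHILD", "Ids": word_ids}]
    return block


# Hand-made 2x2 sample; the bottom-right cell is left empty on purpose.
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    cell("c1", 1, 1, ["w1", "w2"]),
    cell("c2", 1, 2, ["w3", "w4"]),
    cell("c3", 2, 1, ["w5"]),
    cell("c4", 2, 2, []),
    word("w1", "Start"), word("w2", "Date"),
    word("w3", "End"), word("w4", "Date"),
    word("w5", "1/15/2009"),
]

by_id = {b["Id"]: b for b in sample_blocks}
rows = table_rows(by_id["t1"], by_id)
for row in rows:
    print(" | ".join(row))
```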
## Overlay
**At the moment overlay only works with images; we will add support for PDF soon.**
The following command runs DetectDocumentText, pretty prints the WORDS in the document to stdout, draws a bounding box around each WORD, displays the result in a popup window, and stores it in a folder called 'overlay-output-folder-name'.
```bash
amazon-textract --example --pretty-print WORDS --overlay WORD --pop-up-overlay-output --overlay-output-folder overlay-output-folder-name
```
<img src="https://github.com/aws-samples/amazon-textract-textractor/blob/master/helper/docs/employmentapp_boxed_WORD_.png" alt="Sample overlay WORD" width="50%" height="50%" border="1">
The following command runs AnalyzeDocument for FORMS and TABLES, pretty prints FORMS and TABLES to stdout, draws bounding boxes around each table CELL and each FORM KEY/VALUE, displays the result in a popup window, and stores it in the folder '../mywonderfuloutputfolderfordocs/'.
```bash
> amazon-textract --example --features TABLES FORMS --pretty-print FORMS TABLES --overlay FORM CELL --pop-up-overlay-output --overlay-output-folder ../mywonderfuloutputfolderfordocs/
```
<img src="https://github.com/aws-samples/amazon-textract-textractor/blob/master/helper/docs/employmentapp_boxed_FORM_CELL_.png" alt="Sample overlay FORM CELL" width="50%" height="50%" border="1">
The following command draws bounding boxes around each WORD, overlays the detected WORD text, displays the result in a popup window, and stores it in a folder called 'overlay-output-folder-name'.
```bash
> amazon-textract --example --overlay WORD --overlay-text --pop-up-overlay-output --overlay-output-folder overlay-output-folder-name
```
<img src="https://github.com/aws-samples/amazon-textract-textractor/blob/master/helper/docs/employmentapp_boxed_WORD_TEXT_OVERLAY.png" alt="Sample overlay WORD with overlay text" width="50%" height="50%" border="1">
The following command draws bounding boxes around each LINE, overlays the LINE text along with the confidence percentage of the detected text, displays the result in a popup window, and stores it in a folder called 'overlay-output-folder-name'.
```bash
> amazon-textract --example --overlay LINE --overlay-text --overlay-confidence --pop-up-overlay-output --overlay-output-folder overlay-output-folder-name
```
<img src="https://github.com/aws-samples/amazon-textract-textractor/blob/master/helper/docs/employmentapp_boxed_LINE_TEXT_OVERLAY.png" alt="Sample overlay LINE with overlay text and confidence percentage" width="50%" height="50%" border="1">
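Textract reports each `BoundingBox` as ratios of the page width and height (the PAGE block in the Easy Start output spans `Width: 1.0`, `Height: 1.0`), so drawing a box on an image requires scaling by the image size in pixels. The following is a minimal sketch of that conversion under this assumption. It is not the tool's drawing code, and the function name `to_pixel_box` and the image size are made up for illustration.

```python
from typing import Tuple


def to_pixel_box(bounding_box: dict, image_width: int, image_height: int) -> Tuple[int, int, int, int]:
    """Convert a ratio-based Textract BoundingBox to (left, top, right, bottom) in pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return left, top, right, bottom


# Hypothetical box covering the middle half of a 1000x800 pixel image.
box = to_pixel_box(
    {"Width": 0.5, "Height": 0.25, "Left": 0.25, "Top": 0.5},
    image_width=1000,
    image_height=800,
)
print(box)  # (250, 400, 750, 600)
```

The resulting tuple has the (left, top, right, bottom) layout that most imaging libraries accept for rectangles.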
Raw data
{
"_id": null,
"home_page": "https://github.com/aws-samples/amazon-textract-textractor/tree/master/helper",
"name": "amazon-textract-helper",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "amazon-textract-textractor amazon textract textractor helper",
"author": "Amazon Rekognition Textract Demoes",
"author_email": "rekognition-textract-demos@amazon.com",
"download_url": "https://files.pythonhosted.org/packages/8c/2b/e6b0aca31d5504bac5d4f9b47b8ba6cbb816c0bc787f3839456aa61138a8/amazon-textract-helper-0.0.35.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License Version 2.0",
"summary": "Amazon Textract Helper tools",
"version": "0.0.35",
"project_urls": {
"Homepage": "https://github.com/aws-samples/amazon-textract-textractor/tree/master/helper"
},
"split_keywords": [
"amazon-textract-textractor",
"amazon",
"textract",
"textractor",
"helper"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6ea6a16de3e84d357a68357dcda378bc80cd2def24e0065dfc0e692128f8d056",
"md5": "2463f4fafea9515da5e21f7620ba8037",
"sha256": "4f727e480da17d5cf8060a5b1f1c045f61331f782c90cbd0084b20fd92808ab9"
},
"downloads": -1,
"filename": "amazon_textract_helper-0.0.35-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "2463f4fafea9515da5e21f7620ba8037",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.6",
"size": 298117,
"upload_time": "2023-11-14T16:47:29",
"upload_time_iso_8601": "2023-11-14T16:47:29.034904Z",
"url": "https://files.pythonhosted.org/packages/6e/a6/a16de3e84d357a68357dcda378bc80cd2def24e0065dfc0e692128f8d056/amazon_textract_helper-0.0.35-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8c2be6b0aca31d5504bac5d4f9b47b8ba6cbb816c0bc787f3839456aa61138a8",
"md5": "d0d8c2a102ac3adaa3f179503637f414",
"sha256": "e5dc1afc4ade3cb5aa247ca556ee26a2e8e2b5e3aa7e176f3449df445262a53d"
},
"downloads": -1,
"filename": "amazon-textract-helper-0.0.35.tar.gz",
"has_sig": false,
"md5_digest": "d0d8c2a102ac3adaa3f179503637f414",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 1419526,
"upload_time": "2023-11-14T16:47:31",
"upload_time_iso_8601": "2023-11-14T16:47:31.730003Z",
"url": "https://files.pythonhosted.org/packages/8c/2b/e6b0aca31d5504bac5d4f9b47b8ba6cbb816c0bc787f3839456aa61138a8/amazon-textract-helper-0.0.35.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-14 16:47:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "aws-samples",
"github_project": "amazon-textract-textractor",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "amazon-textract-helper"
}