# Textract-Pipeline-GeoFinder
Provides functions that use the geometry of a document (the position of words and fields on the page) to extract information.
Use cases include:
* Give context to key/value pairs from the Amazon Textract AnalyzeDocument API for FORMS
* Find values in specific areas
# Install
```bash
python -m pip install amazon-textract-geofinder
```
Make sure your environment is set up with AWS credentials through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
# Concept
The main advantage of this library over hard-coding the x,y coordinates where a value is expected is the concept of an area.
An area is ultimately a box with x_min, y_min, x_max, y_max coordinates, but it can be defined by finding words or phrases in the document and using their positions to create the box.
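To make the idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the `Word` and `Area` classes, the helper names, and all coordinates are made up for illustration. The only thing taken from the text above is that an area is a box whose edges come from phrases found on the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A piece of text with its bounding box (hypothetical type, not the library's)."""
    text: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


@dataclass
class Area:
    """A rectangular region on the page (hypothetical type, not the library's)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, word: Word) -> bool:
        # A word belongs to the area only if its whole box lies inside it.
        return (self.xmin <= word.xmin and word.xmax <= self.xmax
                and self.ymin <= word.ymin and word.ymax <= self.ymax)


def area_between(upper: Word, lower: Word, page_width: float) -> Area:
    """Build the full-width band that starts below `upper` and ends above `lower`."""
    return Area(xmin=0, ymin=upper.ymax, xmax=page_width, ymax=lower.ymin)


def words_in_area(words: List[Word], area: Area) -> List[str]:
    return [w.text for w in words if area.contains(w)]


# Made-up coordinates on a 1000 x 1000 page; y grows downwards.
words = [
    Word("Patient Information", 50, 100, 300, 120),
    Word("ALEJANDRO", 200, 150, 320, 170),
    Word("Emergency Contact 1:", 50, 400, 300, 420),
    Word("CARLOS", 200, 450, 280, 470),
]
patient_area = area_between(words[0], words[2], page_width=1000)
print(words_in_area(words, patient_area))  # ['ALEJANDRO']
```

Because the band is anchored to the two phrases rather than to fixed numbers, it moves with them if the form is scanned at a different offset or scale.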
From there, functions that parse the content of the area help to extract the information. For example, by defining an area based on a question like 'Did you feel fever or feverish lately?' we can associate the answers with it and create a new key/value pair specific to this question.
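The question example can be sketched the same way. The snippet below is an illustration under assumptions, not the library's code: the `Box` class and `answers_for_question` are hypothetical, the coordinates are invented, and it assumes the Yes/No checkboxes sit to the right of the question on the same row, which may not hold for every form.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Box:
    """Text with a bounding box; `status` is only set for checkboxes (hypothetical type)."""
    text: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    status: str = ""  # "SELECTED" or "NOT_SELECTED"


def answers_for_question(question: Box, checkboxes: List[Box], name: str,
                         page_width: float) -> Dict[str, str]:
    """Collect the checkboxes in the band to the right of the question text."""
    result = {}
    for box in checkboxes:
        y_center = (box.ymin + box.ymax) / 2
        same_row = question.ymin <= y_center <= question.ymax
        right_of_question = question.xmax <= box.xmin <= page_width
        if same_row and right_of_question:
            # New key that is specific to this question, e.g. "FEVER->YES".
            result[f"{name}->{box.text.upper()}"] = box.status
    return result


# Made-up layout on a 1000 x 1000 page; the second question text is invented.
fever = Box("Did you feel fever or feverish lately?", 50, 600, 500, 620)
cough = Box("Have you had a cough?", 50, 650, 500, 670)
checkboxes = [
    Box("Yes", 600, 602, 630, 618, "SELECTED"),
    Box("No", 700, 602, 730, 618, "NOT_SELECTED"),
    Box("Yes", 600, 652, 630, 668, "NOT_SELECTED"),
    Box("No", 700, 652, 730, 668, "SELECTED"),
]

fever_answers = answers_for_question(fever, checkboxes, "FEVER", page_width=1000)
print(fever_answers)  # {'FEVER->YES': 'SELECTED', 'FEVER->NO': 'NOT_SELECTED'}
```

The generic `Yes` and `No` keys become question-specific keys, which is what removes the ambiguity between rows.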
# Samples
## Get context for key value pairs
Sample image:
<img src="./tests/data/patient_intake_form_sample.jpg" width=300>
The Amazon Textract AnalyzeDocument API with the FORMS feature returns the following keys:
| Key | Value |
|----------------------------------------------|----------------|
| First Name: | ALEJANDRO |
| First Name: | CARLOS |
| Relationship to Patient: | BROTHER |
| First Name: | JANE |
| Marital Status: | MARRIED |
| Phone: | 646-555-0111 |
| Last Name: | SALAZAR |
| Phone: | 212-555-0150 |
| Relationship to Patient: | FRIEND |
| Last Name: | ROSALEZ |
| City: | ANYTOWN |
| Phone: | 650-555-0123 |
| Address: | 123 ANY STREET |
| Yes | SELECTED |
| Yes | NOT_SELECTED |
| Date of Birth: | 10/10/1982 |
| Last Name: | DOE |
| Sex: | M |
| Yes | NOT_SELECTED |
| Yes | NOT_SELECTED |
| Yes | NOT_SELECTED |
| State: | CA |
| Zip Code: | 12345 |
| Email Address: | |
| No | NOT_SELECTED |
| No | SELECTED |
| No | NOT_SELECTED |
| Yes | SELECTED |
| No | SELECTED |
| No | SELECTED |
| No | SELECTED |
However, it is not obvious which section of the document each key belongs to. Most keys appear multiple times, and we want to give them context by associating them with 'Patient', 'Emergency Contact 1', 'Emergency Contact 2' or specific questions.
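The ambiguity is easy to reproduce with a few rows copied from the table above. This small sketch uses only the Python standard library and is independent of the package; it shows what happens when the pairs are loaded into ordinary data structures.

```python
from collections import defaultdict

# A few rows copied from the table above, in the order they appear.
pairs = [
    ("First Name:", "ALEJANDRO"),
    ("First Name:", "CARLOS"),
    ("First Name:", "JANE"),
    ("Phone:", "646-555-0111"),
    ("Phone:", "212-555-0150"),
    ("Phone:", "650-555-0123"),
    ("Marital Status:", "MARRIED"),
]

# A plain dict keeps only the last value seen for each key.
flat = dict(pairs)
print(flat["First Name:"])  # JANE

# Grouping keeps every value, but still cannot say whose name each one is.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
print(grouped["First Name:"])  # ['ALEJANDRO', 'CARLOS', 'JANE']
```

Neither structure can tell the patient's first name from an emergency contact's, which is the gap the geometric context is meant to fill.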
This Jupyter notebook walks through the sample: [sample notebook](./geofinder-sample-notebook.ipynb)
Make sure AWS credentials are set up when starting the notebook locally, or use a SageMaker notebook with a role that includes permissions for Amazon Textract.
This code snippet is taken from the notebook.
```bash
python -m pip install amazon-textract-helper amazon-textract-geofinder
```
```python
from typing import List

from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, AreaSelection
from textractprettyprinter.t_pretty_print import get_forms_string
from textractcaller import call_textract
from textractcaller.t_call import Textract_Features
import trp.trp2 as t2

image_filename = './tests/data/patient_intake_form_sample.jpg'

# Call the Amazon Textract AnalyzeDocument API with the FORMS feature
j = call_textract(input_document=image_filename, features=[Textract_Features.FORMS])

t_document = t2.TDocumentSchema().load(j)
# Width and height of the coordinate space used for the areas below
doc_height = 1000
doc_width = 1000
geofinder_doc = TGeoFinder(j, doc_height=doc_height, doc_width=doc_width)


def set_hierarchy_kv(list_kv: List[KeyValue], t_document: t2.TDocument, page_block: t2.TBlock, prefix="BORROWER"):
    # Add a prefixed virtual key for every existing key so it carries its section name
    for x in list_kv:
        t_document.add_virtual_key_for_existing_key(key_name=f"{prefix}_{x.key.text}",
                                                    existing_key=t_document.get_block_by_id(x.key.id),
                                                    page_block=page_block)


# patient information: the area between the 'patient information' and 'emergency contact 1:' phrases
patient_information = geofinder_doc.find_phrase_on_page("patient information")[0]
emergency_contact_1 = geofinder_doc.find_phrase_on_page("emergency contact 1:", min_textdistance=0.99)[0]
top_left = t2.TPoint(y=patient_information.ymax, x=0)
lower_right = t2.TPoint(y=emergency_contact_1.ymin, x=doc_width)
form_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right))
set_hierarchy_kv(list_kv=form_fields, t_document=t_document, prefix='PATIENT', page_block=t_document.pages[0])

# Print all key/value pairs, including the new PATIENT_ keys
print(get_forms_string(t2.TDocumentSchema().dump(t_document)))
```
| Key | Value |
|-------------------------|----------------|
| ... | ... |
| PATIENT_first name: | ALEJANDRO |
| PATIENT_address: | 123 ANY STREET |
| PATIENT_sex: | M |
| PATIENT_state: | CA |
| PATIENT_zip code: | 12345 |
| PATIENT_marital status: | MARRIED |
| PATIENT_last name: | ROSALEZ |
| PATIENT_phone: | 646-555-0111 |
| PATIENT_email address: | |
| PATIENT_city: | ANYTOWN |
| PATIENT_date of birth: | 10/10/1982 |
## Using the Amazon Textract Helper command line tool with the sample
This will show the full result, like the notebook.
```bash
python -m pip install amazon-textract-helper amazon-textract-geofinder
cat tests/data/patient_intake_form_sample.json | bin/amazon-textract-geofinder | amazon-textract --stdin --pretty-print FORMS
```
| Key | Value |
|-------------------------|----------------|
| First Name: | ALEJANDRO |
| First Name: | CARLOS |
| Relationship to Patient: | BROTHER |
| First Name: | JANE |
| Marital Status: | MARRIED |
| Phone: | 646-555-0111 |
| Last Name: | SALAZAR |
| Phone: | 212-555-0150 |
| Relationship to Patient: | FRIEND |
| Last Name: | ROSALEZ |
| City: | ANYTOWN |
| Phone: | 650-555-0123 |
| Address: | 123 ANY STREET |
| Yes | SELECTED |
| Yes | NOT_SELECTED |
| Date of Birth: | 10/10/1982 |
| Last Name: | DOE |
| Sex: | M |
| Yes | NOT_SELECTED |
| Yes | NOT_SELECTED |
| Yes | NOT_SELECTED |
| State: | CA |
| Zip Code: | 12345 |
| Email Address: | |
| No | NOT_SELECTED |
| No | SELECTED |
| No | NOT_SELECTED |
| Yes | SELECTED |
| No | SELECTED |
| No | SELECTED |
| No | SELECTED |
| PATIENT_first name: | ALEJANDRO |
| PATIENT_address: | 123 ANY STREET |
| PATIENT_sex: | M |
| PATIENT_state: | CA |
| PATIENT_zip code: | 12345 |
| PATIENT_marital status: | MARRIED |
| PATIENT_last name: | ROSALEZ |
| PATIENT_phone: | 646-555-0111 |
| PATIENT_email address: | |
| PATIENT_city: | ANYTOWN |
| PATIENT_date of birth: | 10/10/1982 |
| EMERGENCY_CONTACT_1_first name: | CARLOS |
| EMERGENCY_CONTACT_1_phone: | 212-555-0150 |
| EMERGENCY_CONTACT_1_relationship to patient: | BROTHER |
| EMERGENCY_CONTACT_1_last name: | SALAZAR |
| EMERGENCY_CONTACT_2_first name: | JANE |
| EMERGENCY_CONTACT_2_phone: | 650-555-0123 |
| EMERGENCY_CONTACT_2_last name: | DOE |
| EMERGENCY_CONTACT_2_relationship to patient: | FRIEND |
| FEVER->YES | SELECTED |
| FEVER->NO | NOT_SELECTED |
| SHORTNESS->YES | NOT_SELECTED |
| SHORTNESS->NO | SELECTED |
| COUGH->YES | NOT_SELECTED |
| COUGH->NO | SELECTED |
| LOSS_OF_TASTE->YES | NOT_SELECTED |
| LOSS_OF_TASTE->NO | SELECTED |
| COVID_CONTACT->YES | SELECTED |
| COVID_CONTACT->NO | NOT_SELECTED |
| TRAVEL->YES | NOT_SELECTED |
| TRAVEL->NO | SELECTED |
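Once the keys carry their context, downstream code can group them by section. The following is a minimal sketch using only the standard library; `group_by_section`, the `SECTIONS` list, and the `QUESTIONS` bucket are hypothetical names chosen for this example, and the input rows are a subset copied from the table above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Prefixes as they appear in the table above.
SECTIONS = ["PATIENT", "EMERGENCY_CONTACT_1", "EMERGENCY_CONTACT_2"]


def group_by_section(pairs: List[Tuple[str, str]]) -> Dict[str, Dict[str, str]]:
    """Turn flat, prefixed key/value pairs into one dict per section."""
    grouped = defaultdict(dict)
    for key, value in pairs:
        if "->" in key:
            # Question keys look like "FEVER->YES"; keep only the selected answer.
            question, answer = key.split("->", 1)
            if value == "SELECTED":
                grouped["QUESTIONS"][question] = answer
            continue
        for section in SECTIONS:
            if key.startswith(section + "_"):
                field = key[len(section) + 1:].rstrip(":")
                grouped[section][field] = value
                break
        # Keys without a known prefix (the original, ambiguous ones) are skipped.
    return dict(grouped)


rows = [
    ("First Name:", "ALEJANDRO"),
    ("PATIENT_first name:", "ALEJANDRO"),
    ("PATIENT_last name:", "ROSALEZ"),
    ("EMERGENCY_CONTACT_1_first name:", "CARLOS"),
    ("EMERGENCY_CONTACT_2_first name:", "JANE"),
    ("FEVER->YES", "SELECTED"),
    ("FEVER->NO", "NOT_SELECTED"),
    ("COUGH->YES", "NOT_SELECTED"),
    ("COUGH->NO", "SELECTED"),
]
record = group_by_section(rows)
print(record["PATIENT"])    # {'first name': 'ALEJANDRO', 'last name': 'ROSALEZ'}
print(record["QUESTIONS"])  # {'FEVER': 'YES', 'COUGH': 'NO'}
```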