amazon-textract-textractor

Name	amazon-textract-textractor
Version	1.8.5
home_page	https://github.com/aws-samples/amazon-textract-textractor
Summary	A package to use AWS Textract services.
upload_time	2024-11-13 14:56:19
maintainer	None
docs_url	None
author	None
requires_python	None
license	Apache 2.0
keywords	amazon textract aws ocr document
VCS
bugtrack_url
requirements	amazon-textract-caller Pillow tabulate XlsxWriter editdistance
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

![Textractor](https://raw.githubusercontent.com/aws-samples/amazon-textract-textractor/5716c52e8a39c063f43e058e1637e4984a4b2da4/docs/source/textractor_cropped.png)

[![Tests](https://github.com/aws-samples/amazon-textract-textractor/actions/workflows/tests.yml/badge.svg)](https://github.com/aws-samples/amazon-textract-textractor/actions/workflows/tests.yml) [![Documentation](https://github.com/aws-samples/amazon-textract-textractor/actions/workflows/documentation.yml/badge.svg)](https://aws-samples.github.io/amazon-textract-textractor/) [![PyPI version](https://badge.fury.io/py/amazon-textract-textractor.svg)](https://pypi.org/project/amazon-textract-textractor/) [![Downloads](https://static.pepy.tech/badge/amazon-textract-textractor/month)](https://pepy.tech/project/amazon-textract-textractor) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

**Textractor** is a Python package created to work seamlessly with [Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html), a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or a complex distributed document processing pipeline, Textractor makes it easy to use Textract.

If you are looking for the other amazon-textract-* packages, you can find them using the links below:

- [amazon-textract-caller](https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller) (to simplify calling Amazon Textract without additional dependencies)
- [amazon-textract-response-parser](https://pypi.org/project/amazon-textract-response-parser/) (to parse the JSON response returned by Textract APIs)
- [amazon-textract-overlayer](https://github.com/aws-samples/amazon-textract-textractor/tree/master/overlayer) (to draw bounding boxes around the document entities on the document image)
- [amazon-textract-prettyprinter](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter) (convert Amazon Textract response to CSV, text, markdown, ...)
- [amazon-textract-geofinder](https://github.com/aws-samples/amazon-textract-textractor/tree/master/tpipelinegeofinder) (extract specific information from a document with methods that help navigate the document using geometry and relations, e.g. hierarchical key/value pairs)

## Installation

Textractor is available on PyPI and can be installed with `pip install amazon-textract-textractor`. By default, this installs the minimal version of Textractor, which is suitable for AWS Lambda execution. The following extras can be used to add features:

- `pandas` (`pip install "amazon-textract-textractor[pandas]"`) installs pandas which is used to enable DataFrame and CSV exports.
- `pdfium` (`pip install "amazon-textract-textractor[pdfium]"`) includes `pypdfium2` and is the recommended way to enable PDF rasterization in Textractor. Note that this is **not** necessary to call Textract with a PDF file.
- `pdf` (`pip install "amazon-textract-textractor[pdf]"`) includes `pdf2image` and is an alternative way to enable PDF rasterization in Textractor. Note that this is **not** necessary to call Textract with a PDF file.
- `torch` (`pip install "amazon-textract-textractor[torch]"`) includes `sentence_transformers` for better word search and matching. This will work on CPU but will be noticeably slower than approaches that do not rely on machine learning.
- `dev` (`pip install "amazon-textract-textractor[dev]"`) includes all the dependencies above and everything else needed to test the code.

You can pick several extras by separating the labels with commas, like this: `pip install "amazon-textract-textractor[pdf,torch]"`.
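To see which extras are usable in the current environment, a small check along these lines may help. It is a minimal sketch that only uses the standard library; the mapping from extra name to module name is taken from the list above, and the helper `available_extras` is a hypothetical name, not part of Textractor.

```python
from importlib.util import find_spec

# Extra name -> module that the extra is expected to make importable
# (taken from the list of extras above).
EXTRA_MODULES = {
    "pandas": "pandas",
    "pdfium": "pypdfium2",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def available_extras(modules=EXTRA_MODULES):
    """Return {extra: bool} telling whether each extra's module can be imported."""
    return {
        extra: find_spec(module) is not None
        for extra, module in modules.items()
    }


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra:8s} {'installed' if present else 'missing'}")
```

A missing entry only means the optional feature is unavailable; the minimal install still works without it.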

## Documentation

Generated documentation for the latest released version can be accessed here: [aws-samples.github.io/amazon-textract-textractor/](https://aws-samples.github.io/amazon-textract-textractor/)

## Examples

While a few simple examples are presented here, the documentation has a much [larger collection of examples](https://aws-samples.github.io/amazon-textract-textractor/examples.html) with specific case studies that will help you get started.

### Setup

These two lines are all you need to start using Textract. The same `Textractor` instance can be reused across multiple synchronous and asynchronous requests.

```py
from textractor import Textractor

extractor = Textractor(profile_name="default")
```

### Text recognition

```py
# file_source can be an image, list of images, bytes or S3 path
document = extractor.detect_document_text(file_source="tests/fixtures/single-page-1.png")
print(document.lines)
#[Textractor Test, Document, Page (1), Key - Values, Name of package: Textractor, Date : 08/14/2022, Table 1, Cell 1, Cell 2, Cell 4, Cell 5, Cell 6, Cell 7, Cell 8, Cell 9, Cell 10, Cell 11, Cell 12, Cell 13, Cell 14, Cell 15, Selection Element, Selected Checkbox, Un-Selected Checkbox]
```

### Table extraction

```py
from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
    file_source="tests/fixtures/form.png",
    features=[TextractFeatures.TABLES]
)
# Saves the table in an excel document for further processing
document.tables[0].to_excel("output.xlsx")
```

### Form extraction

```py
from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
    file_source="tests/fixtures/form.png",
    features=[TextractFeatures.FORMS]
)
# Use document.get() to search for a key with fuzzy matching
document.get("email")
# [E-mail Address : johndoe@gmail.com]
```
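To give an intuition for why the query `"email"` can find the key `E-mail Address`, here is a self-contained sketch of fuzzy key matching based on edit distance. It is an illustration only and not Textractor's actual implementation: the functions `edit_distance`, `key_score` and `find_keys`, the normalization step, and the `0.6` threshold are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Lowercase and drop everything that is not a letter or a digit."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def similarity(query: str, candidate: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    q, c = normalize(query), normalize(candidate)
    longest = max(len(q), len(c))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(q, c) / longest


def key_score(query: str, key: str) -> float:
    """Best similarity between the query and the whole key or any of its words."""
    return max(similarity(query, part) for part in [key] + key.split())


def find_keys(query: str, keys, threshold: float = 0.6):
    """Return the keys scoring at least `threshold`, best match first."""
    scored = sorted(((key_score(query, key), key) for key in keys), reverse=True)
    return [key for score, key in scored if score >= threshold]


if __name__ == "__main__":
    print(find_keys("email", ["E-mail Address", "Phone Number", "Full Name"]))
```

Comparing against individual words as well as the whole key lets a short query match a longer label, at the cost of occasional false positives when the threshold is set too low.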

### Analyze ID

```py
document = extractor.analyze_id(file_source="tests/fixtures/fake_id.png")
print(document.identity_documents[0].get("FIRST_NAME"))
# 'MARIA'
```

### Receipt processing (Analyze Expense)

```py
document = extractor.analyze_expense(file_source="tests/fixtures/receipt.jpg")
print(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
# '$1810.46'
```
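The total comes back as text (`'$1810.46'` in the example above), so it usually has to be converted before doing arithmetic. The sketch below shows one way to do that with `decimal.Decimal`. The helper `parse_amount` is hypothetical and not part of Textractor, and it assumes a period as the decimal separator and commas only as thousands separators.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text: str) -> Decimal:
    """Convert a string such as '$1810.46' or '1,810.46' to a Decimal.

    Assumption: '.' is the decimal separator; every character other than
    digits, '.' and '-' (currency symbols, commas, spaces) is discarded.
    """
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch in ".-")
    try:
        return Decimal(cleaned)
    except InvalidOperation as error:
        raise ValueError(f"not a recognizable amount: {text!r}") from error


if __name__ == "__main__":
    print(parse_amount("$1810.46") + parse_amount("1,000.00"))
```

`Decimal` is used instead of `float` so that currency values keep their exact two-digit precision when summed.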

If your use case was not covered here or if you are looking for asynchronous usage examples, see [our collection of examples](https://aws-samples.github.io/amazon-textract-textractor/examples.html).

## CLI

Textractor also comes with the `textractor` script, which supports calling, printing and overlaying directly in the terminal. 

`textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES`

![overlay_example](images/amzn.png)

See [the documentation](https://aws-samples.github.io/amazon-textract-textractor/commandline.html) for more examples.
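Once the command has written `output.json`, a quick way to inspect it is to count the block types it contains. The sketch below assumes that the file holds a raw Textract response, that is, a JSON object with a top-level `Blocks` list whose items carry a `BlockType` field. The helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_block_types(response: dict) -> Counter:
    """Count the blocks of each type in a Textract-style response dictionary."""
    return Counter(
        block.get("BlockType", "UNKNOWN")
        for block in response.get("Blocks", [])
    )


def summarize_file(path) -> Counter:
    """Load a JSON file and count its block types."""
    with open(path, encoding="utf-8") as handle:
        return count_block_types(json.load(handle))


if __name__ == "__main__":
    output_path = Path("output.json")  # file name used in the command above
    if output_path.exists():
        for block_type, count in summarize_file(output_path).most_common():
            print(f"{block_type:20s} {count}")
```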

## Tests

The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.

## Acknowledgements

This library was made possible by the work of Srividhya Radhakrishna ([@srividh-r](https://github.com/srividh-r)).

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md)

## Citing

Textractor can be cited using:

```
@software{amazontextractor,
  author = {Belval, Edouard and Delteil, Thomas and Schade, Martin and Radhakrishna, Srividhya},
  title = {{Amazon Textractor}},
  url = {https://github.com/aws-samples/amazon-textract-textractor},
  version = {1.8.5},
  year = {2024}
}
```

Alternatively, use the `CITATION.cff` file.

## License

This library is licensed under the Apache 2.0 License.

<sub><sup>Excavator image by macrovector on Freepik</sup></sub>

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/aws-samples/amazon-textract-textractor",
    "name": "amazon-textract-textractor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "amazon textract aws ocr document",
    "author": null,
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/5a/4b/4b68c49bdb31db2994e52edc634372b5b2d0aa7f5d45a5532505e287f467/amazon-textract-textractor-1.8.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "A package to use AWS Textract services.",
    "version": "1.8.5",
    "project_urls": {
        "Homepage": "https://github.com/aws-samples/amazon-textract-textractor"
    },
    "split_keywords": [
        "amazon",
        "textract",
        "aws",
        "ocr",
        "document"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "932db1cf5e1b408d7220ec3ac8b397bebd12ce03a060d0c50d1bb956120b1e1c",
                "md5": "b1de1e8f45c6fcff83a9a0b6f4566f67",
                "sha256": "f6b3b2399d20694d2ddee630e220c7986ecdef1d93a3a34cf1f2a5b3713ccf74"
            },
            "downloads": -1,
            "filename": "amazon_textract_textractor-1.8.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b1de1e8f45c6fcff83a9a0b6f4566f67",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 309346,
            "upload_time": "2024-11-13T14:56:17",
            "upload_time_iso_8601": "2024-11-13T14:56:17.699449Z",
            "url": "https://files.pythonhosted.org/packages/93/2d/b1cf5e1b408d7220ec3ac8b397bebd12ce03a060d0c50d1bb956120b1e1c/amazon_textract_textractor-1.8.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5a4b4b68c49bdb31db2994e52edc634372b5b2d0aa7f5d45a5532505e287f467",
                "md5": "6299cd2db87bb5e7307c300d18f777bf",
                "sha256": "d0f5ab90ebdae5084d4280fc9ecb56b6a644b790c33ffd33abceb67c12608f67"
            },
            "downloads": -1,
            "filename": "amazon-textract-textractor-1.8.5.tar.gz",
            "has_sig": false,
            "md5_digest": "6299cd2db87bb5e7307c300d18f777bf",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 290084,
            "upload_time": "2024-11-13T14:56:19",
            "upload_time_iso_8601": "2024-11-13T14:56:19.734322Z",
            "url": "https://files.pythonhosted.org/packages/5a/4b/4b68c49bdb31db2994e52edc634372b5b2d0aa7f5d45a5532505e287f467/amazon-textract-textractor-1.8.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-13 14:56:19",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "aws-samples",
    "github_project": "amazon-textract-textractor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "amazon-textract-caller",
            "specs": [
                [
                    ">=",
                    "0.2.4"
                ],
                [
                    "<",
                    "1"
                ]
            ]
        },
        {
            "name": "Pillow",
            "specs": []
        },
        {
            "name": "tabulate",
            "specs": [
                [
                    "<",
                    "0.10"
                ],
                [
                    ">=",
                    "0.9"
                ]
            ]
        },
        {
            "name": "XlsxWriter",
            "specs": [
                [
                    ">=",
                    "3.0"
                ],
                [
                    "<",
                    "4"
                ]
            ]
        },
        {
            "name": "editdistance",
            "specs": [
                [
                    "<",
                    "0.9"
                ],
                [
                    ">=",
                    "0.6.2"
                ]
            ]
        }
    ],
    "lcname": "amazon-textract-textractor"
}
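The `sha256` values listed under `digests` can be used to check a downloaded file before installing it. The following is a minimal sketch using only `hashlib`; the helper names are hypothetical, and the local file name is assumed to match the `filename` field of the sdist entry.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex: str) -> bool:
    """Compare a file's SHA-256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Values copied from the sdist entry in the raw data above.
    archive = Path("amazon-textract-textractor-1.8.5.tar.gz")
    expected = "d0f5ab90ebdae5084d4280fc9ecb56b6a644b790c33ffd33abceb67c12608f67"
    if archive.exists():
        print("digest OK" if matches_digest(archive, expected) else "digest MISMATCH")
```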
