form-analyzer

Name	form-analyzer
Version	0.1.2
home_page	https://github.com/futsch1/form-analyzer
Summary	Python package to analyze scanned questionnaires and forms with AWS Textract and convert the results to an XLSX.
upload_time	2023-10-28 11:32:10
author	Florian Fetz
requires_python	>=3.7
license	MIT
keywords	textract aws form questionnaire xlsx excel
requirements	boto3 amazon-textract-caller amazon-textract-response-parser pdf2image openpyxl coverage

# form-analyzer - A library that uses AWS Textract to automatically evaluate filled forms

[![Build](https://github.com/Futsch1/form-analyzer/actions/workflows/build.yml/badge.svg)](https://github.com/Futsch1/form-analyzer/actions/workflows/build.yml)
[![Documentation Status](https://readthedocs.org/projects/form-analyzer/badge/?version=latest)](https://form-analyzer.readthedocs.io/en/latest/?badge=latest)
[![Coverage Status](https://coveralls.io/repos/github/Futsch1/form-analyzer/badge.svg?branch=main)](https://coveralls.io/github/Futsch1/form-analyzer?branch=main)
[![Maintainability](https://api.codeclimate.com/v1/badges/743708a08f4e8fd7bf7e/maintainability)](https://codeclimate.com/github/Futsch1/form-analyzer/maintainability)

Python package to analyze scanned questionnaires and forms with AWS Textract and convert the results to an XLSX.

You do not need advanced Python programming skills, but a basic understanding is required.

## Prerequisites

- Install form-analyzer using pip

```
pip install form-analyzer
```

- Get an AWS account and [create an access key](https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/)
- If your scanned questionnaires are in PDF format, install the required tools
  for [pdf2image](https://pypi.org/project/pdf2image/)

## Example

For a comprehensive example, see the
[example folder in this project](https://github.com/Futsch1/form-analyzer/tree/main/example).

## Prepare questionnaires

To be processed, the questionnaires need to be converted to a suitable format.
form-analyzer requires PNG files for the upload to AWS Textract. If your data is already in this
format, make sure that the lexicographic order of the file names matches the page order of your forms.

Example:

```
Form1_Page1.png
Form1_Page2.png
Form1_Page3.png
Form2_Page1.png
Form2_Page2.png
Form2_Page3.png
```
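
Lexicographic ordering has a pitfall once a form has ten or more pages: `Page10` sorts before `Page2` unless the page numbers are zero-padded. The following sketch uses only the standard library to show the effect and how sorted file names could be grouped into forms. The helper `group_pages` and the fixed page count are illustrative assumptions and not part of the form-analyzer API.

```python
from typing import List


def group_pages(file_names: List[str], pages_per_form: int) -> List[List[str]]:
    """Sort file names lexicographically and chunk them into forms."""
    ordered = sorted(file_names)
    if len(ordered) % pages_per_form != 0:
        raise ValueError('Number of files is not a multiple of the page count')
    return [ordered[i:i + pages_per_form]
            for i in range(0, len(ordered), pages_per_form)]


# Without zero-padding, page 10 sorts before page 2
unpadded = sorted(['Form1_Page2.png', 'Form1_Page10.png', 'Form1_Page1.png'])

# With zero-padding, the lexicographic order matches the page order
padded = sorted(['Form1_Page02.png', 'Form1_Page10.png', 'Form1_Page01.png'])

# Hypothetical two-page forms given in arbitrary order
forms = group_pages(['Form2_Page1.png', 'Form1_Page2.png',
                     'Form1_Page1.png', 'Form2_Page2.png'], pages_per_form=2)
```

If your scans use unpadded page numbers, renaming them with padded numbers before the upload is probably the simplest fix.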

### Convert PDF files

form-analyzer can convert PDF input files to properly named PNG files ready for upload. Each PDF
page can optionally be post-processed by a custom function to split pages.

Create a Python script like this to convert single page PDF files (assuming that the PDFs are located
in the folder "questionnaires"):

```python
import form_analyzer

form_analyzer.pdf_to_image('questionnaires')
```

The following example shows how to split a single PDF page into two images and how to return only the first page:

```python
import form_analyzer


def one_page_to_two(_: int, image):
    left = image.crop((0, 0, image.width // 2, image.height))
    right = image.crop((image.width // 2, 0, image.width, image.height))

    return [form_analyzer.ProcessedImage(left, '_1'), form_analyzer.ProcessedImage(right, '_2')]


form_analyzer.pdf_to_image('questionnaires', image_processor=one_page_to_two)

form_analyzer.pdf_to_image('questionnaires', 
                           image_processor=lambda image_index, image: [form_analyzer.ProcessedImage(image, '') if image_index == 0 else None])
```

The argument image_processor specifies a function that receives the current PDF page number (starting with 0)
and an [Image](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image) object.
It returns a list of form_analyzer.ProcessedImage objects that contain an Image object and a file name suffix. The list may also contain `None`, in which case the entry is skipped.

The resulting images are stored in the same folder as the PDF source files.

## AWS Textract

The converted images can now be processed by AWS Textract to extract the form data. You can either
provide your AWS access key and region as parameters or set them up according to
[this manual](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).

It is also possible to upload the images to an AWS S3 bucket and analyze them from there. If that's
desired, pass the S3 bucket name and an optional sub folder.

Assuming that the credentials are already set, this script will upload and process the data.

```python
import form_analyzer

form_analyzer.run_textract('questionnaires')
```

The result data is saved as JSON files in the target folder. Before using AWS Textract, the
function checks if result data is already present. If that is the case, the Textract call is skipped.
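
The skip logic described above can be pictured as a check for an existing result file next to each image. This is a minimal sketch under the assumption that each PNG has a result file with the same base name and a `.json` extension. The actual naming used by form-analyzer may differ, and `images_needing_textract` is a hypothetical helper, not a library function.

```python
import tempfile
from pathlib import Path
from typing import List


def images_needing_textract(folder: str) -> List[str]:
    """Return PNG file names in folder that have no JSON result yet.

    Assumption: results are stored as <image base name>.json.
    """
    pending = []
    for image in sorted(Path(folder).glob('*.png')):
        if not image.with_suffix('.json').exists():
            pending.append(image.name)
    return pending


# Demonstration with a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'Form1_Page1.png').touch()
    (Path(tmp) / 'Form1_Page1.json').touch()  # already analyzed
    (Path(tmp) / 'Form1_Page2.png').touch()   # still pending
    pending_images = images_needing_textract(tmp)
```

A check like this can be useful before a run to estimate how many Textract calls, and therefore how much cost, to expect.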

### Work with Textract only

If you do not need the form processing, you can also directly use the generated JSON files with [Textract Response Parser](https://pypi.org/project/amazon-textract-response-parser/).

```python
import glob
import json
import trp

for file_name in glob.glob('*.json'):
    with open(file_name) as f:
        doc = trp.Document([json.load(f)])

    for block in doc.blocks[0]['Blocks']:
        print(block.get('Text'))
```

## Form description

In order to convert your form to a meaningful Excel file, form-analyzer needs to know the expected
form fields. A description has to be provided as a Python module.

This module needs to contain two variables:

- form_fields: The list of form fields
- keywords_per_page: A list of keywords to expect on each page

### form_fields variable

This variable is a list of FormField objects, each of which describes a single field in the form. Each
FormField object consists of a title and a Selector object. The title is the column header in the Excel
file, and the Selector defines the type of the form field and its location.

**_Important_**:
The form description greatly affects the result of the analysis. AWS Textract
often makes slight errors and does not yield 100% correct results. The form description
needs to account for that: on the one hand, it should describe in detail where to look for
form fields; on the other hand, it should keep search strings generic enough that the correct
field is still detected.

#### Selectors

Some selectors require a key, and all require a filter for initialization. The key is the label
of the form field that is searched for in the extracted form data. It is recommended not to
give the full label but a unique part of it, to compensate for potential detection errors.

- SingleSelect: Describes a list of checkboxes where only one may be marked
- MultiSelect: Describes a list of checkboxes where none, one or several may be marked
- TextField: Describes a text input box or input line where free text can be entered
- TextFieldWithCheckbox: Describes a text input field with an additional checkbox
- Number: Special case of TextField where only numbers may be entered
- Placeholder: Results in an empty column in the Excel file

For single and multi selects, additional and alternative text fields can be given. The 
content of the additional field is always added to the output and can be used to handle
optional free text fields. The alternative text field is used when no selection is made.
Both additional and alternative fields can be either TextField, Number or 
TextFieldWithCheckbox.

Note that all text matching is case-insensitive and tolerates small deviations, so
no exact match is required.

See also [the documentation](https://form-analyzer.readthedocs.io/en/latest/selectors.html).
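
To get a feeling for why a short, unique part of a label works better than the full label, the following sketch imitates case-insensitive, approximate matching with `difflib` from the standard library. The source does not state which algorithm or threshold form-analyzer uses, so `fuzzy_contains` and its threshold of 0.8 are stand-ins chosen for illustration only.

```python
from difflib import SequenceMatcher


def fuzzy_contains(key: str, text: str, threshold: float = 0.8) -> bool:
    """Check whether key approximately occurs in text, ignoring case."""
    key, text = key.lower(), text.lower()
    if key in text:
        return True
    # Slide a window of the key's length over the text and compare
    window = len(key)
    for start in range(max(1, len(text) - window + 1)):
        ratio = SequenceMatcher(None, key, text[start:start + window]).ratio()
        if ratio >= threshold:
            return True
    return False


# An OCR error ('l' instead of 'i') still matches
ocr_match = fuzzy_contains('Date of birth', 'DATE OF BlRTH:')

# An unrelated label does not match
no_match = fuzzy_contains('birth', 'Name')
```

The shorter the key, the fewer characters can be misread, but the higher the risk that it also matches an unrelated label, which is where filters help.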

#### Filters

Filters restrict which of the extracted form fields are considered when searching for the current form field. The fewer
candidate fields remain, the higher the probability of a correct result.

Filters can be combined using the & (and) and | (or) operators.

- Page: Restricts the search to a certain page (page numbers starting with 0, so 0 is the first page)
- Pages: Restricts the search to a list of pages
- Location: Restricts the search to a part of the page indicated by horizontal and vertical ranges as page fractions.
- Selected: Restricts the search to fields which are selected checkboxes

Location filters apply to all selection possibilities for single and multi selects and to the label
for text and number fields.

Note that when working with location filters and scanned form pages, the position of certain fields on
the page must be similar for each scan.

See also [the documentation](https://form-analyzer.readthedocs.io/en/latest/filters.html).
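
The way filters combine can be sketched with plain Python operator overloading. The classes and functions below (`Field`, `Filter`, `page`, `location`) are hypothetical stand-ins written for this illustration. They are not the classes from `form_analyzer.filters`, and testing the center point of a field against the location ranges is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for an extracted form field."""
    page: int
    x: float  # horizontal center as a fraction of the page width
    y: float  # vertical center as a fraction of the page height


class Filter:
    """Wraps a predicate and supports combination with & and |."""

    def __init__(self, predicate: Callable[[Field], bool]):
        self.predicate = predicate

    def __call__(self, field: Field) -> bool:
        return self.predicate(field)

    def __and__(self, other: 'Filter') -> 'Filter':
        return Filter(lambda f: self(f) and other(f))

    def __or__(self, other: 'Filter') -> 'Filter':
        return Filter(lambda f: self(f) or other(f))


def page(number: int) -> Filter:
    """Accept fields on the given page (0 is the first page)."""
    return Filter(lambda f: f.page == number)


def location(horizontal: Tuple[float, float] = (0.0, 1.0),
             vertical: Tuple[float, float] = (0.0, 1.0)) -> Filter:
    """Accept fields whose center lies inside the given page fractions."""
    return Filter(lambda f: horizontal[0] <= f.x <= horizontal[1]
                  and vertical[0] <= f.y <= vertical[1])


# Lowest third of the second page or top half of the third page
combined = (page(1) & location(vertical=(.66, 1))) | (page(2) & location(vertical=(.0, .5)))

fields = [Field(0, .5, .9), Field(1, .5, .8), Field(1, .5, .2), Field(2, .3, .4)]
matches = [f for f in fields if combined(f)]
```

Because `&` and `|` bind more tightly than comparison operators but follow their own precedence (`&` before `|`), explicit parentheses around each combined group keep the intent readable.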

#### Examples

```python
from form_analyzer.filters import *
from form_analyzer.selectors import *

# Single select on the first page with two options
single_select = SingleSelect(['First option', 'Second option'], 
                             Page(0))

# Multi select on the top half of the first page
multi_select = MultiSelect(['First option', 'Second option'],
                           Page(0) & Location(vertical=(.0, .5)))

# Text field on the upper left quarter of the first page
text_field = TextField('Field label',
                       Page(0) & Location(horizontal=(.0, .5), vertical=(.0, .5)))

# Single select on the lowest third of the second page or the top half of the third page
single_select_2 = SingleSelect(['First option', 'Second option', 'Third option'],
                               (Page(1) & Location(vertical=(.66, 1))) |
                               (Page(2) & Location(vertical=(.0, .5))))
```

### Keywords per page

The variable keywords_per_page in the form description is used to validate that the correct form is
being analyzed. It is a list of lists of strings. For each page, a list of strings can be given,
at least one of which has to be found among the strings detected by Textract on that page.

If the outer list is empty, no validation is performed; if the list for a single page is empty, that page is not validated.

Example:

```python
# Will search for 'welcome' on the first page and for 'future' or 'past' on the second
keywords_per_page = [['welcome'], ['future', 'past']]
```
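
The validation rule can be sketched as follows. This is an illustration of the rule as described here, not the library's implementation: `validate_pages` is a hypothetical helper, and it uses a plain case-insensitive substring test where form-analyzer may match more leniently.

```python
from typing import List


def validate_pages(keywords_per_page: List[List[str]],
                   page_texts: List[List[str]]) -> List[bool]:
    """Return one flag per page telling whether the page passes validation."""
    results = []
    for index, texts in enumerate(page_texts):
        keywords = keywords_per_page[index] if index < len(keywords_per_page) else []
        if not keywords:
            results.append(True)  # nothing to validate on this page
            continue
        lowered = [text.lower() for text in texts]
        results.append(any(keyword.lower() in text
                           for keyword in keywords for text in lowered))
    return results


keywords_per_page = [['welcome'], ['future', 'past']]

# Strings as they might be detected on each page of a two-page form
good_form = [['Welcome to our survey'], ['Your plans for the future']]
wrong_form = [['Invoice 2023'], ['Your plans for the future']]

good_result = validate_pages(keywords_per_page, good_form)
wrong_result = validate_pages(keywords_per_page, wrong_form)
```

Choosing keywords that appear only on the intended page, such as a page title, makes it more likely that misordered or foreign pages are caught.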

## Form analysis

The data returned from AWS Textract and the form description are the inputs for the final
analysis step that will try to locate all described form fields, get their value in the respective
filled forms and put this in an Excel file.

To run the analysis, use the following script. It assumes that the AWS Textract JSON files and PNGs are located
in the folder "questionnaires" and that a Python module "my_form" containing the form description exists in the
Python search path (usually a "my_form.py" in the current folder). You can optionally pass the name of the
resulting Excel file.

```python
import form_analyzer

form_analyzer.analyze('questionnaires', 'my_form', 'my_form_results')
```

### Results

After analyzing, an Excel file is created. The first column always contains a link to the image of the 
first page of the form. Each uncertain field (meaning that there was some uncertainty during the 
analysis and the result might be incorrect) is also linked to the image of the page where the field
is located.

The results usually need to be checked manually. Depending
on the complexity of the form and the quality of the input scans or PDFs, the Excel file might contain
errors. The number of uncertain fields found is printed after the analysis and can be used as a coarse
measure of the quality of the results.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/futsch1/form-analyzer",
    "name": "form-analyzer",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "textract,AWS,form,questionnaire,xlsx,excel",
    "author": "Florian Fetz",
    "author_email": "florian.fetz@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/e6/d1/e0a5f3d42ea752378b26582dc77d044ac659188193105ebeeb572646dbb8/form-analyzer-0.1.2.tar.gz",
    "platform": null,
    "description": "# form-analyzer - A library that uses AWS Textract to automatically evaluate filled forms\n\n[![Build](https://github.com/Futsch1/form-analyzer/actions/workflows/build.yml/badge.svg)](https://github.com/Futsch1/form-analyzer/actions/workflows/build.yml)\n[![Documentation Status](https://readthedocs.org/projects/form-analyzer/badge/?version=latest)](https://form-analyzer.readthedocs.io/en/latest/?badge=latest)\n[![Coverage Status](https://coveralls.io/repos/github/Futsch1/form-analyzer/badge.svg?branch=main)](https://coveralls.io/github/Futsch1/form-analyzer?branch=main)\n[![Maintainability](https://api.codeclimate.com/v1/badges/743708a08f4e8fd7bf7e/maintainability)](https://codeclimate.com/github/Futsch1/form-analyzer/maintainability)\n\nPython package to analyze scanned questionnaires and forms with AWS Textract and convert the results to an XLSX.\n\nNo thorough Python programming abilities are required, but a basic understanding is needed.\n\n## Prerequisites\n\n- Install form-analyzer using pip\n\n```\npip install form-analyzer\n```\n\n- Get an AWS account and [create an access key](https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/)\n- If your scanned questionnaires are in PDF format, install the required tools\n  for [pdf2image](https://pypi.org/project/pdf2image/)\n\n## Example\n\nFor a comprehensive example, see the \n[example folder in this project](https://github.com/Futsch1/form-analyzer/tree/main/example)\n\n## Prepare questionnaires\n\nIn order to process your input data, the questionnaires need to be converted to a proper format.\nform-analyzer requires PNG files for the upload to AWS Textract. 
If your data is already in this\nformat, make sure that their lexicographic order corresponds to the number of pages in your form.\n\nExample:\n\n```\nForm1_Page1.png\nForm1_Page2.png\nForm1_Page3.png\nForm2_Page1.png\nForm2_Page2.png\nForm2_Page3.png\n```\n\n### Convert PDF files\n\nform-analyzer can convert PDF input files to properly named PNG files ready for upload. Each PDF\npage can optionally be post-processed by a custom function to split pages.\n\nCreate a Python script like this to convert single page PDF files (assuming that the PDFs are located\nin the folder \"questionnaires\"):\n\n```python\nimport form_analyzer\n\nform_analyzer.pdf_to_image('questionnaires')\n```\n\nThe following example shows how to split a single PDF page into two images and how to return only the first page:\n\n```python\nimport form_analyzer\n\n\ndef one_page_to_two(_: int, image):\n    left = image.crop((0, 0, image.width // 2, image.height))\n    right = image.crop((image.width // 2, 0, image.width, image.height))\n\n    return [form_analyzer.ProcessedImage(left, '_1'), form_analyzer.ProcessedImage(right, '_2')]\n\n\nform_analyzer.pdf_to_image('questionnaires', image_processor=one_page_to_two)\n\nform_analyzer.pdf_to_image('questionnaires', \n                           image_processor=lambda image_index, image: [form_analyzer.ProcessedImage(image, '') if image_index == 0 else None])\n```\n\nThe argument image_processor specifies a function that receives the current PDF page number (starting with 0)\nand an [Image](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image) object.\nIt returns a list of form_analyzer.ProcessedImage objects that contain an Image object and a file name suffix. The list may also contain `None`, in which case the entry is skipped.\n\nThe resulting images are stored in the same folder as the PDF source files.\n\n## AWS Textract\n\nThe converted images can now be processed by AWS Textract to extract the form data. 
You can either\nprovide your AWS access key and region as parameters or set them up according to\n[this manual](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).\n\nIt is also possible to upload the images to an AWS S3 bucket and analyze them from there. If that's\ndesired, pass the S3 bucket name and an optional sub folder.\n\nAssuming that the credentials are already set, this script will upload and process the data.\n\n```python\nimport form_analyzer\n\nform_analyzer.run_textract('questionnaires')\n```\n\nThe result data is saved as JSON files in the target folder. Before using AWS Textract, the\nfunction checks if result data is already present. If that is the case, the Textract call is skipped.\n\n### Work with Textract only\n\nIf you do not need the form processing, you can also directly use the generated JSON files with [Textract Response Parser](https://pypi.org/project/amazon-textract-response-parser/).\n\n```python\nimport glob\nimport json\nimport trp\n\nfor file_name in glob.glob('*.json'):\n    with open(file_name) as f:\n        doc = trp.Document([json.load(f)])\n\n    for block in doc.blocks[0]['Blocks']:\n        print(block.get('Text'))\n```\n\n## Form description\n\nIn order to convert your form to a meaningful Excel file, form-analyzer needs to know the expected\nform fields. A description has to be provided as a Python module.\n\nThis module needs to contain two variables:\n\n- form_fields: The list of form fields\n- keywords_per_page: A list of keywords to expect on each page\n\n### form_fields variable\n\nThis variable is a list of FormField objects, which each describes a single field in the form. Each\nFormField object consists of a title and a Selector object. The title is the column header in the Excel\nfile and the Selector defines the type of the form field and its location.\n\n**_Important_**:\nNote that the form description greatly affects the result of the form analyzing process. 
The AWS\nTextract process often has slight errors and does not yield 100% correct results. The form descriptions\nneeds to account for that and on the one hand provide a detailed description of where to look for\nform fields and on the other hand needs to keep search strings generic to help to detect the correct\nfield.\n\n#### Selectors\n\nSome selectors require a key and all require filter for initialization. The key is the label\nof the form field which is searched in the extracted form data. It is recommended to not\nindicate the full label but a unique part of it to compensate for potential detection errors.\n\n- SingleSelect: Describes a list of checkboxes where only one may be marked\n- MultiSelect: Describes a list of checkboxes where none, one or several may be marked\n- TextField: Describes a text input box or input line where free text can be entered\n- TextFieldWithCheckbox: Describes a text input field with an additional checkbox\n- Number: Special case of TextField where only numbers may be entered\n- Placeholder: Results in an empty column in the Excel file\n\nFor single and multi selects, additional and alternative text fields can be given. The \ncontent of the additional field is always added to the output and can be used to handle\noptional free text fields. The alternative text field is used when no selection is made.\nBoth additional and alternative fields can be either TextField, Number or \nTextFieldWithCheckbox.\n\nNote that all text matching will be done case-insensitive and with a certain fuzziness, so that\nno exact match is required.\n\nSee also [the documentation](https://form-analyzer.readthedocs.io/en/latest/selectors.html).\n\n#### Filters\n\nFilters restrict the extracted form fields to search for the current form field. 
The lower the number\nof potential extracted form fields, the higher the probability of correct results.\n\nFilters can be combined using the & (and) and | (or) operator.\n\n- Page: Restricts the search to a certain page (page numbers starting with 0, so 0 is the first page)\n- Pages: Restricts the search to a list of pages\n- Location: Restricts the search to a part of the page indicated by horizontal and vertical ranges as page fractions.\n- Selected: Restricts the search to fields which are selected checkboxes\n\nLocation filters apply to all selection possibilities for single and multi selects and to the label\nfor text and number fields.\n\nNote that when working with location filters and scanned form pages, the position of certain fields on\nthe page must be similar for each scan.\n\nSee also [the documentation](https://form-analyzer.readthedocs.io/en/latest/filters.html).\n\n#### Examples\n\n```python\nfrom form_analyzer.filters import *\nfrom form_analyzer.selectors import *\n\n# Single select on the first page with two options\nsingle_select = SingleSelect(['First option', 'Second option'], \n                             Page(0))\n\n# Multi select on the top half of the first page\nmulti_select = MultiSelect(['First option', 'Second option'],\n                           Page(0) & Location(vertical=(.0, .5)))\n\n# Text field on the upper left quarter of the first page\ntext_field = TextField('Field label',\n                       Page(0) & Location(horizontal=(.0, .5), vertical=(.0, .5)))\n\n# Single select on the lowest third of the second page or the top half of the third page\nsingle_select_2 = SingleSelect(['First option', 'Second option', 'Third option'],\n                               (Page(1) & Location(vertical=(.66, 1))) |\n                               (Page(2) & Location(vertical=(.0, .5))))\n```\n\n### Keywords per page\n\nThe variable keywords_per_page in the form description is used to validate that a correct form is \nbeing analyzed. 
It is a list of a list of strings. For each page, a list of strings can be given \nwhere at least one of them has to be found in the strings discovered by Textract on the page.\n\nIf the list is empty or empty for a single page, no validation is performed.\n\nExample\n\n```python\n# Will search for 'welcome' on the first page and for 'future' or 'past' on the second\nkeywords_per_page = [['welcome'], ['future', 'past']]\n```\n\n## Form analysis\n\nThe data returned from AWS Textract and the form description are the inputs for the final\nanalysis step that will try to locate all described form fields, get their value in the respective\nfilled forms and put this in an Excel file.\n\nTo run the analysis, use the following where the AWS Textract JSON files and PNGs are located\nin the folder \"questionnaires\" and a Python module \"my_form\" exists in the Python search path \nthat contains the form description (this should usually be the current folder, where a \"my_form.py\" is \nlocated). You can optionally pass the name of the resulting Excel file.\n\n```python\nimport form_analyzer\n\nform_analyzer.analyze('questionnaires', 'my_form', 'my_form_results')\n```\n\n### Results\n\nAfter analyzing, an Excel file is created. The first column always contains a link to the image of the \nfirst page of the form. Each uncertain field (meaning that there was some uncertainty during the \nanalysis and the result might be incorrect) is also linked to the image of the page where the field\nis located.\n\nUsually, it is required to manually check the results. The Excel file is not perfect and depending\non the complexity of the form, the quality of the inputs, the PDF quality etc. the file might contain\nerrors. The number of found uncertain fields is printed after the analysis and can be used as a coarse\nmeasure for the quality of the results.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python package to analyze scanned questionnaires and forms with AWS Textract and convert the results to an XLSX.",
    "version": "0.1.2",
    "project_urls": {
        "Documentation": "http://form-analyzer.rtfd.io",
        "Homepage": "https://github.com/futsch1/form-analyzer"
    },
    "split_keywords": [
        "textract",
        "aws",
        "form",
        "questionnaire",
        "xlsx",
        "excel"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "25cbbd71785092b35780834248092deaee749d2d7a8b1d878c0a36ca5947c52b",
                "md5": "f2b55f5e1c0435c0dabc8db7276a50a3",
                "sha256": "4e76561088bb592112a69d49816328741d7cd0f3296489d13f6474bf13c4a6c0"
            },
            "downloads": -1,
            "filename": "form_analyzer-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f2b55f5e1c0435c0dabc8db7276a50a3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 19578,
            "upload_time": "2023-10-28T11:32:08",
            "upload_time_iso_8601": "2023-10-28T11:32:08.690759Z",
            "url": "https://files.pythonhosted.org/packages/25/cb/bd71785092b35780834248092deaee749d2d7a8b1d878c0a36ca5947c52b/form_analyzer-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e6d1e0a5f3d42ea752378b26582dc77d044ac659188193105ebeeb572646dbb8",
                "md5": "0925014a446b8bbb92077f2cd0ab8eff",
                "sha256": "0919250bfc4d04b8bb3333d85e00282b6d25ba395b70cc4fd94a70098b971e9e"
            },
            "downloads": -1,
            "filename": "form-analyzer-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "0925014a446b8bbb92077f2cd0ab8eff",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 19763,
            "upload_time": "2023-10-28T11:32:10",
            "upload_time_iso_8601": "2023-10-28T11:32:10.290275Z",
            "url": "https://files.pythonhosted.org/packages/e6/d1/e0a5f3d42ea752378b26582dc77d044ac659188193105ebeeb572646dbb8/form-analyzer-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-28 11:32:10",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "futsch1",
    "github_project": "form-analyzer",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [
        {
            "name": "boto3",
            "specs": []
        },
        {
            "name": "amazon-textract-caller",
            "specs": []
        },
        {
            "name": "amazon-textract-response-parser",
            "specs": []
        },
        {
            "name": "pdf2image",
            "specs": []
        },
        {
            "name": "openpyxl",
            "specs": []
        },
        {
            "name": "coverage",
            "specs": []
        }
    ],
    "lcname": "form-analyzer"
}
