pdf-statement-reader

Name	pdf-statement-reader
Version	0.3.4
home_page	None
Summary	PDF Statement Reader
upload_time	2025-01-16 10:09:04
maintainer	None
docs_url	None
author	None
requires_python	>=3.13
license	MIT License Copyright (c) 2019 Marlan Perumal Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	bank statement digitise pdf reader statement

# PDF Statement Reader
[![PyPI version](https://badge.fury.io/py/pdf-statement-reader.svg)](https://badge.fury.io/py/pdf-statement-reader)
[![Coverage Status](https://coveralls.io/repos/github/marlanperumal/pdf_statement_reader/badge.svg)](https://coveralls.io/github/marlanperumal/pdf_statement_reader)

Python library and command line tool for parsing PDF bank statements.

Inspired by https://github.com/antonburger/pdf2csv

## Objectives

Banks generally send account statements as PDF files. These files are often encrypted, tables are difficult to extract from the PDF format, and once you finally get a table out it is rarely in a tidy format. This package aims to help by providing a library of functions and a set of command line tools for converting these statements into more useful formats such as CSV files and pandas DataFrames.

## Installation

Python and package management are set up with [uv](https://docs.astral.sh/uv/). With uv installed and the repo cloned, run

```bash
uv sync
```

The CLI tool can then be invoked with

```bash
uv run psr
```

Alternatively, you can use the `uvx` command to run the tool without installing it

```bash
uvx --from pdf-statement-reader psr
```

Or, to be less verbose on each call, first run

```bash
uv tool install pdf-statement-reader
```

Then you'll be able to simply run

```bash
psr
```

### Troubleshooting

This package uses [tabula-py](https://github.com/chezou/tabula-py) under the hood, which itself is a wrapper for [tabula-java](https://github.com/tabulapdf/tabula-java). You therefore need to have Java installed for it to work. If you get any errors complaining about Java, check out the `tabula-py` page for troubleshooting advice.

In the future, we hope to move to a pure python implementation.

## Usage

The package provides a command line application `psr`

```
Usage: psr [OPTIONS] COMMAND [ARGS]...

  Utility for reading bank and other statements in pdf form

Options:
  --help  Show this message and exit.

Commands:
  bulk      Bulk converts all files in a folder
  decrypt   Decrypts a pdf file Uses pikepdf to open an encrypted pdf file...
  pdf2csv   Converts a pdf statement to a csv file using a given format
  validate  Validates the csv statement rolling balance
```

## Configuration

PDF files are notoriously difficult to extract data from. (Here's a nice [blog post](https://www.propublica.org/nerds/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult) on why). For a really good semi-manual GUI solution, check out [tabula](https://tabula.technology/). In fact this package uses tabula's pdf parsing library under the hood.

Since bank statements are generally of the same (if inconvenient) format, we can set up a configuration to tell the tool how to grab the data.

For each type of bank statement, the exact format will be different. A config file holds the instructions for how to process the raw pdf. For now, the only config supported is for cheque account statements from Absa bank in South Africa.

To set up a different statement, you can simply add a new config file and then tell the `psr` tool to use it. These config files are stored in a folder structure as follows:

    config > [country code] > [bank] > [statement type].json

So for example the default config is stored in

    config > za > absa > cheque.json

The config spec is a code of the form

    [country code].[bank].[statement type]

Once again for the default this will be

    za.absa.cheque
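
As a minimal sketch of the mapping described above, a config spec can be turned into a file path as follows. The helper name `config_path` and the `root` argument are hypothetical and used only for illustration; they are not taken from the package's own code.

```python
from pathlib import Path


def config_path(spec: str, root: Path = Path("config")) -> Path:
    """Map a spec such as 'za.absa.cheque' to config/za/absa/cheque.json.

    Hypothetical helper for illustration; not part of the psr API.
    """
    parts = spec.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected [country code].[bank].[statement type], got {spec!r}"
        )
    country, bank, statement_type = parts
    return root / country / bank / f"{statement_type}.json"


print(config_path("za.absa.cheque").as_posix())  # config/za/absa/cheque.json
```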

The configuration file itself is in JSON format. Here's the Absa cheque account one with some commentary to explain what each field does.

The dimensions to be supplied in the `area` and `columns` parameters are specified in points (pts), where 72 pts equal 1 inch. For reference, letter size paper is 8.5 x 11.0 inches (612 x 792 pts) and A4 paper is 8.27 x 11.69 inches (approximately 595 x 842 pts). The origin (0, 0) is located at the top left corner of the page. This is probably the most intuitive convention, but note that it differs from the PDF standard, which places the origin at the *bottom* left of the page.
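
The unit arithmetic can be checked with a few lines of Python. This is only a sketch of the conversions described here; the function names are made up for illustration and do not come from the package.

```python
PTS_PER_INCH = 72
MM_PER_INCH = 25.4


def inches_to_pts(inches: float) -> float:
    """Convert a length in inches to points."""
    return inches * PTS_PER_INCH


def mm_to_pts(mm: float) -> float:
    """Convert a length in millimetres to points."""
    return mm / MM_PER_INCH * PTS_PER_INCH


def top_origin_to_pdf_y(y_from_top: float, page_height_pts: float) -> float:
    """Convert a y coordinate measured from the top of the page (as used
    in the config) to one measured from the bottom (PDF convention)."""
    return page_height_pts - y_from_top


# Letter paper: 8.5 x 11.0 inches
print(inches_to_pts(8.5), inches_to_pts(11.0))  # 612.0 792.0
# A4 paper: 210 x 297 mm
print(round(mm_to_pts(210), 1), round(mm_to_pts(297), 1))  # 595.3 841.9
```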

```json5
{
    "$schema": "https://raw.githubusercontent.com/marlanperumal/pdf_statement_reader/develop/pdf_statement_reader/config/psr_config.schema.json",
    // Describes the page layout that should be scanned
    "layout": { 
        // Default layout for all pages not otherwise defined
        "default": {
            // The page coordinates of the area containing the table in pts
            // [top, left, bottom, right]
            "area": [280, 27, 763, 576],
            // The right x coordinate of each column in the table in pts
            "columns": [83, 264, 344, 425, 485, 570]
        },
        // Layout for the first page
        "first": {
            "area": [480, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        }
    },

    // The column names to be used, exactly as they appear
    // in the statement
    "columns": {
        "trans_date": "Date",
        "trans_type": "Transaction Description",
        "trans_detail": "Transaction Detail",
        "debit": "Debit Amount",
        "credit": "Credit Amount",
        "balance": "Balance"
    },

    // The order of the columns to be output in the csv
    "order": [
        "trans_date",
        "trans_type",
        "trans_detail",
        "debit",
        "credit",
        "balance"
    ],

    // Specifies any cleaning operations required
    "cleaning": {
        // Convert these columns to numeric
        "numeric": ["debit", "credit", "balance"],
        // Convert these columns to date
        "date": ["trans_date"],
        // Use this date format to parse any date columns
        "date_format": "%d/%m/%Y",
        // For cases where the transaction detail is stored
        // in the next line below the transaction type
        "trans_detail": "below",
        // Only keep the rows where these columns are populated
        "dropna": ["balance"]
    }
}
```
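
To make the `cleaning` options more concrete, here is a rough, standard-library-only sketch of what `numeric`, `date`, `date_format` and `dropna` could do to already-extracted rows. It is not the package's implementation (which may well use pandas), the `clean_rows` function is hypothetical, and the handling of thousands separators is an assumption about how amounts are printed. The `trans_detail` option is left out.

```python
from datetime import datetime
from decimal import Decimal

cleaning = {
    "numeric": ["debit", "credit", "balance"],
    "date": ["trans_date"],
    "date_format": "%d/%m/%Y",
    "dropna": ["balance"],
}


def clean_rows(rows, cleaning):
    """Apply the cleaning options to a list of dicts of raw strings."""
    cleaned = []
    for raw in rows:
        row = dict(raw)
        # dropna: skip rows where any of the listed columns is empty
        if any(not (row.get(c) or "").strip() for c in cleaning.get("dropna", [])):
            continue
        # numeric: strip spaces and (assumed) comma thousands separators
        for col in cleaning.get("numeric", []):
            text = (row.get(col) or "").replace(",", "").replace(" ", "")
            row[col] = Decimal(text) if text else None
        # date: parse with the configured format
        for col in cleaning.get("date", []):
            text = (row.get(col) or "").strip()
            row[col] = (
                datetime.strptime(text, cleaning["date_format"]).date()
                if text
                else None
            )
        cleaned.append(row)
    return cleaned


rows = [
    {"trans_date": "01/02/2024", "debit": "", "credit": "1,500.00", "balance": "2,000.00"},
    {"trans_date": "", "debit": "", "credit": "", "balance": ""},  # no balance: dropped
]
cleaned = clean_rows(rows, cleaning)
print(len(cleaned), cleaned[0]["trans_date"], cleaned[0]["balance"])
```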

These were the configuration options that were required for the default format. It is envisaged that as more formats are added, the list of options will grow.

This format is also captured in `pdf_statement_reader/config/psr_config.schema.json` as a [json-schema](https://json-schema.org/understanding-json-schema/index.html). If you're using vscode or some other compatible text editor, you should get autocompletion hints as long as you include that `$schema` tag at the top of your json file.

A key part of setting up a new configuration is getting the page coordinates for the area and columns. The easiest way to do this is to run the [tabula GUI](https://tabula.technology/), autodetect the page areas, save the settings as a template, then download and inspect the JSON template file. It's not a one-to-one mapping to the psr config, but hopefully it will be a good starting point.
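
As a starting point, the selections in a tabula template can be rearranged into the `[top, left, bottom, right]` order used by `area`. The sketch below assumes the template is a JSON list of selections with `page`, `x1`, `y1`, `x2` and `y2` fields; check this against your own exported file. The function name and the sample values are made up for illustration, and the `columns` boundaries still have to be worked out by hand.

```python
import json


def template_to_layout(template_json: str) -> dict:
    """Build a psr-style layout from a tabula template (assumed format).

    The selection on page 1 becomes "first"; the first selection found
    on any other page becomes "default".
    """
    layout = {}
    for sel in json.loads(template_json):
        # psr order is [top, left, bottom, right]
        area = [round(sel[key]) for key in ("y1", "x1", "y2", "x2")]
        name = "first" if sel["page"] == 1 else "default"
        layout.setdefault(name, {"area": area})
    return layout


# Illustrative template contents, not taken from a real statement
template = """[
    {"page": 1, "x1": 27.2, "y1": 479.9, "x2": 575.8, "y2": 763.1},
    {"page": 2, "x1": 27.2, "y1": 280.4, "x2": 575.8, "y2": 763.1}
]"""
layout = template_to_layout(template)
print(layout["first"]["area"])    # [480, 27, 763, 576]
print(layout["default"]["area"])  # [280, 27, 763, 576]
```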

## CLI API

### decrypt

```
Usage: psr decrypt [OPTIONS] INPUT_FILENAME [OUTPUT_FILENAME]

  Decrypts a pdf file

  Uses pikepdf to open an encrypted pdf file and then save the unencrypted
  version. If no output_filename is specified then overwrites the original
  file.

Options:
  -p, --password TEXT  The pdf encryption password. If not supplied, it will
                       be requested at the prompt
  --help               Show this message and exit.
```

### pdf2csv

```
Usage: psr pdf2csv [OPTIONS] INPUT_FILENAME [OUTPUT_FILENAME]

  Converts a pdf statement to a csv file using a given format

Options:
  -c, --config TEXT  The configuration code defining how the file should be
                     parsed  [default: za.absa.cheque]
  --help             Show this message and exit.
```

### validate

```
Usage: psr validate [OPTIONS] INPUT_FILENAME

  Validates the csv statement rolling balance

Options:
  -c, --config TEXT  The configuration code defining how the file should be
                     parsed  [default: za.absa.cheque]
  --help             Show this message and exit.
```
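
The idea behind a rolling balance check can be sketched as follows: each row's balance should equal the previous balance plus the credit minus the debit. This is an illustration of the concept only, not the tool's actual logic. It assumes debits are stored as positive amounts and that empty amounts have been replaced with zero; the function name `find_balance_breaks` is hypothetical.

```python
from decimal import Decimal


def find_balance_breaks(rows):
    """Return the indexes of rows whose balance does not follow from
    the previous row's balance (assumes positive debit amounts)."""
    breaks = []
    for i in range(1, len(rows)):
        expected = rows[i - 1]["balance"] + rows[i]["credit"] - rows[i]["debit"]
        if expected != rows[i]["balance"]:
            breaks.append(i)
    return breaks


D = Decimal
rows = [
    {"debit": D("0"), "credit": D("0"), "balance": D("1000.00")},
    {"debit": D("250.00"), "credit": D("0"), "balance": D("750.00")},
    {"debit": D("0"), "credit": D("100.00"), "balance": D("850.00")},
    {"debit": D("50.00"), "credit": D("0"), "balance": D("810.00")},  # should be 800.00
]
print(find_balance_breaks(rows))  # [3]
```

Using `Decimal` rather than `float` keeps the comparison exact for currency amounts.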

### bulk

```
Usage: psr bulk [OPTIONS] FOLDER

  Bulk converts all files in a folder

Options:
  -c, --config TEXT          The configuration code defining how the file
                             should be parsed  [default: za.absa.cheque]
  -p, --password TEXT        The pdf encryption password. If not supplied, it
                             will be requested at the prompt
  -d, --decrypt-suffix TEXT  The suffix to append to the decrypted pdf file
                             when created  [default: _decrypted]
  -k, --keep-decrypted       Keep the a copy of the decrypted file. It is
                             removed by default
  -v, --verbose              Print verbose output while running
  --help                     Show this message and exit.
```

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pdf-statement-reader",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.13",
    "maintainer_email": null,
    "keywords": "bank statement, digitise, pdf, reader, statement",
    "author": null,
    "author_email": "Marlan Perumal <marlan.perumal@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/8b/89/9100929a9bdc93ac08517bca132f60e529456edd576a7cfcdea22b843240/pdf_statement_reader-0.3.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2019 Marlan Perumal  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
    "summary": "PDF Statement Reader",
    "version": "0.3.4",
    "project_urls": {
        "homepage": "https://github.com/marlanperumal/pdf_statement_reader",
        "issues": "https://github.com/marlanperumal/pdf_statement_reader/issues",
        "source": "https://github.com/marlanperumal/pdf_statement_reader"
    },
    "split_keywords": [
        "bank statement",
        " digitise",
        " pdf",
        " reader",
        " statement"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0d7cb5a3dbc6942fe6fc3f3da4e6b43f775aeb686e1b6b5f3ee60345a91ab2fc",
                "md5": "efa0554bb8948d86e5dbc51e4c63779b",
                "sha256": "abd3da133eb310f25fbe0d9fa6991d6291c4bcce714217ade45c8f8bbcfdf6e6"
            },
            "downloads": -1,
            "filename": "pdf_statement_reader-0.3.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "efa0554bb8948d86e5dbc51e4c63779b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.13",
            "size": 12908,
            "upload_time": "2025-01-16T10:09:04",
            "upload_time_iso_8601": "2025-01-16T10:09:04.042423Z",
            "url": "https://files.pythonhosted.org/packages/0d/7c/b5a3dbc6942fe6fc3f3da4e6b43f775aeb686e1b6b5f3ee60345a91ab2fc/pdf_statement_reader-0.3.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8b899100929a9bdc93ac08517bca132f60e529456edd576a7cfcdea22b843240",
                "md5": "e62f1d015a8547b5ade107b5f50f3707",
                "sha256": "131012daf4963ff864c124b1d0fe4306f469111116fb75052f44effb8e2f2bfe"
            },
            "downloads": -1,
            "filename": "pdf_statement_reader-0.3.4.tar.gz",
            "has_sig": false,
            "md5_digest": "e62f1d015a8547b5ade107b5f50f3707",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.13",
            "size": 9455,
            "upload_time": "2025-01-16T10:09:04",
            "upload_time_iso_8601": "2025-01-16T10:09:04.958084Z",
            "url": "https://files.pythonhosted.org/packages/8b/89/9100929a9bdc93ac08517bca132f60e529456edd576a7cfcdea22b843240/pdf_statement_reader-0.3.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-16 10:09:04",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "marlanperumal",
    "github_project": "pdf_statement_reader",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pdf-statement-reader"
}
