# Invoice Extractor
The Invoice Extractor is a small utility that extracts data from invoices in PDF format,
so you don't have to open PDF files one by one and copy the values by hand.
It outputs an xlsx or csv file with the sender (creditor), recipient (debtor), invoice date, total amount, tax, amount with tax, etc.

You can dump invoices in a folder and run this when you need to file your taxes.
If you run it multiple times, it skips files it has already parsed and only parses and appends new ones,
so you can keep adding invoices to the folder and run it from time to time.
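
The skip-on-rerun behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function name `find_new_pdfs`, the idea of reading a csv summary, and the `file_name` column are all assumptions made for the example.

```python
import csv
from pathlib import Path


def find_new_pdfs(folder: Path, summary_csv: Path) -> list[Path]:
    """Return PDFs in `folder` whose file names are not yet in the summary."""
    seen: set[str] = set()
    if summary_csv.exists():
        with summary_csv.open(newline="", encoding="utf-8") as fh:
            # "file_name" is a hypothetical column name used for this sketch.
            seen = {row["file_name"] for row in csv.DictReader(fh)}
    # Only files that have not been recorded yet need to be parsed.
    return sorted(p for p in folder.glob("*.pdf") if p.name not in seen)
```

Because the lookup is by file name in this sketch, a renamed invoice would be picked up again as a new file.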
## How it works
We read the PDF, extract its text, and send that text to OpenAI ChatGPT to pull out the invoice fields.
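
A rough sketch of that flow, under stated assumptions: the field names in `FIELDS`, the prompt wording, and the helper names are hypothetical, and the model call is passed in as a plain function (`ask_llm`) instead of a specific OpenAI client call, so the example stays self-contained.

```python
import json
from typing import Callable

# Hypothetical field names, based on the columns this README lists.
FIELDS = ["sender", "recipient", "invoice_date", "total_amount", "tax", "amount_with_tax"]


def build_prompt(invoice_text: str) -> str:
    """Ask the model for a single JSON object with the fields we want."""
    return (
        "Extract the following fields from the invoice text and reply with "
        f"a single JSON object with the keys {', '.join(FIELDS)}. "
        "Use null for anything you cannot find.\n\n"
        f"Invoice text:\n{invoice_text}"
    )


def extract_fields(invoice_text: str, ask_llm: Callable[[str], str]) -> dict:
    """Send the prompt through `ask_llm` and keep only the expected keys."""
    reply = ask_llm(build_prompt(invoice_text))
    data = json.loads(reply)
    # Missing keys become None; unexpected keys are dropped.
    return {key: data.get(key) for key in FIELDS}
```

In a real run, `invoice_text` would come from a PDF text extractor (the requirements list pdfplumber) and `ask_llm` would wrap whichever model you use, which is also the place to plug in a local LLM.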
## Disclaimers:
- If you have any privacy concerns regarding ChatGPT, you might want to fork this repo and connect it to your own local LLM.
- This works for me. I've tried it with multiple different invoices, and I keep listing my expenses like this.
But statistically I have a very limited dataset, so you may run into issues such as wrongly extracted data.
It is better to double check the results, especially for new invoice types.
- This software is provided "as is" without warranty of any kind. The information and functionality provided by Invoice Extractor are for general informational purposes only and should not be considered as financial, legal, or professional advice. Use the software at your own risk. The project contributors and maintainers are not responsible for any loss, damage, or harm caused by the use of this software. Always consult with a qualified professional for any financial or legal matters.
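
One cheap way to double check extracted values, as suggested in the disclaimers above, is an arithmetic sanity check. The sketch below assumes that "total amount" means the amount before tax, so that total plus tax should equal the amount with tax; that reading and the function name `amounts_consistent` are assumptions, not something the package provides.

```python
from decimal import Decimal


def amounts_consistent(total: str, tax: str, with_tax: str,
                       tolerance: Decimal = Decimal("0.01")) -> bool:
    """True when total + tax matches the amount with tax, within rounding."""
    # Decimal avoids the float rounding surprises that money values invite.
    return abs(Decimal(total) + Decimal(tax) - Decimal(with_tax)) <= tolerance
```

Rows that fail the check are good candidates for a manual look at the original PDF.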
## Installation
1. Make sure you have an OpenAI account and an API key. If not, visit https://openai.com/gpt-4.
2. Export the API key:
```bash
export OPENAI_API_KEY="sk-nct...."
```
3. Make sure you have Python installed (Python 3.10 or later).
4. Install the package using pip:
```
pip install InvoiceExtractor
```
## Usage
The package provides a command-line interface (CLI) to extract data from PDF files. Here are the available commands:
### process-dir
Process all PDF files in a directory. This outputs an xlsx file by default:
```
invoice-extractor process-dir <directory_path>
```
The extracted data will be saved to a summary Excel file named summary.xlsx in the current directory.
#### Example
```
# Process all PDF files in a directory
invoice-extractor process-dir invoices_folder
```
### extract-pdf
Extract data from a single PDF file.
```
# Extract data from a single PDF file
invoice-extractor extract-pdf --pdf-file <path_to_pdf_file>
# example
invoice-extractor extract-pdf --pdf-file invoice.pdf
```
### Questions
Q: Does it only work with invoices in English?
- No, I use it with invoices in Dutch too. The language doesn't matter.
Q: Do I have to name files properly?
- No, file names don't matter at all. But if you rename a file, it will be treated as a new invoice.
Q: I took a photo of an invoice...
- Currently it doesn't work with images. I will probably integrate some OCR in the future.
Q: It doesn't parse my invoice properly.
- Please submit an issue.
Q: I use Google Sheets. Can you output to Google Sheets?
- You can import/upload the summary.xlsx to Google Sheets. It works.
Q: There are other/better SaaS products.
- Yes. I'm releasing this thinking it might be helpful to some; it is not a full-featured product.
### Planned Features
- Split the output in credit/debit style. Currently it puts everything together; I keep an expenses folder and an invoiced folder and use two worksheets.
- Parse receipts/slips, e.g. from a restaurant or gas station. Currently it only handles PDF invoices.
### Dependencies
The Invoice Extractor package depends on the following Python packages. These will be automatically installed during the package installation:
- click
- openpyxl
- openai
- langchain
## Contribute
Contributions are welcome! If you find any issues or have suggestions for improvements, feel free to open a pull request. Please make sure to follow the guidelines in the CONTRIBUTING.md file.
### License
This project is licensed under the MIT License - see the LICENSE file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/ybrs/invoice-extractor",
"name": "InvoiceExtractor",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Aybars Badur",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/87/fc/ae8f80ecf280822ddef59869d0c1d430088a5e5fa9559b0676fc41f4f40f/InvoiceExtractor-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A Python package for extracting data from invoices.",
"version": "0.1.2",
"project_urls": {
"Homepage": "https://github.com/ybrs/invoice-extractor"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "87fcae8f80ecf280822ddef59869d0c1d430088a5e5fa9559b0676fc41f4f40f",
"md5": "63c44ef2f1d495093dae1f65de4dce5f",
"sha256": "6ffb408bdf4b04fa23c2a3ad9ca102a3963d2f757dfe1e758e02fbe47cbcf5fc"
},
"downloads": -1,
"filename": "InvoiceExtractor-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "63c44ef2f1d495093dae1f65de4dce5f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6201,
"upload_time": "2023-07-21T16:07:38",
"upload_time_iso_8601": "2023-07-21T16:07:38.436118Z",
"url": "https://files.pythonhosted.org/packages/87/fc/ae8f80ecf280822ddef59869d0c1d430088a5e5fa9559b0676fc41f4f40f/InvoiceExtractor-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-21 16:07:38",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ybrs",
"github_project": "invoice-extractor",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "pdfplumber",
"specs": [
[
"==",
"0.10.1"
]
]
},
{
"name": "langchain",
"specs": [
[
"==",
"0.0.238"
]
]
},
{
"name": "openai",
"specs": [
[
"==",
"0.27.8"
]
]
},
{
"name": "pdfplumber",
"specs": [
[
"==",
"0.10.1"
]
]
},
{
"name": "pdfminer",
"specs": [
[
"==",
"20191125"
]
]
},
{
"name": "click",
"specs": [
[
"==",
"8.1.6"
]
]
},
{
"name": "openpyxl",
"specs": [
[
"==",
"3.1.2"
]
]
}
],
"lcname": "invoiceextractor"
}