<p align="center">
<img src="https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/_static/pypdf-table-extraction.png" width="200">
</p>
# pypdf_table_extraction (Camelot): PDF Table Extraction for Humans
[![tests](https://github.com/py-pdf/pypdf_table_extraction/actions/workflows/tests.yml/badge.svg)](https://github.com/py-pdf/pypdf_table_extraction/actions/workflows/tests.yml) [![Documentation Status](https://readthedocs.org/projects/pypdf-table-extraction/badge/?version=latest)](https://pypdf-table-extraction.readthedocs.io/en/latest/)
[![codecov.io](https://codecov.io/github/py-pdf/pypdf_table_extraction/badge.svg?branch=main&service=github)](https://codecov.io/github/py-pdf/pypdf_table_extraction/?branch=main)
[![image](https://img.shields.io/pypi/v/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/) [![image](https://img.shields.io/pypi/l/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/) [![image](https://img.shields.io/pypi/pyversions/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/)
**pypdf_table_extraction** Formerly known as [Camelot](https://github.com/camelot-dev/camelot) is a Python library that can help you extract tables from PDFs!
---
**Here's how you can extract tables from PDFs.**
You can check out the quickstart notebook. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/pypdf_table_extraction_quick_start_notebook.ipynb)
Or follow the example below.
You can check out the PDF used in this example [here](https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/_static/pdf/foo.pdf).
```python3
>>> import pypdf_table_extraction
>>> tables = pypdf_table_extraction.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
'accuracy': 99.02,
'whitespace': 12.24,
'order': 1,
'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
```
| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |
| ---------- | --------- | ------------- | -------------------- | --------------- | --------------- | -------------- |
| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
pypdf_table_extraction also comes packaged with a [command-line interface](https://pypdf-table-extraction.readthedocs.io/en/latest/user/cli.html)!
Refer to the [QuickStart Guide](https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/user/quickstart.rst#quickstart) to quickly get started with pypdf_table_extraction, extract tables from PDFs and explore some basic options.
**Tip:** Visit the `parser-comparison-notebook` to get an overview of all the packed parsers and their features. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/parser-comparison-notebook.ipynb)
**Note:** pypdf_table_extraction only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
You can check out some frequently asked questions [here](https://pypdf-table-extraction.readthedocs.io/en/latest/user/faq.html).
## Why pypdf_table_extraction?
- **Configurability**: pypdf_table_extraction gives you control over the table extraction process with [tweakable settings](https://pypdf-table-extraction.readthedocs.io/en/latest/user/advanced.html).
- **Metrics**: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
- **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.
See [comparison with similar libraries and tools](https://github.com/py-pdf/pypdf_table_extraction/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
## Installation
### Using conda
The easiest way to install pypdf_table_extraction is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
```bash
conda install -c conda-forge pypdf-table-extraction
```
### Using pip
After [installing the dependencies](https://pypdf-table-extraction.readthedocs.io/en/latest/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install pypdf_table_extraction:
```bash
pip install pypdf-table-extraction
```
### From the source code
After [installing the dependencies](https://pypdf-table-extraction.readthedocs.io/en/latest/user/install.html#using-pip), clone the repo using:
```bash
git clone https://github.com/py-pdf/pypdf_table_extraction.git
```
and install using pip:
```
cd pypdf_table_extraction
pip install "."
```
## Documentation
The documentation is available at [http://pypdf-table-extraction.readthedocs.io/](http://pypdf-table-extraction.readthedocs.io/).
## Wrappers
- [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot.
## Related projects
- [camelot-sharp](https://github.com/BobLd/camelot-sharp) provides a C sharp implementation of pypdf_table_extraction (Camelot).
## Contributing
The [Contributor's Guide](https://pypdf-table-extraction.readthedocs.io/en/latest/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.
## Versioning
pypdf_table_extraction uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out the [releases](https://github.com/py-pdf/pypdf_table_extraction/releases) page.
## License
This project is licensed under the MIT License, see the [LICENSE](https://github.com/py-pdf/pypdf_table_extraction/blob/main/LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/py-pdf/pypdf_table_extraction",
"name": "pypdf-table-extraction",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Vinayak Mehta",
"author_email": "vmehta94@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/c7/cf/05deec62decea3609cc88a394493bd31e524bab46f4fda0e1f37487c049c/pypdf_table_extraction-1.0.1.tar.gz",
"platform": null,
"description": "<p align=\"center\">\n <img src=\"https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/_static/pypdf-table-extraction.png\" width=\"200\">\n</p>\n\n# pypdf_table_extraction (Camelot): PDF Table Extraction for Humans\n\n[![tests](https://github.com/py-pdf/pypdf_table_extraction/actions/workflows/tests.yml/badge.svg)](https://github.com/py-pdf/pypdf_table_extraction/actions/workflows/tests.yml) [![Documentation Status](https://readthedocs.org/projects/pypdf-table-extraction/badge/?version=latest)](https://pypdf-table-extraction.readthedocs.io/en/latest/)\n[![codecov.io](https://codecov.io/github/py-pdf/pypdf_table_extraction/badge.svg?branch=main&service=github)](https://codecov.io/github/py-pdf/pypdf_table_extraction/?branch=main)\n[![image](https://img.shields.io/pypi/v/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/) [![image](https://img.shields.io/pypi/l/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/) [![image](https://img.shields.io/pypi/pyversions/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/)\n\n**pypdf_table_extraction** Formerly known as [Camelot](https://github.com/camelot-dev/camelot) is a Python library that can help you extract tables from PDFs!\n\n---\n\n**Here's how you can extract tables from PDFs.**\nYou can check out the quickstart notebook. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/pypdf_table_extraction_quick_start_notebook.ipynb)\n\nOr follow the example below.\nYou can check out the PDF used in this example [here](https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/_static/pdf/foo.pdf).\n\n```python3\n>>> import pypdf_table_extraction\n>>> tables = pypdf_table_extraction.read_pdf('foo.pdf')\n>>> tables\n<TableList n=1>\n>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite\n>>> tables[0]\n<Table shape=(7, 7)>\n>>> tables[0].parsing_report\n{\n 'accuracy': 99.02,\n 'whitespace': 12.24,\n 'order': 1,\n 'page': 1\n}\n>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite\n>>> tables[0].df # get a pandas DataFrame!\n```\n\n| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |\n| ---------- | --------- | ------------- | -------------------- | --------------- | --------------- | -------------- |\n| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |\n| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |\n| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |\n| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |\n| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |\n| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |\n\npypdf_table_extraction also comes packaged with a [command-line interface](https://pypdf-table-extraction.readthedocs.io/en/latest/user/cli.html)!\n\nRefer to the [QuickStart Guide](https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/user/quickstart.rst#quickstart) to quickly get started with pypdf_table_extraction, extract tables from PDFs and explore some basic options.\n\n**Tip:** Visit the `parser-comparison-notebook` to get an overview of all the packed parsers and their features. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/parser-comparison-notebook.ipynb)\n\n**Note:** pypdf_table_extraction only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), \"If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based\".)\n\nYou can check out some frequently asked questions [here](https://pypdf-table-extraction.readthedocs.io/en/latest/user/faq.html).\n\n## Why pypdf_table_extraction?\n\n- **Configurability**: pypdf_table_extraction gives you control over the table extraction process with [tweakable settings](https://pypdf-table-extraction.readthedocs.io/en/latest/user/advanced.html).\n- **Metrics**: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.\n- **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.\n\nSee [comparison with similar libraries and tools](https://github.com/py-pdf/pypdf_table_extraction/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).\n\n## Installation\n\n### Using conda\n\nThe easiest way to install pypdf_table_extraction is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.\n\n```bash\nconda install -c conda-forge pypdf-table-extraction\n```\n\n### Using pip\n\nAfter [installing the dependencies](https://pypdf-table-extraction.readthedocs.io/en/latest/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install pypdf_table_extraction:\n\n```bash\npip install pypdf-table-extraction\n```\n\n### From the source code\n\nAfter [installing the dependencies](https://pypdf-table-extraction.readthedocs.io/en/latest/user/install.html#using-pip), clone the repo using:\n\n```bash\ngit clone https://github.com/py-pdf/pypdf_table_extraction.git\n```\n\nand install using pip:\n\n```\ncd pypdf_table_extraction\npip install \".\"\n```\n\n## Documentation\n\nThe documentation is available at [http://pypdf-table-extraction.readthedocs.io/](http://pypdf-table-extraction.readthedocs.io/).\n\n## Wrappers\n\n- [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot.\n\n## Related projects\n\n- [camelot-sharp](https://github.com/BobLd/camelot-sharp) provides a C sharp implementation of pypdf_table_extraction (Camelot).\n\n## Contributing\n\nThe [Contributor's Guide](https://pypdf-table-extraction.readthedocs.io/en/latest/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.\n\n## Versioning\n\npypdf_table_extraction uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out the [releases](https://github.com/py-pdf/pypdf_table_extraction/releases) page.\n\n## License\n\nThis project is licensed under the MIT License, see the [LICENSE](https://github.com/py-pdf/pypdf_table_extraction/blob/main/LICENSE) file for details.\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "PDF Table Extraction for Humans.",
"version": "1.0.1",
"project_urls": {
"Changelog": "https://github.com/py-pdf/pypdf_table_extraction/releases",
"Documentation": "https://pypdf-table-extraction.readthedocs.io/",
"Homepage": "https://github.com/py-pdf/pypdf_table_extraction",
"Repository": "https://github.com/py-pdf/pypdf_table_extraction"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "df781ec233d059bfa4b3b9820f7af18b89ff5f3df6ddbf28e5bb81e76acea6bc",
"md5": "2f95bd301bab39a50497358bb31dca52",
"sha256": "70361798ec5a152c96828e5c26f1b7ce569e523172df5d8731594f6908cdfe19"
},
"downloads": -1,
"filename": "pypdf_table_extraction-1.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2f95bd301bab39a50497358bb31dca52",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.8",
"size": 71915,
"upload_time": "2024-11-13T20:52:53",
"upload_time_iso_8601": "2024-11-13T20:52:53.588682Z",
"url": "https://files.pythonhosted.org/packages/df/78/1ec233d059bfa4b3b9820f7af18b89ff5f3df6ddbf28e5bb81e76acea6bc/pypdf_table_extraction-1.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c7cf05deec62decea3609cc88a394493bd31e524bab46f4fda0e1f37487c049c",
"md5": "49a002f36b01e56ffe919b95234ba75c",
"sha256": "32e14f5ee1b53242d4b4d76cd5294e5d200c7f8303cc58f1ce278a62749c327c"
},
"downloads": -1,
"filename": "pypdf_table_extraction-1.0.1.tar.gz",
"has_sig": false,
"md5_digest": "49a002f36b01e56ffe919b95234ba75c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.8",
"size": 60254,
"upload_time": "2024-11-13T20:52:54",
"upload_time_iso_8601": "2024-11-13T20:52:54.729227Z",
"url": "https://files.pythonhosted.org/packages/c7/cf/05deec62decea3609cc88a394493bd31e524bab46f4fda0e1f37487c049c/pypdf_table_extraction-1.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-13 20:52:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "py-pdf",
"github_project": "pypdf_table_extraction",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pypdf-table-extraction"
}