# PDF to CSV Converter
<p>
<a href="https://pypi.org/project/pdf2csv" target="_blank">
<img src="https://img.shields.io/pypi/v/pdf2csv?color=%2334D058&label=pypi%20package" alt="Package version">
</a>
<a href="https://pypi.org/project/pdf2csv" target="_blank">
<img src="https://img.shields.io/pypi/pyversions/pdf2csv.svg?color=%2334D058" alt="Supported Python versions">
</a>
<a href="https://codecov.io/gh/ghodsizadeh/pdf2csv" target="_blank">
<img src="https://codecov.io/gh/ghodsizadeh/pdf2csv/branch/main/graph/badge.svg" alt="codecov">
</a>
<a>
<img src="https://img.shields.io/github/license/ghodsizadeh/pdf2csv" alt="License">
</a>
<img src="https://img.shields.io/github/stars/ghodsizadeh/pdf2csv" alt="Stars">
<img src="https://img.shields.io/github/issues/ghodsizadeh/pdf2csv" alt="Issues">
<!-- downloads -->
<!-- <img src="https://pepy.tech/badge/pdf2csv" alt="Downloads"> -->
</p>
<img src='./docs/logo.webp' style="display: block; margin: auto; width: 400px; height: auto; box-shadow: 0px 4px 6px rgba(0, 0, 0, 0.1); border-radius: 35px;
padding:40px">
pdf2csv extracts tables from PDF files with the Docling library and saves them as CSV or XLSX files, optionally reversing text for right-to-left languages.
## How It Works
1. **PDF Input**: Provide the path to the PDF file you want to convert.
2. **Table Extraction**: The tool uses Docling's `DocumentConverter` to extract tables from the PDF.
3. **DataFrame Conversion**: Each extracted table is converted into a pandas DataFrame.
4. **Optional Text Reversal**: If the `rtl` option is enabled, text in the DataFrame is reversed.
5. **CSV/XLSX Output**: The DataFrames are saved as CSV or XLSX files in the specified output directory.
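Steps 4 and 5 can be illustrated with a minimal standard-library sketch. It is not the package's implementation, which works on pandas DataFrames; it assumes that "reversing" means flipping the character order of each string cell, and the helper names `reverse_cells` and `rows_to_csv` are hypothetical. Latin placeholder strings stand in for right-to-left text.

```python
import csv
import io


def reverse_cells(rows):
    """Reverse the character order of every string cell; leave other values as they are."""
    return [
        [cell[::-1] if isinstance(cell, str) else cell for cell in row]
        for row in rows
    ]


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# A table whose text cells were extracted in reversed (visual) order.
table = [["eman", "ega"], ["ilA", 30]]

print(rows_to_csv(reverse_cells(table)))
# name,age
# Ali,30
```

Plain code-point reversal like this can misplace combining marks, so treat it as an illustration of the idea rather than a complete right-to-left fix.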
## Dependencies
This project depends heavily on the [Docling](https://github.com/docling/docling) library for PDF table extraction.
## CLI Usage
You can use the CLI tool to convert PDF files to CSV or XLSX:
```sh
pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose
```
Example:
```sh
pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose
```
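When calling the CLI from a script, it can help to assemble the argument list programmatically. The sketch below uses only the subcommand and flags shown above; the helper name `build_command` and its default values are assumptions for illustration, not part of pdf2csv.

```python
import shlex


def build_command(pdf_path, output_dir="./output", output_format="csv",
                  rtl=False, verbose=False):
    """Build the argument list for `pdf2csv convert-cli` (defaults here are assumed)."""
    if output_format not in ("csv", "xlsx"):
        raise ValueError("output_format must be 'csv' or 'xlsx'")
    command = [
        "pdf2csv", "convert-cli", pdf_path,
        "--output-dir", output_dir,
        "--output-format", output_format,
    ]
    # Boolean options are plain flags: present when enabled, absent otherwise.
    if rtl:
        command.append("--rtl")
    if verbose:
        command.append("--verbose")
    return command


command = build_command("example.pdf", output_format="xlsx", rtl=True, verbose=True)
print(shlex.join(command))
# To run it (requires pdf2csv to be installed):
#   import subprocess; subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` without a shell avoids quoting problems with paths that contain spaces.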
## With uvx
You can use the CLI tool with `uvx`:
```sh
uvx pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose
```
Example:
```sh
uvx pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose
```
## Python Usage
You can also use the converter directly in your Python code:
```python
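from pdf2csv.converter import convert

pdf_path = "example.pdf"
output_dir = "./output"
rtl = True  # reverse text for right-to-left languages
output_format = "xlsx"  # or "csv"

# convert() writes the output files and returns the extracted tables
dfs = convert(pdf_path, output_dir=output_dir, rtl=rtl, output_format=output_format)
for df in dfs:
    print(df)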
```
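To process several PDFs, the same call can be wrapped in a loop. The following is a sketch under stated assumptions: `convert_folder` is a hypothetical helper, the one-subdirectory-per-PDF layout is an arbitrary choice, and the converter is passed in as `convert_fn` so the helper relies only on the keyword arguments shown above.

```python
from pathlib import Path


def convert_folder(pdf_dir, output_root, convert_fn, **options):
    """Call convert_fn on every *.pdf in pdf_dir, one output subdirectory per file."""
    results = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out_dir = Path(output_root) / pdf.stem
        # Create the directory up front; harmless if the converter also creates it.
        out_dir.mkdir(parents=True, exist_ok=True)
        results[pdf.name] = convert_fn(str(pdf), output_dir=str(out_dir), **options)
    return results


# Intended usage (requires pdf2csv to be installed):
#   from pdf2csv.converter import convert
#   tables = convert_folder("./pdfs", "./output", convert, rtl=True, output_format="xlsx")
```

The returned dictionary maps each PDF file name to whatever `convert_fn` returned for it.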
## TODO:
- [x] Convert datatype to numeric
- [x] Support for XLSX output
Raw data
{
"_id": null,
"home_page": null,
"name": "pdf2csv",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "pdf, csv, excel, pdf2csv, data extraction, docling",
"author": null,
"author_email": "Mehdi Ghodsizadeh <mehdi.ghodsizadeh@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/ee/56/85489f3d876d226e956239c66ac9cad40ec581fcaf7324b98be03fad7a20/pdf2csv-0.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License",
"summary": "A python library and CLI tool to convert PDF files to CSV files.",
"version": "0.2.2",
"project_urls": {
"Homepage": "https://github.com/ghodsizadeh/pdf2csv",
"Issues": "https://github.com/ghodsizadeh/pdf2csv/issues",
"Repository": "https://github.com/ghodsizadeh/pdf2csv.git"
},
"split_keywords": [
"pdf",
" csv",
" excel",
" pdf2csv",
" data extraction",
" docling"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "13a43630587e6457ab6a7da432229476702d88c27bff8bd3dafb2a2cc18f9f06",
"md5": "ee0fc35a939a1b45284cd7740cd995f9",
"sha256": "a95ec04f521773ff746e3b3e323486c692455708a6b246c5d42862df8cc9b2a7"
},
"downloads": -1,
"filename": "pdf2csv-0.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ee0fc35a939a1b45284cd7740cd995f9",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 7130,
"upload_time": "2025-01-06T10:46:53",
"upload_time_iso_8601": "2025-01-06T10:46:53.745639Z",
"url": "https://files.pythonhosted.org/packages/13/a4/3630587e6457ab6a7da432229476702d88c27bff8bd3dafb2a2cc18f9f06/pdf2csv-0.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ee5685489f3d876d226e956239c66ac9cad40ec581fcaf7324b98be03fad7a20",
"md5": "84444e03a9ed3705a2e039b0888f67ef",
"sha256": "7e32fae06dc26409f97bc1e32febf08fcc1e8d2076462bb228a7d133e4b5af42"
},
"downloads": -1,
"filename": "pdf2csv-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "84444e03a9ed3705a2e039b0888f67ef",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 6238,
"upload_time": "2025-01-06T10:46:55",
"upload_time_iso_8601": "2025-01-06T10:46:55.174669Z",
"url": "https://files.pythonhosted.org/packages/ee/56/85489f3d876d226e956239c66ac9cad40ec581fcaf7324b98be03fad7a20/pdf2csv-0.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-06 10:46:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ghodsizadeh",
"github_project": "pdf2csv",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pdf2csv"
}