pdf2csv


- Name: pdf2csv
- Version: 0.2.2
- Summary: A python library and CLI tool to convert PDF files to CSV files.
- Upload time: 2025-01-06 10:46:55
- Requires Python: >=3.10
- License: MIT License
- Keywords: pdf, csv, excel, pdf2csv, data extraction, docling

# PDF to CSV Converter
<p>
<a href="https://pypi.org/project/pdf2csv" target="_blank">
    <img src="https://img.shields.io/pypi/v/pdf2csv?color=%2334D058&label=pypi%20package" alt="Package version">
</a>
<a href="https://pypi.org/project/pdf2csv" target="_blank">
    <img src="https://img.shields.io/pypi/pyversions/pdf2csv.svg?color=%2334D058" alt="Supported Python versions">
</a>
<a href="https://codecov.io/gh/ghodsizadeh/pdf2csv" target="_blank">
    <img src="https://codecov.io/gh/ghodsizadeh/pdf2csv/branch/main/graph/badge.svg" alt="codecov">
</a>
<a>
    <img src="https://img.shields.io/github/license/ghodsizadeh/pdf2csv" alt="License">
</a>
    <img src="https://img.shields.io/github/stars/ghodsizadeh/pdf2csv" alt="Stars">
    <img src="https://img.shields.io/github/issues/ghodsizadeh/pdf2csv" alt="Issues">
    <!-- downloads -->
    <!-- <img src="https://pepy.tech/badge/pdf2csv" alt="Downloads"> -->


</p>

<img src='./docs/logo.webp' style="display: block; margin: auto; width: 400px; height: auto; box-shadow: 0px 4px 6px rgba(0, 0, 0, 0.1); border-radius: 35px;
padding:40px">


pdf2csv extracts tables from PDF files using the Docling library and saves each table as a CSV or XLSX file. It can optionally reverse cell text for right-to-left languages.

## How It Works

1. **PDF Input**: Provide the path to the PDF file you want to convert.
2. **Table Extraction**: The tool uses Docling's `DocumentConverter` to extract tables from the PDF.
3. **DataFrame Conversion**: Each extracted table is converted into a pandas DataFrame.
4. **Optional Text Reversal**: If the `rtl` option is enabled, text in the DataFrame is reversed.
5. **CSV/XLSX Output**: The DataFrames are saved as CSV or XLSX files in the specified output directory.
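
Steps 4 and 5 above can be sketched with the standard library alone. This is only an illustration, not the package's implementation: the real tool works on pandas DataFrames, and the helper names `reverse_cell`, `reverse_table`, and `table_to_csv` are hypothetical. It also assumes that "reversed" means reversing the characters of each string cell, which the description above does not spell out.

```python
import csv
import io


def reverse_cell(value):
    """Reverse the characters of string cells; leave other types untouched."""
    return value[::-1] if isinstance(value, str) else value


def reverse_table(rows):
    """Return a copy of the table with every string cell reversed."""
    return [[reverse_cell(cell) for cell in row] for row in rows]


def table_to_csv(rows):
    """Serialize a list of rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted table whose text came out in reversed order.
# Latin placeholders stand in for right-to-left text to keep the example readable.
table = [["eman", "tnuoc"], ["ilA", 3]]

fixed = reverse_table(table)
csv_text = table_to_csv(fixed)
print(csv_text)
```

Numeric cells are left as they are, so only text is affected by the reversal.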

## Dependencies

This project depends heavily on the [Docling](https://github.com/docling/docling) library for PDF table extraction.

## CLI Usage

You can use the CLI tool to convert PDF files to CSV or XLSX:

```sh
pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose
```

Example:

```sh
pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose
```
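
If you want to drive the CLI from a script, a small wrapper can assemble the same command. The sketch below uses only the subcommand and flags shown above; the helper names `build_command` and `run_conversion` and the Python-side default values are assumptions made for illustration, not part of pdf2csv.

```python
import subprocess


def build_command(pdf_path, output_dir="./output", output_format="csv",
                  rtl=False, verbose=False):
    """Build the argument list for `pdf2csv convert-cli`.

    The defaults here are choices of this sketch, not documented CLI defaults,
    so the output options are always passed explicitly.
    """
    if output_format not in ("csv", "xlsx"):
        raise ValueError(f"unsupported output format: {output_format!r}")
    command = [
        "pdf2csv", "convert-cli", pdf_path,
        "--output-dir", output_dir,
        "--output-format", output_format,
    ]
    if rtl:
        command.append("--rtl")
    if verbose:
        command.append("--verbose")
    return command


def run_conversion(pdf_path, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(pdf_path, **options), check=True)


command = build_command("example.pdf", output_format="xlsx", rtl=True, verbose=True)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.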

## With uvx

You can use the CLI tool with `uvx`:

```sh
uvx pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose
```

Example:

```sh
uvx pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose
```

## Python Usage

You can also use the converter directly in your Python code:

```python
from pdf2csv.converter import convert

pdf_path = "example.pdf"
output_dir = "./output"
rtl = True
output_format = "xlsx"

dfs = convert(pdf_path, output_dir=output_dir, rtl=rtl, output_format=output_format)
for df in dfs:
    print(df)
```
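
To convert several PDFs in one run, the call above can be wrapped in a loop. The following is a minimal sketch under stated assumptions: `convert_all` is a hypothetical helper, the conversion function is passed in as a parameter so the loop does not depend on how pdf2csv is installed, and it assumes the function accepts the `output_dir` keyword shown above. Whether `convert` creates missing output directories is not stated here, so create them beforehand if needed.

```python
from pathlib import Path


def convert_all(pdf_paths, convert_fn, output_root="./output", **options):
    """Convert each PDF into its own subdirectory of output_root.

    Returns (results, failures): results maps file names to whatever
    convert_fn returned, failures maps file names to the raised exception.
    """
    results, failures = {}, {}
    for pdf_path in pdf_paths:
        pdf_path = Path(pdf_path)
        # One subdirectory per PDF, named after the file without its extension.
        target = Path(output_root) / pdf_path.stem
        try:
            results[pdf_path.name] = convert_fn(
                str(pdf_path), output_dir=str(target), **options
            )
        except Exception as exc:  # keep going if a single file fails
            failures[pdf_path.name] = exc
    return results, failures
```

In real use you would pass the `convert` function imported from `pdf2csv.converter` as `convert_fn`, together with options such as `rtl=True` or `output_format="xlsx"`.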

## TODO
- [x] Convert datatype to numeric
- [x] Support for XLSX output

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pdf2csv",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "pdf, csv, excel, pdf2csv, data extraction, docling",
    "author": null,
    "author_email": "Mehdi Ghodsizadeh <mehdi.ghodsizadeh@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ee/56/85489f3d876d226e956239c66ac9cad40ec581fcaf7324b98be03fad7a20/pdf2csv-0.2.2.tar.gz",
    "platform": null,
    "description": "# PDF to CSV Converter\n<p>\n<a href=\"https://pypi.org/project/pdf2csv\" target=\"_blank\">\n    <img src=\"https://img.shields.io/pypi/v/pdf2csv?color=%2334D058&label=pypi%20package\" alt=\"Package version\">\n</a>\n<a href=\"https://pypi.org/project/pdf2csv\" target=\"_blank\">\n    <img src=\"https://img.shields.io/pypi/pyversions/pdf2csv.svg?color=%2334D058\" alt=\"Supported Python versions\">\n</a>\n<a href=\"https://codecov.io/gh/ghodsizadeh/pdf2csv\" target=\"_blank\">\n    <img src=\"https://codecov.io/gh/ghodsizadeh/pdf2csv/branch/main/graph/badge.svg\" alt=\"codecov\">\n</a>\n<a>\n    <img src=\"https://img.shields.io/github/license/ghodsizadeh/pdf2csv\" alt=\"License\">\n</a>\n    <img src=\"https://img.shields.io/github/stars/ghodsizadeh/pdf2csv\" alt=\"Stars\">\n    <img src=\"https://img.shields.io/github/issues/ghodsizadeh/pdf2csv\" alt=\"Issues\">\n    <!-- downloads -->\n    <!-- <img src=\"https://pepy.tech/badge/pdf2csv\" alt=\"Downloads\"> -->\n\n\n</p>\n\n<img src='./docs/logo.webp' style=\"display: block; margin: auto; width: 400px; height: auto; box-shadow: 0px 4px 6px rgba(0, 0, 0, 0.1); border-radius: 35px;\npadding:40px\">\n\n\n This project provides a tool to convert tables from PDF files into CSV or XLSX format using the Docling library. It extracts tables from PDFs and saves them as CSV or XLSX files, optionally reversing text for right-to-left languages.\n\n## How It Works\n\n1. **PDF Input**: Provide the path to the PDF file you want to convert.\n2. **Table Extraction**: The tool uses Docling's `DocumentConverter` to extract tables from the PDF.\n3. **DataFrame Conversion**: Each extracted table is converted into a pandas DataFrame.\n4. **Optional Text Reversal**: If the `rtl` option is enabled, text in the DataFrame is reversed.\n5. 
**CSV/XLSX Output**: The DataFrames are saved as CSV or XLSX files in the specified output directory.\n\n## Dependencies\n\nThis project heavily depends on the [Docling](https://github.com/docling/docling) library for PDF table extraction.\n\n## CLI Usage\n\nYou can use the CLI tool to convert PDF files to CSV or XLSX:\n\n```sh\npdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose\n```\n\nExample:\n\n```sh\npdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose\n```\n\n## With uvx\n\nYou can use the CLI tool with `uvx`:\n\n```sh\nuvx pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose\n```\n\nExample:\n\n```sh\nuvx pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose\n```\n\n## Python Usage\n\nYou can also use the converter directly in your Python code:\n\n```python\nfrom pdf2csv.converter import convert\n\npdf_path = \"example.pdf\"\noutput_dir = \"./output\"\nrtl = True\noutput_format = \"xlsx\"\n\ndfs = convert(pdf_path, output_dir=output_dir, rtl=rtl, output_format=output_format)\nfor df in dfs:\n    print(df)\n```\n\n## TODO:\n- [x] Convert datatype to numeric\n- [x] Support for XLSX output\n",
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "A python library and CLI tool to convert PDF files to CSV files.",
    "version": "0.2.2",
    "project_urls": {
        "Homepage": "https://github.com/ghodsizadeh/pdf2csv",
        "Issues": "https://github.com/ghodsizadeh/pdf2csv/issues",
        "Repository": "https://github.com/ghodsizadeh/pdf2csv.git"
    },
    "split_keywords": [
        "pdf",
        " csv",
        " excel",
        " pdf2csv",
        " data extraction",
        " docling"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "13a43630587e6457ab6a7da432229476702d88c27bff8bd3dafb2a2cc18f9f06",
                "md5": "ee0fc35a939a1b45284cd7740cd995f9",
                "sha256": "a95ec04f521773ff746e3b3e323486c692455708a6b246c5d42862df8cc9b2a7"
            },
            "downloads": -1,
            "filename": "pdf2csv-0.2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ee0fc35a939a1b45284cd7740cd995f9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 7130,
            "upload_time": "2025-01-06T10:46:53",
            "upload_time_iso_8601": "2025-01-06T10:46:53.745639Z",
            "url": "https://files.pythonhosted.org/packages/13/a4/3630587e6457ab6a7da432229476702d88c27bff8bd3dafb2a2cc18f9f06/pdf2csv-0.2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ee5685489f3d876d226e956239c66ac9cad40ec581fcaf7324b98be03fad7a20",
                "md5": "84444e03a9ed3705a2e039b0888f67ef",
                "sha256": "7e32fae06dc26409f97bc1e32febf08fcc1e8d2076462bb228a7d133e4b5af42"
            },
            "downloads": -1,
            "filename": "pdf2csv-0.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "84444e03a9ed3705a2e039b0888f67ef",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 6238,
            "upload_time": "2025-01-06T10:46:55",
            "upload_time_iso_8601": "2025-01-06T10:46:55.174669Z",
            "url": "https://files.pythonhosted.org/packages/ee/56/85489f3d876d226e956239c66ac9cad40ec581fcaf7324b98be03fad7a20/pdf2csv-0.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-06 10:46:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ghodsizadeh",
    "github_project": "pdf2csv",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pdf2csv"
}
        