txt-from-pdf


Name: txt-from-pdf
Version: 1.1.0
Home page: https://github.com/alisafaya/txt-from-pdf
Summary: Extract clean text from PDFs.
Upload time: 2024-05-03 15:40:09
Author: Ali Safaya
Requires Python: >=3.7.0
License: Apache
Keywords: pdf text extraction
            
# txt-from-pdf: Extract clean text from PDFs

[![Github release](https://img.shields.io/github/release/alisafaya/txt-from-pdf.svg)](https://github.com/alisafaya/txt-from-pdf/releases)
[![PyPI version](https://badge.fury.io/py/txt-from-pdf.svg)](https://badge.fury.io/py/txt-from-pdf)
[![GitHub license](https://img.shields.io/github/license/alisafaya/txt-from-pdf.svg)](./LICENSE)

Extracts text from PDFs using [pdfminer.six](https://github.com/pdfminer/pdfminer.six) and [pypdf](https://github.com/py-pdf/pypdf/). Adapted from [PDFextract](https://github.com/sdtblck/PDFextract).

# Installation

```bash
pip install txt-from-pdf
```

# Usage

```python
from txtfrompdf import extract_txt_from_pdf

pdf_path = "file.pdf"
text = extract_txt_from_pdf(pdf_path)
print(text)
```
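
To process a whole folder from Python rather than from the CLI, the single-file call above can be wrapped in a loop. The following is a minimal sketch, not part of the package: `extract_directory` is a hypothetical helper, and the extractor is passed in as an argument, so the only thing assumed about the library is that `extract_txt_from_pdf(path)` returns the text as a string, as in the example above. How the library reports a failed extraction is not documented here, so the sketch simply skips files that raise or return nothing.

```python
from pathlib import Path
from typing import Callable, List


def extract_directory(
    input_dir: str,
    output_dir: str,
    extract: Callable[[str], str],
) -> List[Path]:
    """Run `extract` on every *.pdf in input_dir and write one .txt per PDF.

    Hypothetical helper: `extract` is any function that maps a PDF path
    to its text, for example txtfrompdf.extract_txt_from_pdf.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        try:
            text = extract(str(pdf))
        except Exception as err:
            # Assumption: a damaged PDF should not stop the whole batch.
            print(f"skipped {pdf.name}: {err}")
            continue
        if not text:
            # Assumption: an empty or None result means nothing was extracted.
            print(f"skipped {pdf.name}: no text extracted")
            continue
        target = out / (pdf.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Example call (requires `pip install txt-from-pdf`):
# from txtfrompdf import extract_txt_from_pdf
# extract_directory("dir-with-pdfs", "extracted-text", extract_txt_from_pdf)
```

Passing the extractor as a parameter keeps the loop independent of the library, so it can be tried with a stub function before running it on real PDFs.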

# CLI Usage

Single file:

```bash
txt-from-pdf --input file.pdf --output extracted-text 
```

Multiple files in a directory:

```bash
txt-from-pdf --input dir-with-pdfs --output extracted-text 
```

Detailed help:

```bash
usage: txt-from-pdf [-h] --input INPUT [--output OUTPUT] [--no_filter] [--size SIZE]

txt-from-pdf CLI - Extracts cleaned text from PDF files

options:
  -h, --help       show this help message and exit
  --input INPUT    Path to a folder containing PDFs or to a single PDF file. (Required)
  --output OUTPUT  Output location for the extracted text files. (Optional, default: 'extracted_text')
  --no_filter      Turn off cleaning the resulting text files. (Optional)
  --size SIZE      Maximum file size per page in bytes for processing (mostly images). (Optional, default: 300000)
```
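
When the CLI is driven from a script, the options listed above can be assembled programmatically. The sketch below is illustrative only: `build_command` and `run_extraction` are hypothetical helper names, and the code uses nothing beyond the four documented flags (`--input`, `--output`, `--no_filter`, `--size`). It assumes the `txt-from-pdf` executable is on the `PATH` after installation.

```python
import subprocess
from typing import List, Optional


def build_command(
    input_path: str,
    output: Optional[str] = None,
    no_filter: bool = False,
    size: Optional[int] = None,
) -> List[str]:
    """Assemble a txt-from-pdf command line from the documented options."""
    cmd = ["txt-from-pdf", "--input", input_path]
    if output is not None:
        cmd += ["--output", output]
    if no_filter:
        # Flag takes no value; it turns off cleaning of the extracted text.
        cmd.append("--no_filter")
    if size is not None:
        cmd += ["--size", str(size)]
    return cmd


def run_extraction(*args, **kwargs) -> int:
    """Run the CLI and return its exit code (needs txt-from-pdf on PATH)."""
    return subprocess.run(build_command(*args, **kwargs)).returncode
```

For example, `build_command("dir-with-pdfs", output="extracted-text", no_filter=True)` produces the argument list for the directory invocation shown earlier with cleaning turned off.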

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/alisafaya/txt-from-pdf",
    "name": "txt-from-pdf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7.0",
    "maintainer_email": null,
    "keywords": "pdf text extraction",
    "author": "Ali Safaya",
    "author_email": "alisafaya@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/18/56/b1de6b8a8a28d07204947398eba216cd84031c24e1e2c002444ac18c973f/txt-from-pdf-1.1.0.tar.gz",
    "platform": null,
    "description": "\n# txt-from-pdf: Extract clean text from PDFs\n\n[![Github release](https://img.shields.io/github/release/alisafaya/txt-from-pdf.svg)](https://github.com/alisafaya/txt-from-pdf/releases)\n[![PyPI version](https://badge.fury.io/py/txt-from-pdf.svg)](https://badge.fury.io/py/txt-from-pdf)\n[![GitHub license](https://img.shields.io/github/license/alisafaya/txt-from-pdf.svg)](./LICENSE)\n\nExtracting text from pdfs using [pdfminer.six](https://github.com/pdfminer/pdfminer.six) and [pypdf](https://github.com/py-pdf/pypdf/). Adapted from [PDFextract](https://github.com/sdtblck/PDFextract).\n\n# Installation\n\n```bash\npip install txt-from-pdf\n```\n\n# Usage\n\n```python\nfrom txtfrompdf import extract_txt_from_pdf\n\npdf_path = \"file.pdf\"\ntext = extract_txt_from_pdf(pdf_path)\nprint(text)\n```\n\n# CLI Usage\n\nSingle file:\n\n```bash\ntxt-from-pdf --input file.pdf --output extracted-text \n```\n\nMultiple files in a directory:\n\n```bash\ntxt-from-pdf --input dir-with-pdfs --output extracted-text \n```\n\nDetailed help:\n\n```bash\nusage: txt-from-pdf [-h] --input INPUT [--output OUTPUT] [--no_filter] [--size SIZE]\n\ntxt-from-pdf CLI - Extracts cleaned text from PDF files\n\noptions:\n  -h, --help       show this help message and exit\n  --input INPUT    Path to a folder containing PDFs or to a single PDF file. (Required)\n  --output OUTPUT  Output location for the extracted text files. (Optional, default: 'extracted_text')\n  --no_filter      Turn off cleaning the resulting text files. (Optional)\n  --size SIZE      Maximum file size per page in bytes for processing (mostly images). (Optional, default: 300000)\n```\n",
    "bugtrack_url": null,
    "license": "Apache",
    "summary": "Extract clean text from PDFs.",
    "version": "1.1.0",
    "project_urls": {
        "Homepage": "https://github.com/alisafaya/txt-from-pdf"
    },
    "split_keywords": [
        "pdf",
        "text",
        "extraction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0f1defa94b7d3f325813fb89e463aab9545fbc11ea730f31a9d86ea9e83d8daf",
                "md5": "9d1fe62de57ea99f8d70789d145ee689",
                "sha256": "b2d96b64c1d49d627d57b0c6d08cb08956b67c3eff835d20f33a22b38eee441a"
            },
            "downloads": -1,
            "filename": "txt_from_pdf-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9d1fe62de57ea99f8d70789d145ee689",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7.0",
            "size": 14508,
            "upload_time": "2024-05-03T15:40:06",
            "upload_time_iso_8601": "2024-05-03T15:40:06.963468Z",
            "url": "https://files.pythonhosted.org/packages/0f/1d/efa94b7d3f325813fb89e463aab9545fbc11ea730f31a9d86ea9e83d8daf/txt_from_pdf-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1856b1de6b8a8a28d07204947398eba216cd84031c24e1e2c002444ac18c973f",
                "md5": "97cba2283d2b5e9676e3cf7273a81151",
                "sha256": "1ae509e8401e7f1c31f7a12e50bac1321ad200adbdb3ed32d1d3888be679e95e"
            },
            "downloads": -1,
            "filename": "txt-from-pdf-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "97cba2283d2b5e9676e3cf7273a81151",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7.0",
            "size": 13661,
            "upload_time": "2024-05-03T15:40:09",
            "upload_time_iso_8601": "2024-05-03T15:40:09.713540Z",
            "url": "https://files.pythonhosted.org/packages/18/56/b1de6b8a8a28d07204947398eba216cd84031c24e1e2c002444ac18c973f/txt-from-pdf-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-03 15:40:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "alisafaya",
    "github_project": "txt-from-pdf",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "txt-from-pdf"
}
        