txt-from-pdf

Name	txt-from-pdf
Version	1.3.1
home_page	https://github.com/alisafaya/txt-from-pdf
Summary	Extract clean text from PDFs.
upload_time	2024-07-27 16:29:12
maintainer	None
docs_url	None
author	Ali Safaya
requires_python	>=3.7.0
license	Apache
keywords	pdf text extraction
requirements	No requirements were recorded.

            
# txt-from-pdf: Extract clean text from PDFs

[![Github release](https://img.shields.io/github/release/alisafaya/txt-from-pdf.svg)](https://github.com/alisafaya/txt-from-pdf/releases)
[![PyPI version](https://badge.fury.io/py/txt-from-pdf.svg)](https://badge.fury.io/py/txt-from-pdf)
[![GitHub license](https://img.shields.io/github/license/alisafaya/txt-from-pdf.svg)](./LICENSE)

Extracts text from PDFs using [pymupdf](https://github.com/pymupdf/PyMuPDF/), with a focus on cleaning and formatting the extracted text.

# Installation

```bash
pip install txt-from-pdf
```

# Usage

```python
from txtfrompdf import extract_txt_from_pdf

pdf_path = "file.pdf"
text = extract_txt_from_pdf(pdf_path)
print(text)
```
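To process a whole folder from Python rather than one file at a time, the call above can be wrapped in a small loop. The following is a minimal sketch, not part of the package: `extract_folder` and its `extractor` parameter are hypothetical names, the one-`.txt`-per-PDF naming is an assumption and may differ from what the CLI writes, and `extract_txt_from_pdf` is assumed to return a string, as the `print(text)` example suggests.

```python
from pathlib import Path
from typing import Callable, List, Optional


def extract_folder(pdf_dir, out_dir,
                   extractor: Optional[Callable[[str], str]] = None) -> List[Path]:
    """Run an extractor over every *.pdf in pdf_dir and write one .txt per PDF.

    Hypothetical helper: by default it uses extract_txt_from_pdf, which is
    assumed (not confirmed by the README) to return the text as a str.
    """
    if extractor is None:
        # Imported lazily so the helper can be tried with a stand-in extractor.
        from txtfrompdf import extract_txt_from_pdf as extractor

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    # sorted() keeps the processing order deterministic across platforms.
    for pdf_file in sorted(Path(pdf_dir).glob("*.pdf")):
        text = extractor(str(pdf_file))
        target = out_path / (pdf_file.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Calling `extract_folder("dir-with-pdfs", "extracted-text")` would then return the list of text files it wrote. Passing a custom `extractor` makes it possible to swap in a different function without changing the loop.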

# CLI Usage

Single file:

```bash
txt-from-pdf --input file.pdf --output extracted-text 
```

Multiple files in a directory:

```bash
txt-from-pdf --input dir-with-pdfs --output extracted-text 
```

Detailed help:

```bash
usage: txt-from-pdf [-h] --input INPUT [--output OUTPUT] [--no_filter] [--size SIZE]

txt-from-pdf CLI - Extracts cleaned text from PDF files

options:
  -h, --help       show this help message and exit
  --input INPUT    Path to a folder containing PDFs or to a single PDF file. (Required)
  --output OUTPUT  Output location for the extracted text files. (Optional, default: 'extracted_text')
  --no_filter      Turn off cleaning the resulting text files. (Optional)
```
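The CLI can also be driven from a Python script, for example inside a larger pipeline. The sketch below only uses the flags listed in the help text above (`--input`, `--output`, `--no_filter`); `build_command` and `run_cli` are hypothetical helper names, and the sketch assumes the `txt-from-pdf` executable is on `PATH` after installation.

```python
import subprocess
from typing import List, Optional


def build_command(input_path: str,
                  output_dir: Optional[str] = None,
                  no_filter: bool = False) -> List[str]:
    """Assemble the argument list for the txt-from-pdf CLI."""
    cmd = ["txt-from-pdf", "--input", input_path]
    if output_dir is not None:
        # When omitted, the CLI falls back to its own default output location.
        cmd += ["--output", output_dir]
    if no_filter:
        # Turns off cleaning of the resulting text files.
        cmd.append("--no_filter")
    return cmd


def run_cli(input_path: str,
            output_dir: Optional[str] = None,
            no_filter: bool = False) -> subprocess.CompletedProcess:
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(input_path, output_dir, no_filter), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.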

Raw data

{
    "_id": null,
    "home_page": "https://github.com/alisafaya/txt-from-pdf",
    "name": "txt-from-pdf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7.0",
    "maintainer_email": null,
    "keywords": "pdf text extraction",
    "author": "Ali Safaya",
    "author_email": "alisafaya@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/a0/f8/af79d76ae77e71de7823f8235400b05709040822f5765668d96b63b7ed52/txt-from-pdf-1.3.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache",
    "summary": "Extract clean text from PDFs.",
    "version": "1.3.1",
    "project_urls": {
        "Homepage": "https://github.com/alisafaya/txt-from-pdf"
    },
    "split_keywords": [
        "pdf",
        "text",
        "extraction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7e0ec517db8c0b02c708894a5d7a09b97b3a7e6b38f9fba964db7476dc5c5784",
                "md5": "02d2837c7d2db2ed29bc6fe1523150f8",
                "sha256": "c76d9355b338ec5e73f2fefc8b61514378da82dba1a68bb6388ae89d305a9f9a"
            },
            "downloads": -1,
            "filename": "txt_from_pdf-1.3.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "02d2837c7d2db2ed29bc6fe1523150f8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7.0",
            "size": 13826,
            "upload_time": "2024-07-27T16:29:11",
            "upload_time_iso_8601": "2024-07-27T16:29:11.061069Z",
            "url": "https://files.pythonhosted.org/packages/7e/0e/c517db8c0b02c708894a5d7a09b97b3a7e6b38f9fba964db7476dc5c5784/txt_from_pdf-1.3.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a0f8af79d76ae77e71de7823f8235400b05709040822f5765668d96b63b7ed52",
                "md5": "bd74b77c9748c865047795cfc9750bff",
                "sha256": "47f6fde02d3e56dd10f4339948f19dd95dfa0edcc659c530646402e03c7e8608"
            },
            "downloads": -1,
            "filename": "txt-from-pdf-1.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "bd74b77c9748c865047795cfc9750bff",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7.0",
            "size": 11667,
            "upload_time": "2024-07-27T16:29:12",
            "upload_time_iso_8601": "2024-07-27T16:29:12.898646Z",
            "url": "https://files.pythonhosted.org/packages/a0/f8/af79d76ae77e71de7823f8235400b05709040822f5765668d96b63b7ed52/txt-from-pdf-1.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-27 16:29:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "alisafaya",
    "github_project": "txt-from-pdf",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "txt-from-pdf"
}
