# txt-from-pdf: Extract clean text from PDFs
[![GitHub release](https://img.shields.io/github/release/alisafaya/txt-from-pdf.svg)](https://github.com/alisafaya/txt-from-pdf/releases)
[![PyPI version](https://badge.fury.io/py/txt-from-pdf.svg)](https://badge.fury.io/py/txt-from-pdf)
[![GitHub license](https://img.shields.io/github/license/alisafaya/txt-from-pdf.svg)](./LICENSE)
Extracts clean text from PDF files using [pdfminer.six](https://github.com/pdfminer/pdfminer.six) and [pypdf](https://github.com/py-pdf/pypdf/). Adapted from [PDFextract](https://github.com/sdtblck/PDFextract).
# Installation
```bash
pip install txt-from-pdf
```
# Usage
```python
from txtfrompdf import extract_txt_from_pdf
pdf_path = "file.pdf"
text = extract_txt_from_pdf(pdf_path)
print(text)
```
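To process a whole folder from Python rather than from the CLI, a small wrapper around `extract_txt_from_pdf` may be enough. The sketch below is illustrative only: `extract_directory` is a hypothetical helper, not part of the package, and it assumes that `extract_txt_from_pdf` returns the extracted text as a string, as the `print(text)` example above suggests.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def extract_directory(
    input_dir: str,
    output_dir: str = "extracted_text",
    extract: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Write one .txt file per PDF in input_dir and return {pdf name: status}.

    Hypothetical helper; `extract` defaults to txtfrompdf.extract_txt_from_pdf.
    """
    if extract is None:
        # Imported lazily so the helper can also be tried with a stand-in extractor.
        from txtfrompdf import extract_txt_from_pdf as extract

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    results: Dict[str, str] = {}
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        try:
            text = extract(str(pdf))
        except Exception as err:
            # One unreadable PDF should not stop the rest of the batch.
            results[pdf.name] = f"failed: {err}"
            continue
        # Assumption: the extractor returns a str; fall back to "" otherwise.
        (out / f"{pdf.stem}.txt").write_text(text or "", encoding="utf-8")
        results[pdf.name] = "ok"
    return results
```

Calling `extract_directory("dir-with-pdfs", "extracted-text")` would then write `extracted-text/<name>.txt` for each PDF it could read and report the ones it could not.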
# CLI Usage
Single file:
```bash
txt-from-pdf --input file.pdf --output extracted-text
```
Multiple files in a directory:
```bash
txt-from-pdf --input dir-with-pdfs --output extracted-text
```
Detailed help:
```bash
usage: txt-from-pdf [-h] --input INPUT [--output OUTPUT] [--no_filter] [--size SIZE]

txt-from-pdf CLI - Extracts cleaned text from PDF files

options:
  -h, --help       show this help message and exit
  --input INPUT    Path to a folder containing PDFs or to a single PDF file. (Required)
  --output OUTPUT  Output location for the extracted text files. (Optional, default: 'extracted_text')
  --no_filter      Turn off cleaning the resulting text files. (Optional)
  --size SIZE      Maximum file size per page in bytes for processing (mostly images). (Optional, default: 300000)
```
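For readers who want to see how these options fit together, the help text above can be reproduced with a plain `argparse` parser. This is a reconstruction from the help output, not the package's actual source: `build_parser` is a hypothetical name, and treating `--size` as an integer is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the CLI interface described in the help text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="txt-from-pdf",
        description="txt-from-pdf CLI - Extracts cleaned text from PDF files",
    )
    parser.add_argument(
        "--input",
        required=True,
        help="Path to a folder containing PDFs or to a single PDF file.",
    )
    parser.add_argument(
        "--output",
        default="extracted_text",
        help="Output location for the extracted text files.",
    )
    parser.add_argument(
        "--no_filter",
        action="store_true",
        help="Turn off cleaning the resulting text files.",
    )
    parser.add_argument(
        "--size",
        type=int,  # assumption: the byte limit is parsed as an integer
        default=300000,
        help="Maximum file size per page in bytes for processing.",
    )
    return parser
```

Parsing `["--input", "file.pdf"]` with this parser leaves `--output` and `--size` at the documented defaults and keeps text cleaning switched on.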
# Raw data
{
"_id": null,
"home_page": "https://github.com/alisafaya/txt-from-pdf",
"name": "txt-from-pdf",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7.0",
"maintainer_email": null,
"keywords": "pdf text extraction",
"author": "Ali Safaya",
"author_email": "alisafaya@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/18/56/b1de6b8a8a28d07204947398eba216cd84031c24e1e2c002444ac18c973f/txt-from-pdf-1.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache",
"summary": "Extract clean text from PDFs.",
"version": "1.1.0",
"project_urls": {
"Homepage": "https://github.com/alisafaya/txt-from-pdf"
},
"split_keywords": [
"pdf",
"text",
"extraction"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0f1defa94b7d3f325813fb89e463aab9545fbc11ea730f31a9d86ea9e83d8daf",
"md5": "9d1fe62de57ea99f8d70789d145ee689",
"sha256": "b2d96b64c1d49d627d57b0c6d08cb08956b67c3eff835d20f33a22b38eee441a"
},
"downloads": -1,
"filename": "txt_from_pdf-1.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9d1fe62de57ea99f8d70789d145ee689",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7.0",
"size": 14508,
"upload_time": "2024-05-03T15:40:06",
"upload_time_iso_8601": "2024-05-03T15:40:06.963468Z",
"url": "https://files.pythonhosted.org/packages/0f/1d/efa94b7d3f325813fb89e463aab9545fbc11ea730f31a9d86ea9e83d8daf/txt_from_pdf-1.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1856b1de6b8a8a28d07204947398eba216cd84031c24e1e2c002444ac18c973f",
"md5": "97cba2283d2b5e9676e3cf7273a81151",
"sha256": "1ae509e8401e7f1c31f7a12e50bac1321ad200adbdb3ed32d1d3888be679e95e"
},
"downloads": -1,
"filename": "txt-from-pdf-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "97cba2283d2b5e9676e3cf7273a81151",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7.0",
"size": 13661,
"upload_time": "2024-05-03T15:40:09",
"upload_time_iso_8601": "2024-05-03T15:40:09.713540Z",
"url": "https://files.pythonhosted.org/packages/18/56/b1de6b8a8a28d07204947398eba216cd84031c24e1e2c002444ac18c973f/txt-from-pdf-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-03 15:40:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "alisafaya",
"github_project": "txt-from-pdf",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "txt-from-pdf"
}