# txt-from-pdf: Extract clean text from PDFs
[![GitHub release](https://img.shields.io/github/release/alisafaya/txt-from-pdf.svg)](https://github.com/alisafaya/txt-from-pdf/releases)
[![PyPI version](https://badge.fury.io/py/txt-from-pdf.svg)](https://badge.fury.io/py/txt-from-pdf)
[![GitHub license](https://img.shields.io/github/license/alisafaya/txt-from-pdf.svg)](./LICENSE)
Extracts text from PDFs using [PyMuPDF](https://github.com/pymupdf/PyMuPDF/), with a focus on cleaning and formatting the extracted text.
# Installation
```bash
pip install txt-from-pdf
```
# Usage
```python
from txtfrompdf import extract_txt_from_pdf
pdf_path = "file.pdf"
text = extract_txt_from_pdf(pdf_path)
print(text)
```
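To process a whole folder from Python rather than from the CLI, the same function can be called in a loop. The sketch below is illustrative only: `extract_directory` and its `extractor` parameter are hypothetical names that are not part of the package, and it assumes `extract_txt_from_pdf` returns the extracted text as a string, as the example above suggests.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def extract_directory(
    pdf_dir: str,
    out_dir: str,
    extractor: Optional[Callable[[str], str]] = None,
) -> Dict[str, Path]:
    """Write one .txt file per *.pdf found directly inside pdf_dir.

    Hypothetical helper, not part of txt-from-pdf.
    """
    if extractor is None:
        # Imported lazily so a stand-in extractor can be passed instead.
        # Assumption: extract_txt_from_pdf(path) returns a str.
        from txtfrompdf import extract_txt_from_pdf as extractor

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written: Dict[str, Path] = {}
    for pdf_file in sorted(Path(pdf_dir).glob("*.pdf")):
        text = extractor(str(pdf_file))
        target = out_path / (pdf_file.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written[pdf_file.name] = target
    return written
```

Calling `extract_directory("dir-with-pdfs", "extracted-text")` would then return a mapping from each PDF file name to the text file written for it. Subfolders are not searched in this sketch.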
# CLI Usage
Single file:
```bash
txt-from-pdf --input file.pdf --output extracted-text
```
Multiple files in a directory:
```bash
txt-from-pdf --input dir-with-pdfs --output extracted-text
```
Detailed help:
```bash
usage: txt-from-pdf [-h] --input INPUT [--output OUTPUT] [--no_filter] [--size SIZE]

txt-from-pdf CLI - Extracts cleaned text from PDF files

options:
  -h, --help       show this help message and exit
  --input INPUT    Path to a folder containing PDFs or to a single PDF file. (Required)
  --output OUTPUT  Output location for the extracted text files. (Optional, default: 'extracted_text')
  --no_filter      Turn off cleaning the resulting text files. (Optional)
```
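If the CLI needs to be driven from a script, the documented flags can be assembled into a command list and run with `subprocess`. This is a minimal sketch, assuming the `txt-from-pdf` executable is on `PATH`; `build_command` and `run_cli` are hypothetical helper names, and only the flags described in the help text above are used.

```python
import subprocess
from typing import List


def build_command(
    input_path: str,
    output_path: str = "extracted_text",
    clean: bool = True,
) -> List[str]:
    """Build the argument list for the txt-from-pdf CLI (hypothetical helper)."""
    cmd = ["txt-from-pdf", "--input", input_path, "--output", output_path]
    if not clean:
        # --no_filter turns off cleaning of the resulting text files.
        cmd.append("--no_filter")
    return cmd


def run_cli(
    input_path: str,
    output_path: str = "extracted_text",
    clean: bool = True,
) -> None:
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(input_path, output_path, clean), check=True)
```

For example, `run_cli("dir-with-pdfs", "extracted-text", clean=False)` would invoke the tool with `--no_filter`. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.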
Raw data
{
"_id": null,
"home_page": "https://github.com/alisafaya/txt-from-pdf",
"name": "txt-from-pdf",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7.0",
"maintainer_email": null,
"keywords": "pdf text extraction",
"author": "Ali Safaya",
"author_email": "alisafaya@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/a0/f8/af79d76ae77e71de7823f8235400b05709040822f5765668d96b63b7ed52/txt-from-pdf-1.3.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache",
"summary": "Extract clean text from PDFs.",
"version": "1.3.1",
"project_urls": {
"Homepage": "https://github.com/alisafaya/txt-from-pdf"
},
"split_keywords": [
"pdf",
"text",
"extraction"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7e0ec517db8c0b02c708894a5d7a09b97b3a7e6b38f9fba964db7476dc5c5784",
"md5": "02d2837c7d2db2ed29bc6fe1523150f8",
"sha256": "c76d9355b338ec5e73f2fefc8b61514378da82dba1a68bb6388ae89d305a9f9a"
},
"downloads": -1,
"filename": "txt_from_pdf-1.3.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "02d2837c7d2db2ed29bc6fe1523150f8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7.0",
"size": 13826,
"upload_time": "2024-07-27T16:29:11",
"upload_time_iso_8601": "2024-07-27T16:29:11.061069Z",
"url": "https://files.pythonhosted.org/packages/7e/0e/c517db8c0b02c708894a5d7a09b97b3a7e6b38f9fba964db7476dc5c5784/txt_from_pdf-1.3.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a0f8af79d76ae77e71de7823f8235400b05709040822f5765668d96b63b7ed52",
"md5": "bd74b77c9748c865047795cfc9750bff",
"sha256": "47f6fde02d3e56dd10f4339948f19dd95dfa0edcc659c530646402e03c7e8608"
},
"downloads": -1,
"filename": "txt-from-pdf-1.3.1.tar.gz",
"has_sig": false,
"md5_digest": "bd74b77c9748c865047795cfc9750bff",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7.0",
"size": 11667,
"upload_time": "2024-07-27T16:29:12",
"upload_time_iso_8601": "2024-07-27T16:29:12.898646Z",
"url": "https://files.pythonhosted.org/packages/a0/f8/af79d76ae77e71de7823f8235400b05709040822f5765668d96b63b7ed52/txt-from-pdf-1.3.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-27 16:29:12",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "alisafaya",
"github_project": "txt-from-pdf",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "txt-from-pdf"
}