<div id="top" align="left">
<!-- HEADER -->
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/eli64s/pdflex/656aa96e7c4b65ca72077d170e4dcdbdd9bbbc45/docs/assets/logo-dark.svg">
<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/eli64s/pdflex/656aa96e7c4b65ca72077d170e4dcdbdd9bbbc45/docs/assets/logo-light.svg">
<img alt="pdflex Logo" src="https://raw.githubusercontent.com/eli64s/pdflex/656aa96e7c4b65ca72077d170e4dcdbdd9bbbc45/docs/assets/logo-light.svg" width="100%" style="max-width: 100%;">
</picture>
<!-- BADGES -->
<div align="left">
<p align="left" style="margin-bottom: 20px;">
<a href="https://github.com/eli64s/pdflex/actions">
<img src="https://img.shields.io/github/actions/workflow/status/eli64s/pdflex/ci.yml?label=CI&style=flat&logo=githubactions&logoColor=white&labelColor=2A2A2A&color=FF1493" alt="GitHub Actions" />
</a>
<a href="https://app.codecov.io/gh/eli64s/pdflex">
<img src="https://img.shields.io/codecov/c/github/eli64s/pdflex?label=Coverage&style=flat&logo=codecov&logoColor=white&labelColor=2A2A2A&color=00F5FF" alt="Coverage" />
</a>
<a href="https://pypi.org/project/pdflex/">
<img src="https://img.shields.io/pypi/v/pdflex?label=PyPI&style=flat&logo=pypi&logoColor=white&labelColor=2A2A2A&color=3d8be1" alt="PyPI Version" />
</a>
<a href="https://github.com/eli64s/pdflex">
<img src="https://img.shields.io/pypi/pyversions/pdflex?label=Python&style=flat&logo=python&logoColor=white&labelColor=2A2A2A&color=9b26d4" alt="Python Version" />
</a>
<a href="https://opensource.org/license/mit/">
<img src="https://img.shields.io/github/license/eli64s/pdflex?label=License&style=flat&logo=opensourceinitiative&logoColor=white&labelColor=2A2A2A&color=4B0082" alt="MIT License">
</a>
</p>
</div>
<div align="left">
<img src="https://raw.githubusercontent.com/eli64s/pdflex/d545ac98f5ad59ece892e638a7d3bdee593d8e88/docs/assets/line.svg" alt="thematic-break" width="100%" height="2px" style="margin: 20px 0;">
</div>
</div>
## What is `PDFlex`?
PDFlex is a powerful PDF processing toolkit for Python. It provides robust tools for PDF validation, text extraction, merging (with custom separator pages), searching, and more, all built to streamline your PDF automation workflows.
## Features
- **PDF Validation:** Quickly verify if a file is a valid PDF.
- **Text Extraction:** Extract text from PDFs using either PyMuPDF or PyPDF.
- **Directory Processing:** Process entire directories of PDFs for text extraction.
- **PDF Merging:** Merge multiple PDF files into one, automatically inserting a custom separator page between documents.
  - The separator page displays the title (derived from the filename) with underscores and hyphens removed.
  - Supports both portrait and landscape separator pages (ideal for lecture slides).
- **PDF Searching:** Recursively search for PDFs in a directory based on filename patterns (e.g., numeric float prefixes).
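To make the PDF validation feature concrete, here is a minimal sketch of a cheap header check using only the standard library. The helper name `looks_like_pdf` is hypothetical and is not part of the PDFlex API; PDFlex's own validation may well be stricter (for example, by actually parsing the file).

```python
from pathlib import Path

# The PDF specification places this marker at the start of the file.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker.

    Hypothetical helper for illustration; this is only a header check,
    not a full structural validation of the document.
    """
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
```

A check like this is useful as a fast pre-filter before handing files to a real PDF parser, which remains the only reliable way to confirm that a document is readable.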
<!-- ## Documentation
Full documentation is available at [https://pdflex.readthedocs.io/](https://pdflex.readthedocs.io/)
- [User Guide](https://pdflex.readthedocs.io/en/latest/user_guide.html)
- [API Reference](https://pdflex.readthedocs.io/en/latest/api.html)
- [Examples](https://pdflex.readthedocs.io/en/latest/examples.html) -->
---
## Quick Start
## Installation
PDFlex is available on PyPI. To install using pip:
```bash
pip install -U pdflex
```
Alternatively, install in an isolated environment with pipx:
```bash
pipx install pdflex
```
For the fastest installation, use uv:
```bash
uv tool install pdflex
```
---
## Usage
### Command-Line Interface (CLI)
PDFlex provides a convenient CLI for merging and searching PDFs. The CLI supports two primary commands: `merge` and `search`.
#### Merge Command
Merge multiple PDF files into a single document while automatically inserting a separator page before each document.
**Usage:**
```bash
pdflex merge /path/to/file1.pdf /path/to/file2.pdf -o merged_output.pdf
```
Add the `--landscape` flag to create separator pages in landscape orientation:
```bash
pdflex merge /path/to/file1.pdf /path/to/file2.pdf -o merged_output.pdf --landscape
```
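The separator page title is derived from each file's name. As a rough sketch of how such a title could be computed, the hypothetical helper below (not part of the PDFlex API) strips the extension and assumes that underscores and hyphens are replaced by spaces; PDFlex's actual cleaning rules may differ.

```python
from pathlib import Path


def title_from_filename(path):
    """Derive a separator-page title from a file name (hypothetical helper)."""
    stem = Path(path).stem  # drop the directory and the final extension
    # Assumption: underscores and hyphens become spaces rather than vanishing.
    words = stem.replace("_", " ").replace("-", " ").split()
    return " ".join(words)


print(title_from_filename("/path/to/1.2_Intro-to_Graphs.pdf"))
# prints: 1.2 Intro to Graphs
```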
#### Search and Merge Command
Search for PDF files in a directory based on filename filters (or search for lecture slides with numeric float prefixes) and merge them into one PDF.
**Usage:**
- **General Search:**
  ```bash
  pdflex search /path/to/search -o merged_output.pdf --prefix "Chapter" --suffix ".pdf"
  ```
- **Lecture Slides Merge:**
  (Merges all PDFs whose filenames start with a numeric float prefix like `1.2_`, `3.2_`, etc., in sorted order. Separator pages will be in landscape orientation.)
  ```bash
  pdflex search /path/to/algorithms-and-computation -o merged_lectures.pdf --lecture
  ```
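To illustrate what "numeric float prefix, in sorted order" can mean in practice, here is a small standard-library sketch. The regular expression, the function name `sort_lecture_files`, and the ordering rule are all assumptions made for illustration; they are not taken from the PDFlex source, which may order files differently (for example, by comparing the prefix as a float).

```python
import re

# Assumption: a "numeric float prefix" is digits, a dot, digits, then "_".
NUMERIC_PREFIX = re.compile(r"^(\d+)\.(\d+)_")


def sort_lecture_files(names):
    """Keep names with a numeric prefix and sort them by that prefix."""
    keyed = []
    for name in names:
        match = NUMERIC_PREFIX.match(name)
        if match:
            # Compare (major, minor) as integers so "1.10_" sorts after "1.2_".
            key = (int(match.group(1)), int(match.group(2)))
            keyed.append((key, name))
    return [name for _, name in sorted(keyed)]


print(sort_lecture_files(
    ["3.2_Graphs.pdf", "notes.pdf", "1.2_Intro.pdf", "1.10_Review.pdf"]
))
# prints: ['1.2_Intro.pdf', '1.10_Review.pdf', '3.2_Graphs.pdf']
```

Comparing the two parts as integers is a deliberate choice in this sketch: a plain float comparison would treat `1.10` as `1.1` and place it before `1.2`.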
### Python API Usage
You can also use PDFlex directly from your Python code. Below are examples for some common tasks.
#### Merging PDFs with Separator Pages
```python
from pdflex.merge import merge_pdfs

# List of PDF file paths to merge
pdf_files = [
    "/path/to/document1.pdf",
    "/path/to/document2.pdf",
]
# Merge files, using landscape separator pages (ideal for lecture slides)
merge_pdfs(pdf_files, output_path="merged_output.pdf", landscape=True)
```
#### Searching for PDFs by Filename
```python
from pdflex.search import search_pdfs, search_numeric_prefixed_pdfs
# General search: Find PDFs that start with a prefix and/or end with a suffix
pdf_list = search_pdfs("/path/to/search", prefix="Chapter", suffix=".pdf")
print("Found PDFs:", pdf_list)
# Lecture slides: Find PDFs with numeric float prefixes (e.g., "1.2_Intro.pdf")
lecture_slides = search_numeric_prefixed_pdfs("/path/to/algorithms-and-computation")
print("Found lecture slides:", lecture_slides)
```
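For readers who want to see the idea behind a recursive prefix/suffix search, the following is a minimal sketch built on `pathlib` alone. The function `find_pdfs` is a hypothetical stand-in and makes no claim about how `search_pdfs` is implemented or what it returns.

```python
from pathlib import Path


def find_pdfs(root, prefix="", suffix=".pdf"):
    """Recursively list files under `root` whose names match prefix and suffix.

    Hypothetical helper for illustration; returns a sorted list of Path objects.
    """
    matches = [
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.name.startswith(prefix) and p.name.endswith(suffix)
    ]
    return sorted(matches)
```

Calling `find_pdfs("/path/to/search", prefix="Chapter")` would collect files such as `Chapter1.pdf` from the directory and all of its subdirectories, while skipping files with other names or extensions.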
<!--
#### Extracting Text from a PDF
```python
from pdflex import extract_text_from_pdf
# Extract text from a PDF using the auto-detection method (tries PyMuPDF then falls back to PyPDF)
output_txt = extract_text_from_pdf("invoice.pdf", method="auto")
print(f"Extracted text saved to: {output_txt}")
```
#### Processing an Entire Directory
```python
from pdflex import process_directory
# Process all PDFs in a directory and extract their text to corresponding .txt files.
process_directory("/path/to/pdf_directory", output_dir="/path/to/text_outputs")
```
---
## API Reference
For detailed API documentation, please refer to the [API Reference](https://pdflex.readthedocs.io/en/latest/api.html).
### Exceptions
- **PDFlexError:** Raised for any error during PDF processing (e.g., invalid PDF, extraction failure).
### Modules Overview
- **`pdflex.merge`**
Contains functions to merge PDFs, insert separator pages (with customizable orientation and title cleaning), and write the final merged document.
- **`pdflex.search`**
Provides functions to recursively search for PDFs in a directory based on filename patterns, including numeric float prefixes for lecture slides.
- **`pdflex.extract`** (and similar)
Functions for extracting text using PyMuPDF or PyPDF, validating PDF files, and processing directories of PDFs.
- **`pdflex.cli`**
Command-line interface that exposes the `merge` and `search` commands, complete with rich console output.
-->
---
## Contributing
Contributions are welcome! Whether it's bug reports, feature requests, or code contributions, please feel free to:
1. Open an [issue][github-issues].
2. Submit a [pull request][github-pulls].
3. Improve the documentation.
4. Share your ideas.
---
## Acknowledgments
This project builds on several excellent open-source PDF projects:
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [pdfplumber](https://github.com/jsvine/pdfplumber)
- [reportlab](https://www.reportlab.com/opensource/)
---
## License
PDFlex is released under the [MIT][mit-license] license. <br />
Copyright (c) 2020 to present [PDFlex][pdflex] and contributors.
<div align="left">
<a href="#top">
<img src="https://raw.githubusercontent.com/eli64s/pdflex/607d295f58914fc81a5b71fd994af90901b6433c/docs/assets/button.svg" width="100px" height="100px" alt="Return to Top">
</a>
</div>
<div align="left">
<img src="https://raw.githubusercontent.com/eli64s/pdflex/d545ac98f5ad59ece892e638a7d3bdee593d8e88/docs/assets/line.svg" alt="thematic-break" width="100%" height="2px" style="margin: 20px 0;">
</div>
<!-- REFERENCE LINKS -->
<!-- PROJECT RESOURCES -->
[pypi]: https://pypi.org/project/pdflex/
[pdflex]: https://github.com/eli64s/pdflex
[github-issues]: https://github.com/eli64s/pdflex/issues
[github-pulls]: https://github.com/eli64s/pdflex/pulls
[mit-license]: https://github.com/eli64s/pdflex/blob/main/LICENSE
[examples]: https://github.com/eli64s/pdflex/tree/main/docs/examples
<!-- DEV TOOLS -->
[python]: https://www.python.org/
[pip]: https://pip.pypa.io/en/stable/
[pipx]: https://pipx.pypa.io/stable/
[uv]: https://docs.astral.sh/uv/
[mkdocs]: https://www.mkdocs.org/
[mkdocs.yml]: https://www.mkdocs.org/user-guide/configuration/