# hotpdf

| Field | Value |
| --- | --- |
| Name | hotpdf |
| Version | 0.4.6.1 |
| Summary | Fast PDF Data Extraction library |
| Upload time | 2024-02-22 15:45:53 |
| Requires Python | `>=3.9` |
| License | MIT License, Copyright (c) 2024 Prestatech GmbH |
| Keywords | pdf, data extraction, text extraction, hotpdf, pdfminer, pdfquery |

[![Documentation Status](https://readthedocs.org/projects/hotpdf/badge/?version=latest)](https://hotpdf.readthedocs.io/en/latest/?badge=latest)
[![latest](https://github.com/weareprestatech/hotpdf/actions/workflows/python-publish.yml/badge.svg)](https://github.com/weareprestatech/hotpdf/actions/workflows/python-publish.yml)
[![build](https://github.com/weareprestatech/hotpdf/actions/workflows/build-badge.yml/badge.svg)](https://github.com/weareprestatech/hotpdf/actions/workflows/build-badge.yml)
[![Coverage Status](https://coveralls.io/repos/github/weareprestatech/hotpdf/badge.svg?branch=main)](https://coveralls.io/github/weareprestatech/hotpdf?branch=main)
[![Unit tests](https://github.com/weareprestatech/hotpdf/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/weareprestatech/hotpdf/actions/workflows/test.yml)


hotpdf started as an internal project at [Prestatech](http://prestatech.com/). We needed a fast, memory-efficient way to parse PDF files after running into difficulties parsing large PDFs with libraries such as [pdfquery](https://github.com/jcushman/pdfquery).

hotpdf is a wrapper around [pdfminer.six](https://github.com/pdfminer/pdfminer.six) focusing on text extraction and text search operations on PDFs.

hotpdf can be used to find and extract text from PDFs.
Please [read the docs](https://hotpdf.readthedocs.io/en/latest/) to understand how the library can help you!

## Installation

The latest version of hotpdf can be installed directly from [PyPI](https://pypi.org/project/hotpdf/) with pip.

```bash
pip install hotpdf
```
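
To confirm the installation worked, something like the following sketch can help. It uses only the standard library (`importlib.metadata`) and checks the `>=3.9` Python requirement listed in the package metadata. The helper names `installed_version` and `python_is_supported` are made up for this illustration and are not part of hotpdf.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def python_is_supported(minimum: Tuple[int, int] = (3, 9)) -> bool:
    """Check the running interpreter against a minimum (major, minor) version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    # Prints None if hotpdf is not installed in the current environment.
    print("hotpdf version:", installed_version("hotpdf"))
```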

## Local Setup

First, install hotpdf and its dependencies in editable mode from the repository root:

```bash
python3 -m pip install -e .
```

### Contributing

You should install the [pre-commit](https://github.com/weareprestatech/hotpdf/blob/main/.pre-commit-config.yaml) hooks with `pre-commit install`. This will run the linter, mypy, and ruff formatting before each commit.

Remember to run `pip install -e '.[dev]'` to install the extra dependencies for development.

For more examples of how to run the full test suite, please refer to the [CI workflow](https://github.com/weareprestatech/hotpdf/blob/main/.github/workflows/test.yml).

We strive to keep test coverage at 100%, although that is not always possible (for example, when a test file cannot be made available). If you want your contributions accepted, please write tests for them :D

Some examples of running tests locally:

```bash
python3 -m pip install -e '.[dev]'    # install extra deps for testing
python3 -m pytest -n=auto tests/      # run the test suite
# run tests with coverage
python3 -m pytest --cov-fail-under=90 -n=auto --cov=hotpdf --cov-report term-missing
```

### Documentation

We use [sphinx](https://www.sphinx-doc.org/en/master/) to generate our docs and host them on [readthedocs](https://readthedocs.org/).

Please update or add documentation alongside your contributions where needed.

Update the `.rst` files, rebuild them, and commit them along with your PRs.

```bash
cd docs
make clean
make html
```

This will generate the necessary documentation files. Once your changes are merged to `main`, the docs are updated automatically.

## Usage

**To view more detailed usage information, please [read the docs](https://hotpdf.readthedocs.io/en/latest/)**

Basic usage is as follows:

```python
from hotpdf import HotPdf
from hotpdf.encodings.types import EncodingTypes

pdf_file_path = "test.pdf"

# Load the PDF file into memory
hotpdf_document = HotPdf(pdf_file_path)

# Alternatively, pass an opened PDF stream to be loaded
with open(pdf_file_path, "rb") as f:
    hotpdf_document_2 = HotPdf(f)

# Sometimes pdfminer will not replace (cid:x) values properly.
# In that case, pass an EncodingTypes value
hotpdf_cid_removal_object = HotPdf(pdf_file_path, cid_overwrite_charset=EncodingTypes.LATIN)

# You can also merge multiple HotPdf objects into a single HotPdf object
merged_hotpdf_object = HotPdf.merge_multiple(hotpdfs=[hotpdf_document, hotpdf_document_2])

# Get the number of pages
print(len(hotpdf_document.pages))

# Find text
text_occurrences = hotpdf_document.find_text("foo")

# Find text and its full span
text_occurrences_full_span = hotpdf_document.find_text("foo", take_span=True)

# Extract text in a region
text_in_bbox = hotpdf_document.extract_text(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)

# Extract spans in a region
spans_in_bbox = hotpdf_document.extract_spans(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)

# Extract the text of spans in a region
spans_text_in_bbox = hotpdf_document.extract_spans_text(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)

# Extract the full text of a page
full_page_text = hotpdf_document.extract_page_text(page=0)
```
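
A common follow-up to a text search is to turn the matched characters into a region that can be passed as `x0`, `y0`, `x1`, `y1` to the extraction calls above. The sketch below shows only that arithmetic. This README does not describe the structure of `find_text` results, so the `Char` class is a hypothetical stand-in rather than hotpdf's own type, and `bounding_box` is an illustrative helper, not a library function. Please check the docs for the actual result shape before adapting it.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Char:
    """Hypothetical stand-in for one matched character; not hotpdf's own class."""

    value: str
    x0: float
    y0: float
    x1: float
    y1: float


def bounding_box(
    chars: Iterable[Char], padding: float = 0.0
) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) enclosing all characters, grown by `padding`."""
    chars = list(chars)
    if not chars:
        raise ValueError("cannot compute a bounding box for zero characters")
    return (
        min(c.x0 for c in chars) - padding,
        min(c.y0 for c in chars) - padding,
        max(c.x1 for c in chars) + padding,
        max(c.y1 for c in chars) + padding,
    )


# Made-up coordinates for the three characters of "foo".
match = [
    Char("f", 10, 20, 15, 30),
    Char("o", 15, 20, 21, 30),
    Char("o", 21, 20, 27, 30),
]
x0, y0, x1, y1 = bounding_box(match, padding=2)
print(x0, y0, x1, y1)  # prints: 8 18 29 32
```

The padding is a design choice: a small margin makes it less likely that characters sitting exactly on the region boundary are missed.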

## License

This project is licensed under the terms of the MIT license.

---
with ❤️ from the team @ [Prestatech GmbH](https://prestatech.com/)

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "hotpdf",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Alex Ptakhin <alex.ptakhin@prestatech.com>, Izel Odabasi <izel.odabasi@prestatech.com>, Mattia Callegari <callegari.mattia@protonmail.com>",
    "keywords": "pdf,data extraction,text extraction,hotpdf,pdfminer,pdfquery",
    "author": "",
    "author_email": "Krishnasis Mandal <krishnasis.mandal@prestatech.com>",
    "download_url": "https://files.pythonhosted.org/packages/fd/4b/9df1987988de725de3ba88c5fd26f128103ee70dd43ffba30d97d172d362/hotpdf-0.4.6.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Prestatech GmbH  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Fast PDF Data Extraction library",
    "version": "0.4.6.1",
    "project_urls": null,
    "split_keywords": [
        "pdf",
        "data extraction",
        "text extraction",
        "hotpdf",
        "pdfminer",
        "pdfquery"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "82bd9333bd48fc360c285a08bd0dca6a0aaaf61f291fbbd0624f49e359459872",
                "md5": "42b04c45629330ab5d330ed1a47858f8",
                "sha256": "9083077d9e5acdc3da2f3f216b41d203ee533593696c64cf8f0d117769014792"
            },
            "downloads": -1,
            "filename": "hotpdf-0.4.6.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "42b04c45629330ab5d330ed1a47858f8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 19525,
            "upload_time": "2024-02-22T15:45:52",
            "upload_time_iso_8601": "2024-02-22T15:45:52.056939Z",
            "url": "https://files.pythonhosted.org/packages/82/bd/9333bd48fc360c285a08bd0dca6a0aaaf61f291fbbd0624f49e359459872/hotpdf-0.4.6.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fd4b9df1987988de725de3ba88c5fd26f128103ee70dd43ffba30d97d172d362",
                "md5": "8be6058bd6bb4789dca84a638112c046",
                "sha256": "480df3886da444d40c969c023ffa4acc140449bf1276e5fe81a4279870a62fe1"
            },
            "downloads": -1,
            "filename": "hotpdf-0.4.6.1.tar.gz",
            "has_sig": false,
            "md5_digest": "8be6058bd6bb4789dca84a638112c046",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 23454,
            "upload_time": "2024-02-22T15:45:53",
            "upload_time_iso_8601": "2024-02-22T15:45:53.346050Z",
            "url": "https://files.pythonhosted.org/packages/fd/4b/9df1987988de725de3ba88c5fd26f128103ee70dd43ffba30d97d172d362/hotpdf-0.4.6.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-22 15:45:53",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "hotpdf"
}
        