pdf-page-annotator


Name: pdf-page-annotator
Version: 0.3.0
Home page: https://github.com/ChinmayShrivastava/pdf-page-annotator
Summary: A lightweight library to extract a PDF's table of contents and tag each entry with the pages containing its content.
Upload time: 2024-03-23 05:23:55
Author: Chinmay Shrivastava
License: GPLv3
Requirements: No requirements were recorded.

# pdf-page-annotator
A lightweight library to extract a PDF's table of contents (TOC) and tag each entry with the pages containing its content.

To understand the structure of a PDF and retrieve information from it effectively, you need to know what the document contains and exactly
which pages hold each piece of content.

When you need to extract a specific subsection of a PDF, it will be in one of two places:
1. In a section of a semi-structured document (one with a clear structure and a TOC).
2. In an unknown section, or in fragmented form, inside an unstructured document.

In the more extreme case of an unstructured document, we have to analyze the whole document each time we want to find some information exhaustively, because naive vector retrieval cannot do that.

Semi-structured documents are easier. By convention, most important PDF documents worth indexing have a TOC, so we can perform an initial
TOC sweep and extract the relevant page numbers for each TOC item. Then, when we have to search for something exhaustively,
instead of searching through the whole document, we search only the TOC to find the relevant pages and
extract information from just those pages, saving time and tokens.
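
The TOC-first lookup described above can be sketched in a few lines. This is only an illustration, not the library's implementation: `TocEntry` and `pages_for_query` are hypothetical names, the plain substring match stands in for whatever matching you actually use, and `end_page` is assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:
    """Hypothetical stand-in for one TOC item tagged with its page range."""
    title: str
    start_page: int
    end_page: int  # assumed inclusive for this sketch


def pages_for_query(toc, query):
    """Return the sorted page numbers of every TOC entry whose title mentions the query."""
    needle = query.lower()
    pages = set()
    for entry in toc:
        if needle in entry.title.lower():
            pages.update(range(entry.start_page, entry.end_page + 1))
    return sorted(pages)


# Made-up TOC used only to demonstrate the lookup.
toc = [
    TocEntry("Introduction", 1, 3),
    TocEntry("Risk Factors", 4, 9),
    TocEntry("Financial Statements", 10, 25),
]

print(pages_for_query(toc, "risk"))  # [4, 5, 6, 7, 8, 9]
```

Only the six matching pages would then need to be read or sent to a model, rather than all 25.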

## Installation

```shell
pip install pdf-page-annotator
```

## Usage

1. Import and initialize the `PDFAnnotator` class

```python
from pdf_page_annotator import PDFAnnotator

# `verbose=True` logs progress to the console; the default is `False`.
annotator = PDFAnnotator(pdf_path="path_to_your_pdf_file", verbose=True)
```

2. Extract the contents

```python
annotator.run_extraction_pipeline()
```

3. Access the content list

```python
first = annotator.content[0]
print(first.unique_title, first.start_page, first.end_page)
```
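
To work with every entry rather than just the first, a small helper can turn the content list into a title-to-pages lookup. This is a sketch under assumptions: it presumes `annotator.content` is an iterable whose items all expose `unique_title`, `start_page`, and `end_page` as in step 3, and `build_page_index` is a hypothetical helper, not part of the library. The demo uses stand-in objects so that it runs without a PDF.

```python
from types import SimpleNamespace


def build_page_index(entries):
    """Map each entry's unique_title to its (start_page, end_page) pair."""
    return {e.unique_title: (e.start_page, e.end_page) for e in entries}


# Stand-in entries with made-up values, mimicking the attributes shown in step 3.
demo_content = [
    SimpleNamespace(unique_title="Introduction", start_page=1, end_page=3),
    SimpleNamespace(unique_title="Risk Factors", start_page=4, end_page=9),
]

index = build_page_index(demo_content)
for title, (start, end) in index.items():
    print(f"{title}: pages {start} to {end}")
```

With a real annotator, you would pass `annotator.content` instead of `demo_content` after running the extraction pipeline.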

Enjoy!

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ChinmayShrivastava/pdf-page-annotator",
    "name": "pdf-page-annotator",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "Chinmay Shrivastava",
    "author_email": "cshrivastava99@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/bd/c3/d808c5f668b39a19a1ccd2a300457ed44164c1262404cb693013e64afd60/pdf_page_annotator-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPLv3",
    "summary": "A light weight library to extract the table of contents and tag them to the pages containing the content.",
    "version": "0.3.0",
    "project_urls": {
        "Homepage": "https://github.com/ChinmayShrivastava/pdf-page-annotator"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "19134ea937ca42d2531631ea5d691627f01d454e51b9b346b53785014b749f1f",
                "md5": "7d0c90d7eb8b4eaca6f4816e7455d5f5",
                "sha256": "3e2e880833b2c4eecf7cc18f78b6c6ad8a0b41b462e7cf16685a881cf8c799a2"
            },
            "downloads": -1,
            "filename": "pdf_page_annotator-0.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7d0c90d7eb8b4eaca6f4816e7455d5f5",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 17990,
            "upload_time": "2024-03-23T05:23:54",
            "upload_time_iso_8601": "2024-03-23T05:23:54.026748Z",
            "url": "https://files.pythonhosted.org/packages/19/13/4ea937ca42d2531631ea5d691627f01d454e51b9b346b53785014b749f1f/pdf_page_annotator-0.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bdc3d808c5f668b39a19a1ccd2a300457ed44164c1262404cb693013e64afd60",
                "md5": "1ccf0b98f2f99dad7109c42adf53b847",
                "sha256": "d7adc04df18d6cf744fa990a162f45b9b885d626827ac480f6eccac95e30f92f"
            },
            "downloads": -1,
            "filename": "pdf_page_annotator-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1ccf0b98f2f99dad7109c42adf53b847",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 17218,
            "upload_time": "2024-03-23T05:23:55",
            "upload_time_iso_8601": "2024-03-23T05:23:55.833851Z",
            "url": "https://files.pythonhosted.org/packages/bd/c3/d808c5f668b39a19a1ccd2a300457ed44164c1262404cb693013e64afd60/pdf_page_annotator-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-23 05:23:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ChinmayShrivastava",
    "github_project": "pdf-page-annotator",
    "github_not_found": true,
    "lcname": "pdf-page-annotator"
}
        