spacypdfreader


| Field | Value |
| --- | --- |
| Name | spacypdfreader |
| Version | 0.3.1 |
| home_page | https://github.com/SamEdwardes/spaCyPDFreader |
| Summary | A PDF to text extraction pipeline component for spaCy. |
| upload_time | 2023-10-17 16:17:14 |
| docs_url | None |
| author | SamEdwardes |
| requires_python | `>=3.8,<3.12` |
| license | MIT |
| keywords | python, spacy, nlp, pdf, pdfs |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# spacypdfreader

Easy PDF-to-text extraction into *spaCy* in Python.

<p>
    <!-- PyPI version -->
    <a href="https://pypi.org/project/spacypdfreader" target="_blank">
        <img src="https://img.shields.io/pypi/v/spacypdfreader?color=%2334D058&label=pypi%20package" alt="Package version">
    </a>
    <!-- PyPi Downloads -->
    <img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/spacypdfreader?label=PyPi%20downloads">
    <!-- Pytest -->
    <a href="https://github.com/SamEdwardes/spacypdfreader/actions/workflows/pytest.yml" target="_blank">
        <img src="https://github.com/SamEdwardes/spacypdfreader/actions/workflows/pytest.yml/badge.svg" alt="pytest">
    </a>
</p>

<hr></hr>

**Documentation:** [https://samedwardes.github.io/spacypdfreader/](https://samedwardes.github.io/spacypdfreader/)

**Source code:** [https://github.com/SamEdwardes/spacypdfreader](https://github.com/SamEdwardes/spacypdfreader)

**PyPi:** [https://pypi.org/project/spacypdfreader/](https://pypi.org/project/spacypdfreader/)

**spaCy universe:** [https://spacy.io/universe/project/spacypdfreader](https://spacy.io/universe/project/spacypdfreader)

<hr></hr>

*spacypdfreader* is a Python library for extracting text from PDF documents into *spaCy* `Doc` objects. When you use *spacypdfreader*, the resulting spaCy `Token` and `Doc` objects are annotated with additional information about the PDF.

The key features are:

- **PDF to spaCy Doc object:** Convert a PDF document directly into a *spaCy* `Doc` object.
- **Custom spaCy attributes and methods:**
    - `token._.page_number`
    - `doc._.page_range`
    - `doc._.first_page`
    - `doc._.last_page`
    - `doc._.pdf_file_name`
    - `doc._.page(int)`
- **Multiple parsers:** Select between multiple built-in PDF-to-text parsers or bring your own PDF-to-text parser.

## Installation

Install *spacypdfreader* using pip:

```bash
pip install spacypdfreader
```

To install with the required pytesseract dependencies:

```bash
pip install 'spacypdfreader[pytesseract]'
```

## Usage

```python
import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

# Get the page number of any token.
print(doc[0]._.page_number)  # 1
print(doc[-1]._.page_number)  # 4

# Get page metadata about the PDF document.
print(doc._.pdf_file_name)  # "tests/data/test_pdf_01.pdf"
print(doc._.page_range)  # (1, 4)
print(doc._.first_page)  # 1
print(doc._.last_page)  # 4

# Get all of the text from a specific PDF page.
print(doc._.page(4))  # "able to display the destination page (unless..."
```
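
The attributes shown above can be combined to walk a document page by page. The following is a minimal sketch, not part of the library: `pages_as_dict` is a hypothetical helper name, and it assumes that `doc._.page_range` is inclusive (as the `(1, 4)` output above suggests) and that `str()` of whatever `doc._.page(n)` returns yields the page text. A stand-in object is used so the sketch runs without a PDF or a spaCy model.

```python
from types import SimpleNamespace


def pages_as_dict(doc):
    """Map each page number to that page's text (hypothetical helper)."""
    first_page, last_page = doc._.page_range  # assumed inclusive, e.g. (1, 4)
    return {
        page_number: str(doc._.page(page_number))
        for page_number in range(first_page, last_page + 1)
    }


# Stand-in for a real Doc so the sketch is self-contained; in practice
# `doc` would come from `pdf_reader("some.pdf", nlp)`.
_fake_pages = {1: "first page text", 2: "second page text"}
fake_doc = SimpleNamespace(
    _=SimpleNamespace(page_range=(1, 2), page=lambda n: _fake_pages[n])
)

print(pages_as_dict(fake_doc))
```

With a real document, replace `fake_doc` with the `doc` returned by `pdf_reader`.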

## What is *spaCy*?

*spaCy* is a natural language processing (NLP) tool. It can be used to perform a variety of NLP tasks. For more information check out the excellent documentation at [https://spacy.io](https://spacy.io).

## Implementation Notes

spaCyPDFreader behaves a little differently from a typical [spaCy custom component](https://spacy.io/usage/processing-pipelines#custom-components). Typically, a spaCy component receives and returns a `spacy.tokens.Doc` object.

spaCyPDFreader breaks this convention because the text must first be extracted from the PDF. Instead, `pdf_reader` takes a path to a PDF file and a `spacy.Language` object as parameters and returns a `spacy.tokens.Doc` object. Because you pass in your own `spacy.Language` object, you get an easy way to extract text from PDF files while still being able to use and customize all of the features spaCy has to offer.
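
The order of operations described here (extract text first, then run the language pipeline) can be illustrated with a small sketch. This is not the library's implementation: `read_document`, `fake_extract_pages`, and the use of `str.split` as a stand-in for a `spacy.Language` object are all assumptions made purely for illustration, so that the example runs with the standard library alone.

```python
def read_document(path, nlp, extract_pages):
    """Extract text page by page, then pass each page's text to `nlp`.

    Returns (token_text, page_number) pairs, mimicking the idea of
    tagging every token with the page it came from.
    """
    tokens_with_pages = []
    for page_number, page_text in enumerate(extract_pages(path), start=1):
        for token in nlp(page_text):
            tokens_with_pages.append((str(token), page_number))
    return tokens_with_pages


def fake_extract_pages(path):
    """Hypothetical extractor: pretends the PDF at `path` has two pages."""
    return ["Hello world", "Second page"]


# `str.split` stands in for a real spaCy pipeline here.
result = read_document("example.pdf", str.split, fake_extract_pages)
print(result)
```

Because the function receives a file path rather than a `Doc`, it cannot be registered as an ordinary pipeline step, which is the point the paragraph above makes.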

Example of a "traditional" spaCy pipeline component, [negspaCy](https://spacy.io/universe/project/negspacy):

```python
import spacy
from negspacy.negation import Negex

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("negex", config={"ent_types": ["PERSON", "ORG"]})
doc = nlp("She does not like Steve Jobs but likes Apple products.")
```

Example of `spaCyPDFreader` usage:

```python
import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")

doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
```

Note that spaCyPDFreader does not use `nlp.add_pipe`.
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/SamEdwardes/spaCyPDFreader",
    "name": "spacypdfreader",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<3.12",
    "maintainer_email": "",
    "keywords": "python,spacy,nlp,pdf,pdfs",
    "author": "SamEdwardes",
    "author_email": "edwardes.s@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/5d/06/348fe206bc13980b9b32726f98e9710a1b179eda367d000a29727f07d998/spacypdfreader-0.3.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A PDF to text extraction pipeline component for spaCy.",
    "version": "0.3.1",
    "project_urls": {
        "Homepage": "https://github.com/SamEdwardes/spaCyPDFreader",
        "Repository": "https://github.com/SamEdwardes/spaCyPDFreader"
    },
    "split_keywords": [
        "python",
        "spacy",
        "nlp",
        "pdf",
        "pdfs"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "26e6537329be49d31242702a3534b547a3b89783542823c2d87cb74148d9f2f6",
                "md5": "01d1c87a1ba864d04a4c8dd23e241eeb",
                "sha256": "b8a9db1fe440393928a3170dcf1892d9a348cf2ed8d838fbd2674985ddc04a3a"
            },
            "downloads": -1,
            "filename": "spacypdfreader-0.3.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "01d1c87a1ba864d04a4c8dd23e241eeb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<3.12",
            "size": 10020,
            "upload_time": "2023-10-17T16:17:12",
            "upload_time_iso_8601": "2023-10-17T16:17:12.119141Z",
            "url": "https://files.pythonhosted.org/packages/26/e6/537329be49d31242702a3534b547a3b89783542823c2d87cb74148d9f2f6/spacypdfreader-0.3.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5d06348fe206bc13980b9b32726f98e9710a1b179eda367d000a29727f07d998",
                "md5": "93830079d7a9cc43092c6453004ea33d",
                "sha256": "ed6505639cec391b4145895e915307b03d70e98b4e1d54eef4683e8a72d20ff5"
            },
            "downloads": -1,
            "filename": "spacypdfreader-0.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "93830079d7a9cc43092c6453004ea33d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<3.12",
            "size": 8696,
            "upload_time": "2023-10-17T16:17:14",
            "upload_time_iso_8601": "2023-10-17T16:17:14.504608Z",
            "url": "https://files.pythonhosted.org/packages/5d/06/348fe206bc13980b9b32726f98e9710a1b179eda367d000a29727f07d998/spacypdfreader-0.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-17 16:17:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "SamEdwardes",
    "github_project": "spaCyPDFreader",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "spacypdfreader"
}
        