# iPyPDF
A Jupyter-based tool to help parse out structured text from a PDF document and explore the contents.
## Installation
### Windows Installer
https://drive.google.com/drive/folders/1wmQisECMor04dgv9ZXFc07zq6zcHuija?usp=sharing
This will make a start-menu shortcut called "iPyPDF" which will open up the notebook for parsing documents.
### From Source
1. Clone this repo
2. Install Anaconda or Miniconda if you do not already have it
3. Install mamba: `conda install mamba`
   * Solving the environment is very slow without mamba
4. Create the environment and install ipypdf from source
```bash
mamba env create -f environment.yml -p env/ipypdf
conda activate env/ipypdf
pip install -e .
```
> Note: you can replace "mamba" with "conda" if you don't have mamba installed. It will just take longer to solve the environment.
### From pip
1. Create a conda environment with `Tesseract` and `Jupyterlab`
```bash
conda create -n ipypdf jupyterlab tesseract -c conda-forge
conda activate ipypdf
pip install ipypdf
```
2. Get a spaCy model (the From Source method installs this automatically through the `environment.yml` file)
   1. `python -m spacy download en_core_web_sm`
   2. Or `conda install spacy-model-en_core_web_sm -c conda-forge`
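
To confirm the environment is complete before launching the app, a quick import check can help. The following is a minimal sketch using only the standard library. The helper `missing_modules` is hypothetical, and the module names passed to it are assumptions based on the packages mentioned in the steps above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised packages
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module names assumed from the install steps above.
    wanted = ["ipypdf", "spacy", "en_core_web_sm"]
    print("missing:", missing_modules(wanted) or "nothing")
```

An empty result only means the modules can be located; it does not prove that Tesseract itself is on the `PATH`.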
## Usage
ipypdf is built for JupyterLab but should also work in Jupyter notebooks.

1. Launch JupyterLab with `jupyter lab`
2. In a notebook cell, create and display the app:
```python
from ipypdf import App
app = App("path/to/your/pdfs", bulk_render=False)
app
```
See `notebooks` for additional info.
### Development
See `DEVELOPMENT.md`.
### Common Issues
* AutoTools widget keeps saying layoutparser is not installed
  * This is usually a problem with pywin32.
  * Try `conda install pywin32`
  * Also make sure that numpy is <1.19.3
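
The numpy constraint above can be checked from inside the environment. This is a minimal sketch, not part of ipypdf: `release_tuple` and `is_older_than` are hypothetical helpers that compare only the numeric release segments of a version string, which is enough for a bound like `<1.19.3`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Turn '1.19.2' (or '1.19.2rc1') into (1, 19, 2) for comparison."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_older_than(installed, limit):
    """True if `installed` is strictly older than `limit`."""
    return release_tuple(installed) < release_tuple(limit)


if __name__ == "__main__":
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        print("numpy is not installed in this environment")
    else:
        ok = is_older_than(installed, "1.19.3")
        print(f"numpy {installed}:", "ok" if ok else "not older than 1.19.3")
```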
## Features
The GUI has three panels: Table of Contents, PDF viewer, and Tools.
This section covers the options available in the Tools panel.
### Auto-Tools
This tab contains tools that iterate through each page of the PDF.
* `Text Only`: Runs each page through [Tesseract](https://github.com/tesseract-ocr/tesseract) to obtain plain text.
* `Parse Layout`: Uses [layoutparser](https://github.com/Layout-Parser/layout-parser)
to label portions of the document as title, text, image, or table. The sections
are then assembled using a few simple rules to approximate a shallow content hierarchy.
  * Title and Text blocks are cropped out and sent through Tesseract to obtain the text.
  * Tables are processed using a rule-based table parsing scheme described [here](https://github.com/JoelStansbury/PubTabNet/blob/main/README.pdf).
  * Image blocks have no additional processing.

> The process is not perfect. For example, if layoutparser mislabels a section title as standard text, that section will be missing from the table of contents. Mistakes like this are fairly common. To correct them, you can edit the table of contents using the arrow keys (the cursor must be hovering over the table of contents).
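
The exact assembly rules are not spelled out here, so the following is only a sketch under an assumed rule set: a block labeled `title` opens a new section, and every other block attaches to the most recently opened section. The `Section` class and `assemble` function are hypothetical and are not ipypdf's implementation. The sample input also shows how a title mislabeled as text ends up absorbed into the previous section.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    content: list = field(default_factory=list)


def assemble(blocks):
    """Group (label, text) blocks, given in reading order, into sections.

    Assumed rules: a "title" block opens a new section; every other
    block is attached to the most recently opened section.
    """
    sections = [Section(title="(untitled)")]  # holds blocks seen before any title
    for label, text in blocks:
        if label == "title":
            sections.append(Section(title=text))
        else:
            sections[-1].content.append((label, text))
    # Drop the placeholder if nothing preceded the first title.
    if not sections[0].content:
        sections.pop(0)
    return sections


blocks = [
    ("title", "1 Introduction"),
    ("text", "Some opening text."),
    ("title", "2 Methods"),
    ("text", "3 Results"),  # a section title mislabeled as text
    ("text", "Text that belongs to section 3."),
]
sections = assemble(blocks)
for section in sections:
    print(section.title, len(section.content))
```

With this input only two sections are produced, and the "3 Results" heading sits inside section 2 as ordinary text. This is the kind of mistake that has to be corrected by hand in the table of contents.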
### Table Parsing

### Cytoscape
`Folders`, `PDF Documents`, and `Sections` have a tab labeled `Cytoscape`. This runs a TF-IDF similarity calculation over all nodes beneath the selected item. For example, if you select the root node, all defined nodes are included in the calculation. However, only nodes with a link to another node are drawn (this is for speed and may change in the future).
The color of each node denotes the PDF document it originated from.

Selecting a node in the graph will highlight the node in the `DocTree`. Clicking the node in the `DocTree` will render the first page of the node.
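
To make the similarity step concrete, here is a minimal pure-Python sketch of TF-IDF weighting with cosine similarity, keeping only the pairs that score above a cutoff. It is not ipypdf's code: the function names, the smoothed IDF formula, whitespace tokenization, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, using a plain TF-IDF weighting."""
    tokenized = [text.lower().split() for text in texts]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(texts)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def similarity_edges(texts, threshold=0.1):
    """Return (i, j, score) for each pair of texts scoring above `threshold`."""
    vectors = tfidf_vectors(texts)
    edges = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            if score > threshold:
                edges.append((i, j, score))
    return edges


texts = [
    "pump pressure valve maintenance",
    "valve maintenance schedule for the pump",
    "holiday party catering menu",
]
edges = similarity_edges(texts)
# Only nodes that appear in at least one edge would be drawn.
linked = {i for i, j, _ in edges} | {j for i, j, _ in edges}
print(edges, linked)
```

The third text shares no terms with the others, so it has no edge and would be left out of the drawing, which mirrors the "only linked nodes are drawn" behavior described above.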

### Spacy
Extracts named entities from the selected branch of the document tree. The raw
text is compiled by a depth-first search starting at whichever node is selected
in the table of contents. The text is then run through a spaCy pipeline
(`nlp(text).ents`), which returns the named entities found within the section.
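
The two steps can be sketched as follows. The `TreeNode` class is a hypothetical stand-in for ipypdf's document tree, and the model name `en_core_web_sm` is taken from the installation steps. The spaCy calls used (`spacy.load`, calling the pipeline on text, `Doc.ents`) are standard, but how ipypdf wires them together is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    label: str
    content: str = ""
    children: list = field(default_factory=list)


def collect_text(node):
    """Depth-first, pre-order walk that gathers text from a node and its descendants."""
    parts = [node.label, node.content]
    for child in node.children:
        parts.append(collect_text(child))
    return "\n".join(part for part in parts if part)


def named_entities(text, model="en_core_web_sm"):
    """Run a spaCy pipeline over `text` and return (entity text, label) pairs."""
    import spacy  # imported here so the tree walk works without spaCy installed

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


tree = TreeNode("Report", children=[
    TreeNode("Background", "Acme Corp was founded in Ohio."),
    TreeNode("Findings", children=[TreeNode("Details", "Revenue grew in 2021.")]),
])
text = collect_text(tree)
print(text)
# entities = named_entities(text)  # requires spaCy and the en_core_web_sm model
```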

### Digitizing Utilities
> I recommend turning off `Show Boxes`, as it changes pages every time you add a node (working on a better solution).

Each node has a specific set of tools available to use. Here are the tools provided when a `Section` node is selected.
Starting from the left:
* `Add Section Node` adds a sub-node of type `Section` and selects it
* `Add Text Node` adds a sub-node of type `Text` and selects it
* `Add Image Node` ...
* `Delete Node` deletes the selected node and all of its children

### Content Selector
Content is extracted from the rendered image. Text is extracted using Optical Character Recognition (OCR). No image analysis is performed for images; only the coordinates and page number are recorded so that the image can be retrieved later if need be.

When a `Section` node is selected, the selection tool will attempt to parse text from the portion of the page selected by the user. This text will __overwrite__ the label assigned to the node.

When a `Text` node is selected, the selection tool will attempt to parse text from the selected area and __append__ it to the node's content. This is because text blocks are not always perfectly rectangular, and often span multiple pages.

When an `Image` node is selected, the coordinates of the box are appended to the node's content.
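
The three behaviors can be summarized in a short sketch. Everything here is hypothetical (the `SelectableNode` class, the `apply_selection` function, the `(x0, y0, x1, y1)` box layout, and the `ocr_text` argument that stands in for the OCR result). It illustrates the overwrite and append rules described above, not ipypdf's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class SelectableNode:
    kind: str                      # "section", "text", or "image"
    label: str = ""
    content: list = field(default_factory=list)


def apply_selection(node, page, box, ocr_text=""):
    """Update `node` from a user selection.

    `box` is (x0, y0, x1, y1) on `page`; `ocr_text` stands in for the OCR
    result of that region (the real tool would run OCR itself).
    """
    if node.kind == "section":
        node.label = ocr_text                 # overwrite the label
    elif node.kind == "text":
        node.content.append(ocr_text)         # append: blocks may span regions or pages
    elif node.kind == "image":
        node.content.append({"page": page, "box": box})  # coordinates only
    else:
        raise ValueError(f"unknown node kind: {node.kind!r}")
    return node


section = apply_selection(SelectableNode("section", label="old"), 0, (0, 0, 10, 10), "1 Introduction")
paragraph = SelectableNode("text")
apply_selection(paragraph, 0, (0, 20, 10, 90), "First half of a paragraph")
apply_selection(paragraph, 1, (0, 0, 10, 15), "that continues on the next page.")
figure = apply_selection(SelectableNode("image"), 2, (5, 5, 50, 40))
print(section.label, paragraph.content, figure.content)
```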
### Save Button
This will generate `json` files for each document. When the tool is initialized, these are used to reconstruct the table of contents. You can also use the json file directly.
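
The layout of the saved files is not documented here, so a schema-agnostic walk is a safe starting point for using them directly. The sketch below yields every string in a parsed JSON value along with the path to it. The `iter_strings` helper is hypothetical, and the keys in the sample (`label`, `children`, `content`) are made up for the example and may not match what ipypdf writes.

```python
import json


def iter_strings(value, path=()):
    """Yield (path, string) for every string found in a parsed JSON value."""
    if isinstance(value, str):
        yield path, value
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from iter_strings(child, path + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_strings(child, path + (index,))


# Stand-in document; the keys here are invented for the example.
sample = json.loads(
    '{"label": "Intro", "children": [{"label": "Scope", "content": ["First paragraph."]}]}'
)
found = list(iter_strings(sample))
for path, text in found:
    print(path, text)
```

For a real file, replace the sample with `json.load` on an open file handle and inspect the printed paths to learn the actual structure.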