# PDFMicroarray
## Overview
PDFMicroarray is a Python CLI tool designed to assist with literature research for scientific papers and books.
It extracts text from multiple sources within PDF documents, including:
- Plain text
- Text from images (through OCR)
- Text from embedded diagrams (through page rendering and OCR)
and stores the extracted text in a designated output directory.
The processed text can then be analyzed using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to detect occurrences of specified words. These occurrences are then plotted in a microarray format.
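To illustrate why an edit distance is useful here, the following is a minimal sketch of fuzzy word counting. It is not the tool's actual implementation: the function names `levenshtein` and `count_fuzzy` and the `max_distance` threshold are assumptions made for this example, and the distance is computed in plain Python rather than with thefuzz.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def count_fuzzy(text: str, word: str, max_distance: int = 1) -> int:
    """Count whitespace-separated tokens within max_distance edits of word.

    Hypothetical helper: the threshold and tokenization are assumptions.
    """
    target = word.lower()
    return sum(
        1 for token in text.lower().split()
        if levenshtein(token, target) <= max_distance
    )
```

With a threshold of 1, a typical OCR misreading such as `micr0array` would still be counted as an occurrence of `microarray`, which an exact string comparison would miss.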
## Installation
[Tesseract](https://github.com/tesseract-ocr/tesseract) is required for this CLI tool. Please follow the [installation](https://tesseract-ocr.github.io/tessdoc/Installation.html) instructions for your platform.
```bash
pip install pipx
pipx install pdf-microarray
```
## Usage
```bash
mkdir processed
pdf-microarray process -i documents -o processed
pdf-microarray analyze -i processed -w words.txt -o data.csv
pdf-microarray plot -i data.csv -o plot.png
```
List the search words in `words.txt`, one per line. If a line contains multiple words, it counts as a match only when all of those words occur.
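The rule for lines with several words can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: `parse_word_groups` and `group_matches` are hypothetical names, words on a line are assumed to be separated by whitespace, matching is exact rather than fuzzy to keep the example short, and the unit of text that is searched (called `passage` here) is an assumption.

```python
def parse_word_groups(text: str) -> list[list[str]]:
    """Split the contents of a words file into one word group per non-empty line."""
    return [line.lower().split() for line in text.splitlines() if line.strip()]


def group_matches(group: list[str], passage: str) -> bool:
    """Return True only if every word of the group occurs in the passage."""
    tokens = set(passage.lower().split())
    return all(word in tokens for word in group)


# Example file content: the second line forms a two-word group.
groups = parse_word_groups("protein\ngene expression\n")
hits = [group_matches(group, "Gene expression was measured") for group in groups]
```

In this sketch, the group `gene expression` matches the sample passage because both words are present, while `protein` does not.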
## Example
![Example](example.png)
## Technical details
The library uses document segmentation and multithreading to speed up the extraction process, so that even large books in PDF form can be parsed within a few minutes.
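The general idea of segmenting a document and processing the segments in parallel can be sketched like this. The names `split_into_segments`, `extract_segment`, and `extract_document`, as well as the segment size and worker count, are assumptions for illustration; `extract_segment` is a placeholder that stands in for the real per-page text extraction and OCR.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(page_count: int, segment_size: int) -> list[range]:
    """Divide pages 0..page_count-1 into consecutive ranges of at most segment_size pages."""
    return [
        range(start, min(start + segment_size, page_count))
        for start in range(0, page_count, segment_size)
    ]


def extract_segment(pages: range) -> list[str]:
    """Hypothetical placeholder for per-page extraction (plain text plus OCR)."""
    return [f"text of page {page}" for page in pages]


def extract_document(page_count: int, segment_size: int = 50, workers: int = 4) -> list[str]:
    """Extract all segments concurrently and return the page texts in page order."""
    segments = split_into_segments(page_count, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so pages stay sorted.
        results = pool.map(extract_segment, segments)
        return [text for segment in results for text in segment]
```

Threads tend to pay off in this kind of workload when most of the time is spent waiting on an external process such as the Tesseract binary, rather than executing Python code.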
The library uses [PyMuPDF](https://pypi.org/project/PyMuPDF) for PDF page rendering, [pytesseract](https://pypi.org/project/pytesseract) for OCR, and [thefuzz](https://pypi.org/project/thefuzz) to calculate the Levenshtein distance.
## Contributing
Contributions to PDFMicroarray are welcome! Please feel free to fork the repository, make changes, and submit pull requests. For major changes, please open an issue first to discuss what you would like to change.
## License
Distributed under the GNU General Public License v3.0. See LICENSE for more information.
Raw data
{
"_id": null,
"home_page": "https://github.com/bjorntropf/PDFMicroarray",
"name": "pdf-microarray",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": "pdf, microarray",
"author": "Bj\u00f6rn Tropf",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/f1/e5/c72c8641164d47f5e93b0b293e9151aaffcbf7ac23b56300bb870cc92768/pdf_microarray-1.0.1.tar.gz",
"platform": null,
"description": "# PDFMicroarray\n\n## Overview\n\nPDFMicroarray is a Python CLI tool designed to assist with literature research for scientific papers and books.\n\nIt extracts text from multiples sources within PDF documents, including:\n\n- Plain text\n- Text from images (through OCR)\n- Text from embedded diagrams (through page rendering and OCR)\n\nand stores the extracted text in a designated output directory.\n\nThe processed text can then analyzed using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to detect the occurrences of specified words. A graphical representation of these occurrences is offered in a microarray format.\n\n## Installation\n\n[Tesseract](https://github.com/tesseract-ocr/tesseract) is required for this CLI tool. Please follow the [installation](https://tesseract-ocr.github.io/tessdoc/Installation.html) instructions for your platform.\n\n```bash\npip install pipx\npipx install pdf-microarray\n```\n\n## Usage\n\n```bash\nmkdir processed\npdf-microarray process -i documents -o processed\npdf-microarray analyze -i processed -w words.txt -o data.csv\npdf-microarray plot -i data.csv -o plot.png\n```\n\nThe words in `words.txt` should be separated by newlines. If multiple words are on the same line, only the occurrence of all words is taken into account.\n\n## Example\n\n![Example](example.png)\n\n## Technical details\n\nThe library uses document segmentation and multithreading to speed up the extraction process, so that even large books in PDF form can be parsed within a few minutes.\n\nThe library utilized [PyMuPDF](https://pypi.org/project/PyMuPDF) for OCR, [pytesseract](https://pypi.org/project/pytesseract) for PDF page rendering and [thefuzz](https://pypi.org/project/thefuzz) to calculate the Levenshtein distance.\n\n## Contributing\n\nContributions to PDFMicroarray are welcome! Please feel free to fork the repository, make changes, and submit pull requests. 
For major changes, please open an issue first to discuss what you would like to change.\n\n## License\n\nDistributed under the GNU General Public License v3.0 license. See LICENSE for more information.\n",
"bugtrack_url": null,
"license": "GPL-3.0-or-later",
"summary": "A Python CLI tool designed to assist with literature research by visualizing the occurrence of words in PDF documents with a microarray format.",
"version": "1.0.1",
"project_urls": {
"Homepage": "https://github.com/bjorntropf/PDFMicroarray",
"Repository": "https://github.com/bjorntropf/PDFMicroarray"
},
"split_keywords": [
"pdf",
" microarray"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9feba7b2834c8767aab521c1054c87fedfa27cd9926be779e9858a87572e50f7",
"md5": "c5013c59b440d513ee2466c79cd9c908",
"sha256": "b38ff7ada94a954342e637a0be02dfb41caaff982707e567988bba741ff13ddd"
},
"downloads": -1,
"filename": "pdf_microarray-1.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c5013c59b440d513ee2466c79cd9c908",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 35630,
"upload_time": "2024-05-12T09:30:23",
"upload_time_iso_8601": "2024-05-12T09:30:23.892161Z",
"url": "https://files.pythonhosted.org/packages/9f/eb/a7b2834c8767aab521c1054c87fedfa27cd9926be779e9858a87572e50f7/pdf_microarray-1.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f1e5c72c8641164d47f5e93b0b293e9151aaffcbf7ac23b56300bb870cc92768",
"md5": "a0a1ce47ba34a0cccc6c0fc46eb2f7ba",
"sha256": "31a03c94e5c0993d171a54c6fd232beca73d5eb776d86e79f4071e8e42425b30"
},
"downloads": -1,
"filename": "pdf_microarray-1.0.1.tar.gz",
"has_sig": false,
"md5_digest": "a0a1ce47ba34a0cccc6c0fc46eb2f7ba",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 21006,
"upload_time": "2024-05-12T09:30:24",
"upload_time_iso_8601": "2024-05-12T09:30:24.987506Z",
"url": "https://files.pythonhosted.org/packages/f1/e5/c72c8641164d47f5e93b0b293e9151aaffcbf7ac23b56300bb870cc92768/pdf_microarray-1.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-12 09:30:24",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "bjorntropf",
"github_project": "PDFMicroarray",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pdf-microarray"
}