Name | pdftext |
Version | 0.3.5 |
home_page | https://github.com/VikParuchuri/pdftext |
Summary | Extract structured text from pdfs quickly |
upload_time | 2024-05-02 23:57:40 |
maintainer | None |
docs_url | None |
author | Vik Paruchuri |
requires_python | !=2.7.*,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,!=3.8.*,>=3.9 |
license | Apache-2.0 |
keywords | pdf, text, extraction |
# PDFText
Text extraction like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), but without the AGPL license. PDFText extracts plain text or structured blocks and lines. It's built on [pypdfium2](https://github.com/pypdfium2-team/pypdfium2), so it's [fast, accurate](#benchmarks), and Apache licensed.
## Community
[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
# Installation
You'll need Python 3.9+ first. Then run `pip install pdftext`.
# Usage
- Inspect the settings in `pdftext/settings.py`. You can override any settings with environment variables.
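As a rough illustration of what "override with environment variables" means, here is a minimal standard-library sketch. The class, the setting names (`EXAMPLE_THRESHOLD`, `EXAMPLE_WORKERS`), and the `load_settings` helper are all hypothetical; the real names and loading mechanism are whatever `pdftext/settings.py` defines.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class ExampleSettings:
    # Hypothetical setting names, for illustration only.
    # The real ones live in pdftext/settings.py.
    EXAMPLE_THRESHOLD: float = 0.5
    EXAMPLE_WORKERS: int = 1


def load_settings(environ=None) -> ExampleSettings:
    """Start from the defaults and let same-named environment variables win."""
    environ = os.environ if environ is None else environ
    settings = ExampleSettings()
    for field in fields(settings):
        raw = environ.get(field.name)
        if raw is not None:
            # Environment values are strings, so cast to the default's type.
            caster = type(getattr(settings, field.name))
            setattr(settings, field.name, caster(raw))
    return settings


# Simulate `EXAMPLE_THRESHOLD=0.9` being set in the shell.
settings = load_settings({"EXAMPLE_THRESHOLD": "0.9"})
print(settings)
```

The point is only the precedence: a value set in the environment replaces the default with the same name, and everything else keeps its default.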
## Plain text
This command will write out a text file with the extracted plain text.
```shell
pdftext PDF_PATH --out_path output.txt
```
- `PDF_PATH` must be a single pdf file.
- `--out_path` is the path to the output txt file. If not specified, output is written to stdout.
- `--sort` will attempt to sort the text in reading order.
- `--keep_hyphens` will keep hyphens in the output. Otherwise they are stripped and the words joined.
- `--pages` specifies which pages to extract, as a comma-separated list.
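To make the hyphen behavior concrete, here is a small sketch of the kind of joining described above. It assumes the hyphens in question are the ones that split a word across a line break; `join_hyphenated` is a hypothetical helper, not pdftext's own code, and pdftext's actual rules may differ.

```python
import re


def join_hyphenated(text: str) -> str:
    # Remove a hyphen that sits right before a line break inside a word,
    # and join the two halves. Hyphens elsewhere are left alone.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "fast text extrac-\ntion for well-known formats"
print(join_hyphenated(sample))
```

Passing `--keep_hyphens` corresponds to skipping a step like this and leaving the text as extracted.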
## JSON
This command outputs structured blocks and lines with font and other information.
```shell
pdftext PDF_PATH --out_path output.txt --json
```
- `PDF_PATH` must be a single pdf file.
- `--out_path` is the path to the output file. If not specified, output is written to stdout.
- `--json` specifies json output.
- `--sort` will attempt to sort the text in reading order.
- `--pages` specifies which pages to extract, as a comma-separated list.
- `--keep_chars` will keep individual characters in the json output.
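For reference, a comma-separated `--pages` value maps naturally onto a list of integers, the same shape as the `page_range=[1,2,3]` argument in the programmatic usage section. The `parse_pages` helper here is hypothetical and only sketches that conversion; it is not the CLI's own parsing code.

```python
def parse_pages(value: str) -> list[int]:
    # Split on commas, skip empty pieces, and convert each piece to an int.
    return [int(part) for part in value.split(",") if part.strip()]


print(parse_pages("1,2,3"))
```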
The output will be a json list, with each item in the list corresponding to a single page in the input pdf (in order). Each page will include the following keys:
- `bbox` - the page bbox, in `[x1, y1, x2, y2]` format
- `rotation` - how much the page is rotated, in degrees (`0`, `90`, `180`, or `270`)
- `page` - the index of the page
- `blocks` - the blocks that make up the text in the pdf. Approximately equal to a paragraph.
  - `bbox` - the block bbox, in `[x1, y1, x2, y2]` format
  - `lines` - the lines inside the block
    - `bbox` - the line bbox, in `[x1, y1, x2, y2]` format
    - `spans` - the individual text spans in the line (text spans have the same font/weight/etc)
      - `text` - the text in the span, encoded in utf-8
      - `rotation` - how much the span is rotated, in degrees
      - `bbox` - the span bbox, in `[x1, y1, x2, y2]` format
      - `char_start_idx` - the start index of the first span character in the pdf
      - `char_end_idx` - the end index of the last span character in the pdf
      - `font` this is font info straight from the pdf, see [this pdfium code](https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_text.h)
        - `size` - the size of the font used for the text
        - `weight` - font weight
        - `name` - font name, may be None
        - `flags` - font flags, in the format of the `PDF spec 1.7 Section 5.7.1 Font Descriptor Flags`
If the pdf is rotated, the bboxes will be relative to the rotated page (they're rotated after being extracted).
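Here is a short sketch of walking that structure once it is loaded in Python. The `pages` value is a hand-written stand-in that follows the keys listed above (all numbers and text are made up), and `page_lines` is a hypothetical helper, not part of pdftext.

```python
# A tiny hand-written page in the documented shape. Values are made up.
pages = [
    {
        "bbox": [0, 0, 612, 792],
        "rotation": 0,
        "page": 0,
        "blocks": [
            {
                "bbox": [72, 72, 300, 86],
                "lines": [
                    {
                        "bbox": [72, 72, 300, 86],
                        "spans": [
                            {"text": "Hello ", "rotation": 0, "bbox": [72, 72, 110, 86]},
                            {"text": "world", "rotation": 0, "bbox": [110, 72, 150, 86]},
                        ],
                    }
                ],
            }
        ],
    }
]


def page_lines(page):
    """Return one string per line by concatenating the text of its spans."""
    return [
        "".join(span["text"] for span in line["spans"])
        for block in page["blocks"]
        for line in block["lines"]
    ]


for page in pages:
    print(page["page"], page_lines(page))
```

The same nesting (page, then blocks, then lines, then spans) applies if you need bboxes or font info instead of text.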
# Programmatic usage
Extract plain text:
```python
import pypdfium2 as pdfium
from pdftext.extraction import plain_text_output

PDF_PATH = "document.pdf"  # placeholder: path to your pdf

pdf = pdfium.PdfDocument(PDF_PATH)
# Optional arguments explained above
text = plain_text_output(pdf, sort=False, hyphens=False, page_range=[1,2,3])
```
Extract structured blocks and lines:
```python
import pypdfium2 as pdfium
from pdftext.extraction import dictionary_output

PDF_PATH = "document.pdf"  # placeholder: path to your pdf

pdf = pdfium.PdfDocument(PDF_PATH)
# Optional arguments explained above
pages = dictionary_output(pdf, sort=False, page_range=[1,2,3], keep_chars=False)
```
If you want more customization, check out the `pdftext.extraction._get_pages` function for a starting point to dig deeper. pdftext is a pretty thin wrapper around [pypdfium2](https://pypdfium2.readthedocs.io/en/stable/), so you might want to look at the documentation for that as well.
# Benchmarks
I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext. I chose pymupdf because it extracts blocks and lines. Pdfplumber extracts words and bboxes. I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual character/line/block and bbox information.
Here are the scores, run on an M1 MacBook, without multiprocessing:
| Library | Time (s per page) | Alignment Score (% accuracy vs pymupdf) |
|------------|-------------------|-----------------------------------------|
| pymupdf | 0.32 | -- |
| pdftext | 1.4 | 97.76 |
| pdfplumber | 3.0 | 90.3 |
pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same character information).
There are additional benchmarks for pypdfium2 and other tools [here](https://github.com/py-pdf/benchmarks).
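To read the timing column as relative speed, the arithmetic below divides each time by pdftext's. It only restates the numbers from the table above; no new measurements are involved.

```python
# Seconds per page, copied from the table above.
seconds_per_page = {"pymupdf": 0.32, "pdftext": 1.4, "pdfplumber": 3.0}

baseline = seconds_per_page["pdftext"]
# Values below 1 are faster than pdftext, values above 1 are slower.
relative = {name: round(t / baseline, 2) for name, t in seconds_per_page.items()}
print(relative)
```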
## Methodology
I used a benchmark set of 200 pdfs extracted from [common crawl](https://huggingface.co/datasets/pixparse/pdfa-eng-wds), then processed by a team at HuggingFace.
For each library, I used a detailed extraction method, to pull out font information, as well as just the words. This ensured we were comparing similar performance numbers. I formatted the text similarly when extracting - newlines after lines, and double newlines after blocks. For pdfplumber, I could only do the newlines after lines, since it doesn't recognize blocks.
For the alignment score, I extracted the text, then used the rapidfuzz library to find the alignment percentage. I used the text extracted by pymupdf as the pseudo-ground truth.
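As a rough idea of what an alignment percentage looks like, here is a sketch using `difflib` from the standard library as a stand-in. This is not the rapidfuzz call used in the benchmark (the exact function isn't stated here), so the numbers it gives will not match the table; `alignment_score` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def alignment_score(candidate: str, reference: str) -> float:
    """Similarity of two extracted texts as a percentage from 0 to 100."""
    return 100 * SequenceMatcher(None, candidate, reference).ratio()


reference = "Text extraction\nwith blocks"  # stands in for the pymupdf output
candidate = "Text extraction\nwith block"   # stands in for another library's output
print(round(alignment_score(candidate, reference), 2))
```

Identical texts score 100, and the score drops as the candidate drifts from the reference text.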
## Running benchmarks
You can run the benchmarks yourself. To do so, you have to first install pdftext manually. The install assumes you have poetry and Python 3.9+ installed.
```shell
git clone https://github.com/VikParuchuri/pdftext.git
cd pdftext
poetry install
python benchmark.py # Will download the benchmark pdfs automatically
```
The benchmark script has a few options:
- `--max` controls the maximum number of pdfs to benchmark.
- `--result_path` is a folder to save the results in. A file called `results.json` will be created in the folder.
- `--pdftext_only` skips running pdfplumber, which can be slow.
# How it works
PDFText is a very light wrapper around pypdfium2. It first uses pypdfium2 to extract characters in order, along with font and other information. Then it uses a simple decision tree algorithm to group characters into lines and blocks. It does some simple postprocessing to clean up the text.
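To give a feel for the grouping step, here is a deliberately simplified sketch. It swaps the decision tree for a single fixed rule (start a new line when the vertical position jumps), so it is not pdftext's algorithm. The character dicts, the `max_y_gap` threshold, and `group_into_lines` are all assumptions made for illustration.

```python
def group_into_lines(chars, max_y_gap=2.0):
    """Group characters, already in reading order, into lines of text.

    Each character is a dict with "char" and "bbox" ([x1, y1, x2, y2]).
    A new line starts when y1 differs from the previous character's y1
    by more than max_y_gap.
    """
    lines = []
    current = []
    for ch in chars:
        if current and abs(ch["bbox"][1] - current[-1]["bbox"][1]) > max_y_gap:
            lines.append("".join(c["char"] for c in current))
            current = []
        current.append(ch)
    if current:
        lines.append("".join(c["char"] for c in current))
    return lines


# Made-up characters: "Hi" on one line, "yo" on the next.
chars = [
    {"char": "H", "bbox": [10, 10, 16, 20]},
    {"char": "i", "bbox": [16, 10, 19, 20]},
    {"char": "y", "bbox": [10, 24, 16, 34]},
    {"char": "o", "bbox": [16, 24, 22, 34]},
]
print(group_into_lines(chars))
```

A learned model can weigh several features at once (gaps, font changes, and so on) instead of one hard-coded threshold, which is presumably why a decision tree is used rather than a rule like this.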
# Credits
This is built on some amazing open source work, including:
- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
- [scikit-learn](https://scikit-learn.org/stable/index.html)
- [pypdf](https://github.com/py-pdf/benchmarks) for very thorough and fair benchmarks
Thank you to the [pymupdf](https://github.com/pymupdf/PyMuPDF) devs for creating such a great library - I just wish it had a simpler license!
Raw data
{
"_id": null,
"home_page": "https://github.com/VikParuchuri/pdftext",
"name": "pdftext",
"maintainer": null,
"docs_url": null,
"requires_python": "!=2.7.*,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,!=3.8.*,>=3.9",
"maintainer_email": null,
"keywords": "pdf, text, extraction",
"author": "Vik Paruchuri",
"author_email": "vik.paruchuri@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/3a/42/13465b453d2572749496f3f0378c05dc1131e41c664ce72b236dda7c4686/pdftext-0.3.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Extract structured text from pdfs quickly",
"version": "0.3.5",
"project_urls": {
"Homepage": "https://github.com/VikParuchuri/pdftext",
"Repository": "https://github.com/VikParuchuri/pdftext"
},
"split_keywords": [
"pdf",
" text",
" extraction"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "be42ccc593691318d6c5c64b783a4782867bfc0f3f5728b22273aceacc1f7a23",
"md5": "4f450621be8d69849b7dc3dfad058d5c",
"sha256": "2a1649b1f2b8ea563fd4f2a3a7227afb0693622b5e3820bca390817d92f228c7"
},
"downloads": -1,
"filename": "pdftext-0.3.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4f450621be8d69849b7dc3dfad058d5c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "!=2.7.*,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,!=3.8.*,>=3.9",
"size": 24788,
"upload_time": "2024-05-02T23:57:38",
"upload_time_iso_8601": "2024-05-02T23:57:38.834811Z",
"url": "https://files.pythonhosted.org/packages/be/42/ccc593691318d6c5c64b783a4782867bfc0f3f5728b22273aceacc1f7a23/pdftext-0.3.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3a4213465b453d2572749496f3f0378c05dc1131e41c664ce72b236dda7c4686",
"md5": "c3c7a0afdc8c92c5e70f4f7c816d53da",
"sha256": "bd2c4c918889894488b18fa6395eff77138dcb8762fc3c44f08a402597618d41"
},
"downloads": -1,
"filename": "pdftext-0.3.5.tar.gz",
"has_sig": false,
"md5_digest": "c3c7a0afdc8c92c5e70f4f7c816d53da",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "!=2.7.*,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,!=3.8.*,>=3.9",
"size": 25756,
"upload_time": "2024-05-02T23:57:40",
"upload_time_iso_8601": "2024-05-02T23:57:40.486741Z",
"url": "https://files.pythonhosted.org/packages/3a/42/13465b453d2572749496f3f0378c05dc1131e41c664ce72b236dda7c4686/pdftext-0.3.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-02 23:57:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "VikParuchuri",
"github_project": "pdftext",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pdftext"
}