# PDFText
Text extraction like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), but without the AGPL license. PDFText extracts plain text or structured blocks and lines. It's built on [pypdfium2](https://github.com/pypdfium2-team/pypdfium2), so it's [fast, accurate](#benchmarks), and Apache licensed.
## Community
[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
# Installation
You'll need Python 3.10+ first. Then run `pip install pdftext`.
# Usage
- Inspect the settings in `pdftext/settings.py`. You can override any settings with environment variables.
## Plain text
This command will write out a text file with the extracted plain text.
```shell
pdftext PDF_PATH --out_path output.txt
```
- `PDF_PATH` must be a single pdf file.
- `--out_path` is the path to the output txt file. If not specified, the output is written to stdout.
- `--sort` will attempt to sort in reading order if specified.
- `--keep_hyphens` keeps hyphens in the output (otherwise they are stripped and the words joined).
- `--pages` specifies the pages to extract (comma separated).
- `--workers` specifies the number of parallel workers to use.
- `--flatten_pdf` merges form fields into the PDF.
## JSON
This command outputs structured blocks and lines with font and other information.
```shell
pdftext PDF_PATH --out_path output.json --json
```
- `PDF_PATH` must be a single pdf file.
- `--out_path` is the path to the output json file. If not specified, the output is written to stdout.
- `--json` specifies json output.
- `--sort` will attempt to sort in reading order if specified.
- `--pages` specifies the pages to extract (comma separated).
- `--keep_chars` keeps individual characters in the json output.
- `--workers` specifies the number of parallel workers to use.
- `--flatten_pdf` merges form fields into the PDF.
The output will be a json list, with each item in the list corresponding to a single page in the input pdf (in order). Each page will include the following keys:
- `bbox` - the page bbox, in `[x1, y1, x2, y2]` format
- `rotation` - how much the page is rotated, in degrees (`0`, `90`, `180`, or `270`)
- `page` - the index of the page
- `blocks` - the blocks that make up the text in the pdf. Approximately equal to a paragraph.
  - `bbox` - the block bbox, in `[x1, y1, x2, y2]` format
  - `lines` - the lines inside the block
    - `bbox` - the line bbox, in `[x1, y1, x2, y2]` format
    - `spans` - the individual text spans in the line (text spans have the same font/weight/etc)
      - `text` - the text in the span, encoded in utf-8
      - `rotation` - how much the span is rotated, in degrees
      - `bbox` - the span bbox, in `[x1, y1, x2, y2]` format
      - `char_start_idx` - the start index of the first span character in the pdf
      - `char_end_idx` - the end index of the last span character in the pdf
      - `font` - this is font info straight from the pdf, see [this pdfium code](https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_text.h)
        - `size` - the size of the font used for the text
        - `weight` - font weight
        - `name` - font name, may be None
        - `flags` - font flags, in the format of the `PDF spec 1.7 Section 5.7.1 Font Descriptor Flags`
If the pdf is rotated, the bboxes will be relative to the rotated page (they're rotated after being extracted).
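To make the nesting concrete, here is a minimal sketch that walks a page list shaped like the structure described above and joins span text into lines and blocks. The `sample_pages` data is hand-written for illustration (it is not real pdftext output and omits several keys), and `pages_to_text` is a hypothetical helper, not part of pdftext. It assumes span text carries no trailing newline.

```python
from typing import Any

# Hand-written sample that mimics the documented schema; values are illustrative only.
sample_pages: list[dict[str, Any]] = [
    {
        "bbox": [0, 0, 612, 792],
        "rotation": 0,
        "page": 0,
        "blocks": [
            {
                "bbox": [72, 72, 300, 100],
                "lines": [
                    {
                        "bbox": [72, 72, 300, 86],
                        "spans": [
                            {"text": "Hello ", "font": {"name": "Helvetica", "size": 12, "weight": 400}},
                            {"text": "world", "font": {"name": "Helvetica-Bold", "size": 12, "weight": 700}},
                        ],
                    },
                    {
                        "bbox": [72, 86, 300, 100],
                        "spans": [
                            {"text": "second line", "font": {"name": "Helvetica", "size": 12, "weight": 400}},
                        ],
                    },
                ],
            },
            {
                "bbox": [72, 120, 300, 134],
                "lines": [
                    {
                        "bbox": [72, 120, 300, 134],
                        "spans": [
                            {"text": "Next block", "font": {"name": "Helvetica", "size": 12, "weight": 400}},
                        ],
                    },
                ],
            },
        ],
    }
]


def pages_to_text(pages: list[dict[str, Any]]) -> list[str]:
    """Return one string per page: newline after each line, blank line between blocks."""
    page_texts = []
    for page in pages:
        block_texts = []
        for block in page["blocks"]:
            # A line's text is the concatenation of its spans.
            line_texts = ["".join(span["text"] for span in line["spans"]) for line in block["lines"]]
            block_texts.append("\n".join(line_texts))
        page_texts.append("\n\n".join(block_texts))
    return page_texts


if __name__ == "__main__":
    for page_text in pages_to_text(sample_pages):
        print(page_text)
```

The same loop is a reasonable place to read `span["font"]` if you want to filter by size or weight.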
# Programmatic usage
Extract plain text:
```python
from pdftext.extraction import plain_text_output
text = plain_text_output(PDF_PATH, sort=False, hyphens=False, page_range=[1,2,3]) # Optional arguments explained above
```
Extract structured blocks and lines:
```python
from pdftext.extraction import dictionary_output
text = dictionary_output(PDF_PATH, sort=False, page_range=[1,2,3], keep_chars=False) # Optional arguments explained above
```
If you want more customization, check out the `pdftext.extraction._get_pages` function for a starting point to dig deeper. pdftext is a pretty thin wrapper around [pypdfium2](https://pypdfium2.readthedocs.io/en/stable/), so you might want to look at the documentation for that as well.
# Benchmarks
I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext. I chose pymupdf because it extracts blocks and lines, and pdfplumber because it extracts words and bboxes. I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual character/line/block and bbox information.
Here are the scores, run on an M1 Macbook, without multiprocessing:
| Library | Time (s per page) | Alignment Score (% accuracy vs pymupdf) |
|------------|-------------------|-----------------------------------------|
| pymupdf | 0.32 | -- |
| pdftext | 1.4 | 97.76 |
| pdfplumber | 3.0 | 90.3 |
pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same character information).
There are additional benchmarks for pypdfium2 and other tools [here](https://github.com/py-pdf/benchmarks).
## Methodology
I used a benchmark set of 200 pdfs extracted from [common crawl](https://huggingface.co/datasets/pixparse/pdfa-eng-wds), then processed by a team at HuggingFace.
For each library, I used a detailed extraction method, to pull out font information, as well as just the words. This ensured we were comparing similar performance numbers. I formatted the text similarly when extracting - newlines after lines, and double newlines after blocks. For pdfplumber, I could only do the newlines after lines, since it doesn't recognize blocks.
For the alignment score, I extracted the text, then used the rapidfuzz library to find the alignment percentage. I used the text extracted by pymupdf as the pseudo-ground truth.
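The idea behind the alignment score can be sketched as follows. This is not the benchmark's code: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz so the example has no dependencies, and the function name `alignment_score` and the sample strings are made up for illustration. Scores from this sketch will not match the table above.

```python
from difflib import SequenceMatcher


def alignment_score(candidate: str, reference: str) -> float:
    """Return a 0-100 similarity between extracted text and the pseudo-ground truth."""
    if not candidate and not reference:
        # Two empty extractions agree completely.
        return 100.0
    # autojunk=False keeps the matcher from discarding frequent characters on long texts.
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    return matcher.ratio() * 100


if __name__ == "__main__":
    # Hypothetical extractions of the same page by two libraries.
    reference = "The quick brown fox\njumps over the lazy dog\n\nSecond paragraph"
    candidate = "The quick brown fox\njumps over the lazy dog\nSecond paragraph"
    print(f"{alignment_score(candidate, reference):.2f}")
```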
## Running benchmarks
You can run the benchmarks yourself. To do so, you first have to install pdftext manually. The install assumes you have poetry and Python 3.10+ installed.
```shell
git clone https://github.com/VikParuchuri/pdftext.git
cd pdftext
poetry install
python benchmark.py # Will download the benchmark pdfs automatically
```
The benchmark script has a few options:
- `--max` this controls the maximum number of pdfs to benchmark
- `--result_path` a folder to save the results. A file called `results.json` will be created in the folder.
- `--pdftext_only` skip running pdfplumber, which can be slow.
# How it works
PDFText is a very light wrapper around pypdfium2. It first uses pypdfium2 to extract characters in order, along with font and other information. Then it uses a simple decision tree algorithm to group characters into lines and blocks. It does some simple postprocessing to clean up the text.
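As a rough illustration of the grouping step, the sketch below assigns characters to lines with a single hand-written rule: a new line starts when a character's vertical position jumps by more than a threshold. This is a simplification standing in for the decision tree, not pdftext's code. The `Char` type, the `group_into_lines` function and the threshold value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Char:
    char: str
    x: float  # left edge of the character box
    y: float  # top edge of the character box


def group_into_lines(chars: list[Char], y_threshold: float = 5.0) -> list[str]:
    """Group characters (already in extraction order) into lines of text."""
    lines: list[list[Char]] = []
    for ch in chars:
        # Stay on the current line while the vertical position barely changes.
        if lines and abs(ch.y - lines[-1][-1].y) <= y_threshold:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return ["".join(c.char for c in line) for line in lines]


if __name__ == "__main__":
    chars = [
        Char("H", 10, 100), Char("i", 16, 100),
        Char("y", 10, 114), Char("o", 16, 114.5),
    ]
    print(group_into_lines(chars))
```

A real classifier would look at more features than the vertical offset (for example horizontal gaps and font changes), and a second pass of the same kind would group lines into blocks.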
# Credits
This is built on some amazing open source work, including:
- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
- [scikit-learn](https://scikit-learn.org/stable/index.html)
- [pypdf](https://github.com/py-pdf/benchmarks) for very thorough and fair benchmarks
Thank you to the [pymupdf](https://github.com/pymupdf/PyMuPDF) devs for creating such a great library - I just wish it had a simpler license!