pdftext


- Name: pdftext
- Version: 0.4.0
- Home page: https://github.com/VikParuchuri/pdftext
- Summary: Extract structured text from pdfs quickly
- Upload time: 2024-12-12 16:12:38
- Author: Vik Paruchuri
- Requires Python: <4.0,>=3.10
- License: Apache-2.0
- Keywords: pdf, text, extraction
# PDFText

Text extraction like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), but without the AGPL license.  PDFText extracts plain text or structured blocks and lines.  It's built on [pypdfium2](https://github.com/pypdfium2-team/pypdfium2), so it's [fast, accurate](#benchmarks), and Apache licensed.

## Community

[Discord](https://discord.gg/KuZwXNGnfH) is where we discuss future development.

# Installation

You'll need Python 3.10+ first (the package metadata declares `requires_python` as `<4.0,>=3.10`).  Then run `pip install pdftext`.

# Usage

- Inspect the settings in `pdftext/settings.py`.  You can override any settings with environment variables.

## Plain text

This command will write out a text file with the extracted plain text.

```shell
pdftext PDF_PATH --out_path output.txt
```

- `PDF_PATH` must be a single pdf file.
- `--out_path` is the path to the output txt file.  If not specified, the text is written to stdout.
- `--sort` attempts to sort the text in reading order.
- `--keep_hyphens` keeps hyphens in the output (otherwise they are stripped and the words joined).
- `--pages` specifies the pages to extract, as a comma-separated list.
- `--workers` specifies the number of parallel workers to use.
- `--flatten_pdf` merges form fields into the PDF.

## JSON

This command outputs structured blocks and lines with font and other information.

```shell
pdftext PDF_PATH --out_path output.txt --json
```

- `PDF_PATH` must be a single pdf file.
- `--out_path` is the path to the output file.  If not specified, the json is written to stdout.
- `--json` selects json output.
- `--sort` attempts to sort the output in reading order.
- `--pages` specifies the pages to extract, as a comma-separated list.
- `--keep_chars` keeps individual characters in the json output.
- `--workers` specifies the number of parallel workers to use.
- `--flatten_pdf` merges form fields into the PDF.

The output will be a json list, with each item in the list corresponding to a single page in the input pdf (in order).  Each page will include the following keys:

- `bbox` - the page bbox, in `[x1, y1, x2, y2]` format
- `rotation` - how much the page is rotated, in degrees (`0`, `90`, `180`, or `270`)
- `page` - the index of the page
- `blocks` - the blocks that make up the text in the pdf.  Approximately equal to a paragraph.
  - `bbox` - the block bbox, in `[x1, y1, x2, y2]` format
  - `lines` - the lines inside the block
    - `bbox` - the line bbox, in `[x1, y1, x2, y2]` format
    - `spans` - the individual text spans in the line (text spans have the same font/weight/etc)
      - `text` - the text in the span, encoded in utf-8
      - `rotation` - how much the span is rotated, in degrees
      - `bbox` - the span bbox, in `[x1, y1, x2, y2]` format
      - `char_start_idx` - the start index of the first span character in the pdf
      - `char_end_idx` - the end index of the last span character in the pdf
      - `font` - font info straight from the pdf, see [this pdfium code](https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_text.h)
        - `size` - the size of the font used for the text
        - `weight` - font weight
        - `name` - font name, may be None
        - `flags` - font flags, in the format of the `PDF spec 1.7 Section 5.7.1 Font Descriptor Flags`
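
The `flags` value is an integer bit field, so it usually needs decoding before it is useful.  Here is a minimal sketch of one way to do that, assuming the bit positions from the font descriptor flags table in the PDF 1.7 reference.  `FontFlags` and `describe_flags` are hypothetical names for this example and are not part of pdftext.

```python
from enum import IntFlag


class FontFlags(IntFlag):
    """Font descriptor flag bits; PDF bit position n maps to 1 << (n - 1)."""

    FIXED_PITCH = 1 << 0   # bit 1
    SERIF = 1 << 1         # bit 2
    SYMBOLIC = 1 << 2      # bit 3
    SCRIPT = 1 << 3        # bit 4
    NONSYMBOLIC = 1 << 5   # bit 6
    ITALIC = 1 << 6        # bit 7
    ALL_CAP = 1 << 16      # bit 17
    SMALL_CAP = 1 << 17    # bit 18
    FORCE_BOLD = 1 << 18   # bit 19


def describe_flags(flags: int) -> list[str]:
    """Return the names of the flag bits that are set in `flags`."""
    return [flag.name for flag in FontFlags if flags & flag]


# 98 == 64 + 32 + 2, i.e. the italic, nonsymbolic and serif bits
print(describe_flags(98))
```

Whether a given pdf sets these bits reliably depends on how the file was produced, so treat the decoded names as hints.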

If the pdf is rotated, the bboxes will be relative to the rotated page (they're rotated after being extracted).
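
Because the structure is nested four levels deep, a small generator can make it easier to work with.  The sketch below walks pages, blocks, lines, and spans in the shape described above.  `iter_spans` and `make_span` are hypothetical helpers, and `sample_pages` is hand-written stand-in data that only includes the keys the example reads.

```python
def iter_spans(pages):
    """Yield (page_index, block_index, line_index, span) for every span."""
    for page in pages:
        for block_idx, block in enumerate(page["blocks"]):
            for line_idx, line in enumerate(block["lines"]):
                for span in line["spans"]:
                    yield page["page"], block_idx, line_idx, span


def make_span(text, size):
    """Build a made-up span holding only the keys this example reads."""
    return {"text": text, "font": {"size": size, "name": None}}


# Hand-written stand-in for the parsed output; every value is invented.
sample_pages = [{
    "page": 0,
    "blocks": [
        {"lines": [{"spans": [make_span("Title", 18)]}]},
        {"lines": [{"spans": [make_span("Hello ", 10), make_span("world", 10)]}]},
    ],
}]

# Collect the text of spans whose font size is at least 14.
large = [span["text"] for _, _, _, span in iter_spans(sample_pages)
         if span["font"]["size"] >= 14]
print(large)
```

The same loop works on the list loaded with `json.load` from a file written with `--json`, assuming the file follows the layout above.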

# Programmatic usage

Extract plain text:

```python
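from pdftext.extraction import plain_text_output

PDF_PATH = "document.pdf"  # placeholder: replace with the path to your pdf

# Optional arguments are explained in the Usage section above
text = plain_text_output(PDF_PATH, sort=False, hyphens=False, page_range=[1, 2, 3])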
```

Extract structured blocks and lines:

```python
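from pdftext.extraction import dictionary_output

PDF_PATH = "document.pdf"  # placeholder: replace with the path to your pdf

# Optional arguments are explained in the Usage section above
pages = dictionary_output(PDF_PATH, sort=False, page_range=[1, 2, 3], keep_chars=False)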
```
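
The CLI takes pages as a comma-separated string, while `page_range` takes a list of integers.  If you are wrapping these functions in your own script, a small converter bridges the two.  This is a sketch only: `parse_pages` is a hypothetical helper, not a pdftext function, and it handles plain comma-separated integers and nothing else.

```python
def parse_pages(pages_arg: str) -> list[int]:
    """Turn a comma-separated string such as "1,2,3" into a list of ints."""
    return [int(part) for part in pages_arg.split(",") if part.strip()]


page_range = parse_pages("1, 2,3")
print(page_range)
```

The resulting list can then be passed as the `page_range` argument shown above.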

If you want more customization, check out the `pdftext.extraction._get_pages` function for a starting point to dig deeper.  pdftext is a pretty thin wrapper around [pypdfium2](https://pypdfium2.readthedocs.io/en/stable/), so you might want to look at the documentation for that as well.

# Benchmarks

I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext.  I chose pymupdf because it extracts blocks and lines.  Pdfplumber extracts words and bboxes.  I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual character/line/block and bbox information.

Here are the scores, run on an M1 MacBook, without multiprocessing:

| Library    | Time (s per page) | Alignment Score (% accuracy vs pymupdf) |
|------------|-------------------|-----------------------------------------|
| pymupdf    | 0.32              | --                                      |
| pdftext    | 1.4               | 97.76                                   |
| pdfplumber | 3.0               | 90.3                                    |

pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same character information).
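
To put the timings in the table in relative terms, the short calculation below divides the per-page times by each other.  The numbers are copied from the table; `relative_speed` is just a name chosen for this example.

```python
# Seconds per page, copied from the benchmark table above.
seconds_per_page = {"pymupdf": 0.32, "pdftext": 1.4, "pdfplumber": 3.0}


def relative_speed(baseline: str, other: str) -> float:
    """Return how many times longer `other` takes per page than `baseline`."""
    return seconds_per_page[other] / seconds_per_page[baseline]


print(f"pdftext vs pymupdf: {relative_speed('pymupdf', 'pdftext'):.1f}x")
print(f"pdfplumber vs pdftext: {relative_speed('pdftext', 'pdfplumber'):.1f}x")
```

On these numbers pdftext takes roughly 4.4 times as long per page as pymupdf, and pdfplumber takes roughly 2.1 times as long as pdftext.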

There are additional benchmarks for pypdfium2 and other tools [here](https://github.com/py-pdf/benchmarks).

## Methodology

I used a benchmark set of 200 pdfs that were extracted from [common crawl](https://huggingface.co/datasets/pixparse/pdfa-eng-wds) and then processed by a team at HuggingFace.

For each library, I used a detailed extraction method, to pull out font information, as well as just the words.  This ensured we were comparing similar performance numbers.  I formatted the text similarly when extracting - newlines after lines, and double newlines after blocks.  For pdfplumber, I could only do the newlines after lines, since it doesn't recognize blocks.
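
As an illustration of that formatting, here is a minimal sketch that turns one page of block/line/span output into text, with a newline between lines and a blank line between blocks.  It is not the benchmark script's code.  `blocks_to_text` is a hypothetical helper, the sample page is invented, and the sketch assumes span text carries no trailing newline of its own.

```python
def blocks_to_text(page: dict) -> str:
    """Join spans into lines, lines with newlines, blocks with blank lines."""
    block_texts = []
    for block in page["blocks"]:
        lines = ["".join(span["text"] for span in line["spans"])
                 for line in block["lines"]]
        block_texts.append("\n".join(lines))
    return "\n\n".join(block_texts)


# Invented page with two blocks; only the keys used above are present.
sample_page = {
    "blocks": [
        {"lines": [
            {"spans": [{"text": "First "}, {"text": "line"}]},
            {"spans": [{"text": "second line"}]},
        ]},
        {"lines": [{"spans": [{"text": "New block"}]}]},
    ]
}

print(blocks_to_text(sample_page))
```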

For the alignment score, I extracted the text, then used the rapidfuzz library to find the alignment percentage.  I used the text extracted by pymupdf as the pseudo-ground truth.
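
The idea behind the score can be sketched with the standard library.  Note the swap: this example uses `difflib.SequenceMatcher` as a stand-in for rapidfuzz, so the exact numbers will not match the benchmark's, and `alignment_score` is a hypothetical name.

```python
from difflib import SequenceMatcher


def alignment_score(candidate: str, reference: str) -> float:
    """Return the similarity of two strings as a percentage from 0 to 100."""
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    return 100.0 * matcher.ratio()


reference_text = "Extract structured text\n\nfrom pdfs quickly"  # pseudo-ground truth
candidate_text = "Extract structured text\nfrom pdfs quickly"    # another library's output

print(f"{alignment_score(candidate_text, reference_text):.2f}")
```

`autojunk=False` turns off difflib's heuristic that ignores very common characters in long strings, which would otherwise distort scores on page-length text.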

## Running benchmarks

You can run the benchmarks yourself.  To do so, first install pdftext manually from source.  The install assumes you have poetry and Python 3.10+ installed.

```shell
git clone https://github.com/VikParuchuri/pdftext.git
cd pdftext
poetry install
python benchmark.py # Will download the benchmark pdfs automatically
```

The benchmark script has a few options:

- `--max` controls the maximum number of pdfs to benchmark.
- `--result_path` is a folder to save the results in.  A file called `results.json` will be created in the folder.
- `--pdftext_only` skips running pdfplumber, which can be slow.

# How it works

PDFText is a very light wrapper around pypdfium2.  It first uses pypdfium2 to extract characters in order, along with font and other information.  Then it uses a simple decision tree algorithm to group characters into lines and blocks.  It does some simple postprocessing to clean up the text.
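
To make the grouping step concrete, here is a deliberately simplified sketch.  It is not pdftext's algorithm: the library uses a decision tree, while this example swaps in a single fixed vertical-overlap threshold.  `group_into_lines`, `min_overlap`, and the sample characters are all assumptions made for illustration.

```python
def group_into_lines(chars, min_overlap=0.5):
    """Group (text, bbox) characters, given in reading order, into lines.

    A character joins the current line when its vertical extent overlaps the
    line's extent by at least `min_overlap` of the character's own height.
    """
    lines = []
    for text, (x1, y1, x2, y2) in chars:
        if lines:
            line = lines[-1]
            lx1, ly1, lx2, ly2 = line["bbox"]
            overlap = min(y2, ly2) - max(y1, ly1)
            height = max(y2 - y1, 1e-6)  # avoid dividing by zero
            if overlap / height >= min_overlap:
                line["text"] += text
                line["bbox"] = [min(lx1, x1), min(ly1, y1),
                                max(lx2, x2), max(ly2, y2)]
                continue
        # No current line, or too little overlap: start a new line.
        lines.append({"text": text, "bbox": [x1, y1, x2, y2]})
    return lines


# Invented characters: "H" and "i" share a baseline, "!" sits on the next line.
chars = [("H", [0, 0, 5, 10]), ("i", [5, 0, 8, 10]), ("!", [0, 12, 3, 22])]
for line in group_into_lines(chars):
    print(line["text"], line["bbox"])
```

A learned decision tree can weigh several such features at once (gaps, font changes, overlap), which is the part this single-threshold sketch leaves out.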

# Credits

This is built on some amazing open source work, including:

- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
- [scikit-learn](https://scikit-learn.org/stable/index.html)
- [pypdf](https://github.com/py-pdf/benchmarks) for very thorough and fair benchmarks

Thank you to the [pymupdf](https://github.com/pymupdf/PyMuPDF) devs for creating such a great library - I just wish it had a simpler license!
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/VikParuchuri/pdftext",
    "name": "pdftext",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": "pdf, text, extraction",
    "author": "Vik Paruchuri",
    "author_email": "vik.paruchuri@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/7d/71/980b6890894cc956320e8e5378af71431f0dc1b13022e2ded19971567846/pdftext-0.4.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Extract structured text from pdfs quickly",
    "version": "0.4.0",
    "project_urls": {
        "Homepage": "https://github.com/VikParuchuri/pdftext",
        "Repository": "https://github.com/VikParuchuri/pdftext"
    },
    "split_keywords": [
        "pdf",
        " text",
        " extraction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f9ed322bd513d0da0dfcb7b2b118ce676442175d484c1515478092c6910c4a4a",
                "md5": "3d8adbb3e1072ac56fdd68588fd10a45",
                "sha256": "3947ce8eae504788c450e6d8b4927d3726c41c8dd77197f6496be1768b118525"
            },
            "downloads": -1,
            "filename": "pdftext-0.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3d8adbb3e1072ac56fdd68588fd10a45",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 16598,
            "upload_time": "2024-12-12T16:12:36",
            "upload_time_iso_8601": "2024-12-12T16:12:36.698351Z",
            "url": "https://files.pythonhosted.org/packages/f9/ed/322bd513d0da0dfcb7b2b118ce676442175d484c1515478092c6910c4a4a/pdftext-0.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7d71980b6890894cc956320e8e5378af71431f0dc1b13022e2ded19971567846",
                "md5": "20a5d7e255a5922c04c4d47a6f1bc8cf",
                "sha256": "90819ba233a3ab37fad924c94fdc15b961a032d18306bbaa7f02d5d42b4484ac"
            },
            "downloads": -1,
            "filename": "pdftext-0.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "20a5d7e255a5922c04c4d47a6f1bc8cf",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 16615,
            "upload_time": "2024-12-12T16:12:38",
            "upload_time_iso_8601": "2024-12-12T16:12:38.150205Z",
            "url": "https://files.pythonhosted.org/packages/7d/71/980b6890894cc956320e8e5378af71431f0dc1b13022e2ded19971567846/pdftext-0.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-12 16:12:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "VikParuchuri",
    "github_project": "pdftext",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pdftext"
}
        