tabled-pdf

Name	tabled-pdf
Version	0.1.6
home_page	https://github.com/VikParuchuri/tabled
Summary	Detect and recognize tables in PDFs and images.
upload_time	2024-11-28 11:40:15
maintainer	None
docs_url	None
author	Vik Paruchuri
requires_python	<4.0,>=3.10
license	GPL-3.0-or-later
keywords	table table-recognition ocr pdf

# Tabled

Tabled is a small library for detecting and extracting tables.  It uses [surya](https://www.github.com/VikParuchuri/surya) to find all the tables in a PDF, identifies the rows/columns, and formats cells into markdown, csv, or html.

## Example

![Table image 0](static/images/table_example.png)


| Characteristic     |       |       | Population   |       |       |       | Change from   2016 to 2060   |         |
|--------------------|-------|-------|--------------|-------|-------|-------|------------------------------|---------|
|                    | 2016  | 2020  | 2030         | 2040  | 2050  | 2060  | Number                       | Percent |
| Total population   | 323.1 | 332.6 | 355.1        | 373.5 | 388.9 | 404.5 | 81.4                         | 25.2    |
| Under 18 years     | 73.6  | 74.0  | 75.7         | 77.1  | 78.2  | 80.1  | 6.5                          | 8.8     |
| 18 to 44 years     | 116.0 | 119.2 | 125.0        | 126.4 | 129.6 | 132.7 | 16.7                         | 14.4    |
| 45 to 64 years     | 84.3  | 83.4  | 81.3         | 89.1  | 95.4  | 97.0  | 12.7                         | 15.1    |
| 65 years and over  | 49.2  | 56.1  | 73.1         | 80.8  | 85.7  | 94.7  | 45.4                         | 92.3    |
| 85 years and over  | 6.4   | 6.7   | 9.1          | 14.4  | 18.6  | 19.0  | 12.6                         | 198.1   |
| 100 years and over | 0.1   | 0.1   | 0.1          | 0.2   | 0.4   | 0.6   | 0.5                          | 618.3   |


## Community

[Discord](https://discord.gg/KuZwXNGnfH) is where we discuss future development.

# Hosted API

There is a hosted API for tabled available [here](https://www.datalab.to/):

- Works with PDFs, images, Word documents, and PowerPoint presentations
- Consistent speed, with no latency spikes
- High reliability and uptime

# Commercial usage

I want tabled to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/).  If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).

# Installation

You'll need Python 3.10+ and PyTorch. If you're not using a Mac or a GPU machine, you may need to install the CPU version of torch first.  See [here](https://pytorch.org/get-started/locally/) for more details.

Install with:

```shell
pip install tabled-pdf
```
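Before installing, it can help to confirm that the interpreter meets the requirements above. The following is a minimal sketch, not part of tabled: the helper names `python_supported` and `torch_installed` are made up for illustration, and the version bounds are taken from the package's `requires_python` metadata (`<4.0,>=3.10`).

```python
import importlib.util
import sys


def python_supported(version=None):
    """Return True if (major, minor) satisfies >=3.10,<4.0."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 10) <= (major, minor) < (4, 0)


def torch_installed():
    """Return True if the `torch` package can be located, without importing it."""
    return importlib.util.find_spec("torch") is not None


print("Python version supported:", python_supported())
print("torch found:", torch_installed())
```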

Post-install:

- Inspect the settings in `tabled/settings.py`.  You can override any settings with environment variables.
- Your torch device will be automatically detected, but you can override this.  For example, `TORCH_DEVICE=cuda`.
- Model weights will automatically download the first time you run tabled.
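
As a rough sketch of the environment-variable override, the snippet below builds an environment mapping with `TORCH_DEVICE` set. The helper `env_with_overrides` is hypothetical (it is not part of tabled), and `cuda` is just the example value from the list above.

```python
import os


def env_with_overrides(overrides, base=None):
    """Return a copy of the environment with the given settings applied."""
    env = dict(os.environ if base is None else base)
    # Environment variables are always strings, so convert the values.
    env.update({key: str(value) for key, value in overrides.items()})
    return env


# TORCH_DEVICE is the setting named above; "cuda" is one example value.
env = env_with_overrides({"TORCH_DEVICE": "cuda"})
print(env["TORCH_DEVICE"])
```

To apply the override to a CLI run, the mapping can be passed to `subprocess.run(["tabled", "DATA_PATH"], env=env)`. Setting the variable in your shell before running `tabled` should have the same effect.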

# Usage

```shell
tabled DATA_PATH
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
- `--format` specifies output format for each table (`markdown`, `html`, or `csv`)
- `--save_json` saves additional row and column information in a json file
- `--save_debug_images` saves images showing the detected rows and columns
- `--skip_detection` tells tabled that the images you pass in are already cropped tables, so table detection is skipped.
- `--detect_cell_boxes` detects cells with a detection model.  By default, tabled will attempt to pull cell information out of the pdf, so you usually only need this with pdfs that have bad embedded text.
- `--save_images` saves images of the detected rows/columns and cells.
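
If you call the CLI from a script, the options above can be assembled into an argument list. This is a sketch under assumptions: `build_tabled_command` is a hypothetical helper, it assumes `--format` takes its value as the next argument, and the `markdown` default is this helper's choice rather than documented CLI behavior.

```python
import shlex

# Formats and flags exactly as listed above.
FORMATS = ("markdown", "html", "csv")
FLAGS = ("save_json", "save_debug_images", "skip_detection",
         "detect_cell_boxes", "save_images")


def build_tabled_command(data_path, fmt="markdown", flags=()):
    """Return the argument list for a `tabled` run, validating the options."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    unknown = [flag for flag in flags if flag not in FLAGS]
    if unknown:
        raise ValueError(f"unknown flags: {unknown}")
    return ["tabled", str(data_path), "--format", fmt,
            *(f"--{flag}" for flag in flags)]


cmd = build_tabled_command("scans/", fmt="csv",
                           flags=["save_json", "detect_cell_boxes"])
print(shlex.join(cmd))  # tabled scans/ --format csv --save_json --detect_cell_boxes
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once tabled is installed.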

After running the script, the output directory will contain folders with the same basenames as the input filenames.  Inside those folders will be the markdown files for each table in the source documents.  There will also optionally be images of the tables.
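
The folder layout described above can be walked with `pathlib`. This is a minimal sketch: `collect_table_files` is a hypothetical helper, and the `.md` suffix is an assumption about how the markdown files are named.

```python
from pathlib import Path


def collect_table_files(output_dir, suffix=".md"):
    """Map each per-document folder name to its sorted table files."""
    found = {}
    for folder in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        # Keep only files with the expected suffix (assumed to be ".md").
        found[folder.name] = sorted(
            f for f in folder.iterdir() if f.is_file() and f.suffix == suffix
        )
    return found
```

For example, `collect_table_files("out")` would return a dictionary such as `{"report": [Path("out/report/...")]}` for an input file named `report.pdf`, with the actual file names depending on what tabled writes.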

There will also be a `results.json` file in the root of the output directory. The file will contain a json dictionary where the keys are the input filenames without extensions.  Each value will be a list of dictionaries, one per table in the document.  Each table dictionary contains:

- `cells` - the detected text and bounding boxes for each table cell.
  - `bbox` - bbox of the cell within the table bbox
  - `text` - the text of the cell
  - `row_ids` - ids of rows the cell belongs to
  - `col_ids` - ids of columns the cell belongs to
  - `order` - order of this cell within its assigned row/column.  (Sort by row, then column, then order.)
- `rows` - bboxes of the detected rows
  - `bbox` - bbox of the row in (x1, y1, x2, y2) format
  - `row_id` - unique id of the row
- `cols` - bboxes of detected columns
  - `bbox` - bbox of the column in (x1, y1, x2, y2) format
  - `col_id` - unique id of the column
- `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format.  (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.  The table bbox is relative to this.
- `bbox` - the bounding box of the table within the image bbox.
- `pnum` - page number within the document
- `tnum` - table index on the page
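
To show how these fields fit together, here is a sketch that arranges the cells of one table dictionary into a grid. It is an illustration, not tabled's own formatting code: `table_to_grid` is a hypothetical helper, the sample data is invented, and it assumes `rows` and `cols` are listed top-to-bottom and left-to-right.

```python
def table_to_grid(table):
    """Arrange cell text into a rows x columns grid (list of lists of str)."""
    # Assumption: `rows` and `cols` are listed top-to-bottom and left-to-right.
    row_pos = {row["row_id"]: i for i, row in enumerate(table["rows"])}
    col_pos = {col["col_id"]: j for j, col in enumerate(table["cols"])}
    grid = [[""] * len(col_pos) for _ in row_pos]

    # Skip cells that were not assigned to any row or column.
    cells = [c for c in table["cells"] if c["row_ids"] and c["col_ids"]]
    # Sort by row, then column, then order, as the field description suggests.
    cells.sort(key=lambda c: (min(c["row_ids"]), min(c["col_ids"]), c["order"]))

    for cell in cells:
        for row_id in cell["row_ids"]:
            for col_id in cell["col_ids"]:
                i, j = row_pos[row_id], col_pos[col_id]
                # Join with a space when several cells share one grid slot.
                grid[i][j] = f"{grid[i][j]} {cell['text']}".strip()
    return grid


# Hypothetical table dictionary (bbox fields omitted for brevity).
sample = {
    "rows": [{"row_id": 0}, {"row_id": 1}],
    "cols": [{"col_id": 0}, {"col_id": 1}],
    "cells": [
        {"text": "323.1", "row_ids": [1], "col_ids": [1], "order": 0},
        {"text": "Year", "row_ids": [0], "col_ids": [0], "order": 0},
        {"text": "2016", "row_ids": [0], "col_ids": [1], "order": 0},
        {"text": "Total", "row_ids": [1], "col_ids": [0], "order": 0},
        {"text": "population", "row_ids": [1], "col_ids": [0], "order": 1},
    ],
}
for row in table_to_grid(sample):
    print(row)
```

With real output, you would load `results.json` with `json.load`, look up a document by its filename without extension, and pass each table dictionary in the list to the helper.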

## Interactive App

I've included a streamlit app that lets you interactively try tabled on images or PDF files.  Run it with:

```shell
pip install streamlit
tabled_gui
```

## From python

```python
from tabled.extract import extract_tables
from tabled.fileinput import load_pdfs_images
from tabled.inference.models import load_detection_models, load_recognition_models

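# Placeholder: replace with the path to your own input file.
IN_PATH = "path/to/input.pdf"

# Load the detection and recognition models (weights download on first run).
det_models, rec_models = load_detection_models(), load_recognition_models()
# Load page images (standard and high resolution), file names, and text lines.
images, highres_images, names, text_lines = load_pdfs_images(IN_PATH)

# Detect and recognize the tables on the loaded pages.
page_results = extract_tables(images, highres_images, text_lines, det_models, rec_models)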
```

# Benchmarks

|   Avg score |   Time per table (s) |   Total tables |
|-------------|----------------------|----------------|
|       0.847 |                0.029 |            688 |

## Quality

Getting good ground truth data for tables is hard, since you're either constrained to simple layouts that can be heuristically parsed and rendered, or you need to use LLMs, which make mistakes.  I chose to use GPT-4 table predictions as a pseudo-ground-truth.

Tabled gets a `0.847` alignment score when compared to GPT-4.  The score measures how closely the text in table rows/cells lines up between the two outputs.  Some of the misalignments are due to GPT-4 mistakes, or small inconsistencies in what GPT-4 considered the borders of the table.  In general, extraction quality is quite high.
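
To give a feel for what a text alignment score can look like, here is an illustrative sketch using `difflib` from the standard library. This is not the metric used by the benchmark script, whose exact definition is not given here; `row_alignment_score` and its best-match averaging are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def row_alignment_score(predicted_rows, reference_rows):
    """Average, over reference rows, of the best similarity to any predicted row."""
    if not reference_rows:
        return 0.0
    scores = []
    for ref in reference_rows:
        # Similarity ratio is in [0, 1]; 1.0 means the strings are identical.
        best = max(
            (SequenceMatcher(None, ref, pred).ratio() for pred in predicted_rows),
            default=0.0,
        )
        scores.append(best)
    return sum(scores) / len(scores)


reference = ["Total population 323.1 332.6", "Under 18 years 73.6 74.0"]
predicted = ["Total population 323.1 332.6", "Under 18 years 73.6 74.0"]
print(row_alignment_score(predicted, reference))  # 1.0 for identical rows
```

A score below 1.0 under a measure like this can come from either side being wrong, which is why a pseudo-ground-truth comparison should be read as agreement rather than accuracy.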

## Performance

Running on an A10G, using 10GB of VRAM with batch size `64`, tabled takes `0.029` seconds per table.
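
As a quick back-of-the-envelope check, the per-table time can be turned into a total time and a throughput. This sketch assumes the per-table figure is an average over all 688 benchmark tables and ignores model loading time.

```python
# Figures from the benchmark table above.
seconds_per_table = 0.029
total_tables = 688

# Assumption: 0.029 s is the mean over all 688 benchmark tables.
total_seconds = seconds_per_table * total_tables
tables_per_second = 1 / seconds_per_table

print(f"Estimated total time: {total_seconds:.1f} s")     # 20.0 s
print(f"Throughput: {tables_per_second:.1f} tables/s")    # 34.5 tables/s
```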

## Running the benchmark

Run the benchmark with:

```shell
python benchmarks/benchmark.py out.json
```

# Acknowledgements

- Thank you to [Peter Jansen](https://cognitiveai.org/) for the benchmarking dataset, and for discussion about table parsing.
- Hugging Face for inference code and model hosting
- PyTorch for training/inference

Raw data

{
    "_id": null,
    "home_page": "https://github.com/VikParuchuri/tabled",
    "name": "tabled-pdf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": "table, table-recognition, ocr, pdf",
    "author": "Vik Paruchuri",
    "author_email": "github@vikas.sh",
    "download_url": "https://files.pythonhosted.org/packages/9b/2c/f34b0d88cd14a3567f0f72a9a0d4a1b1a20615f0c244e7d20186323345b7/tabled_pdf-0.1.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-3.0-or-later",
    "summary": "Detect and recognize tables in PDFs and images.",
    "version": "0.1.6",
    "project_urls": {
        "Homepage": "https://github.com/VikParuchuri/tabled",
        "Repository": "https://github.com/VikParuchuri/tabled"
    },
    "split_keywords": [
        "table",
        " table-recognition",
        " ocr",
        " pdf"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fcd829dc086bb1dd177efa62d250b8316ecca47b392fc4960c6570a0b323c1b1",
                "md5": "181c101bf899c8ed663017ffb9f95646",
                "sha256": "c88c8af01b5e82ee073e045d200537b7396e46644710ad4b3d560c60813a78fc"
            },
            "downloads": -1,
            "filename": "tabled_pdf-0.1.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "181c101bf899c8ed663017ffb9f95646",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 31489,
            "upload_time": "2024-11-28T11:40:14",
            "upload_time_iso_8601": "2024-11-28T11:40:14.554373Z",
            "url": "https://files.pythonhosted.org/packages/fc/d8/29dc086bb1dd177efa62d250b8316ecca47b392fc4960c6570a0b323c1b1/tabled_pdf-0.1.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9b2cf34b0d88cd14a3567f0f72a9a0d4a1b1a20615f0c244e7d20186323345b7",
                "md5": "490d66014b38bff82c655ce07881e647",
                "sha256": "3beef106596e078e9bb53d9f4199528da91b7489adb87f46df280373214130e2"
            },
            "downloads": -1,
            "filename": "tabled_pdf-0.1.6.tar.gz",
            "has_sig": false,
            "md5_digest": "490d66014b38bff82c655ce07881e647",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 28860,
            "upload_time": "2024-11-28T11:40:15",
            "upload_time_iso_8601": "2024-11-28T11:40:15.727579Z",
            "url": "https://files.pythonhosted.org/packages/9b/2c/f34b0d88cd14a3567f0f72a9a0d4a1b1a20615f0c244e7d20186323345b7/tabled_pdf-0.1.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-28 11:40:15",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "VikParuchuri",
    "github_project": "tabled",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "tabled-pdf"
}
