pdf-table2json


- Name: pdf-table2json
- Version: 2.0.1
- Home page: https://github.com/yousojeong/pdf_table2json/
- Summary: PDF Table to JSON Converter
- Upload time: 2023-10-12 19:41:45
- Author: hielosan
- Requires Python: >=3.8
- License: GNU AFFERO GPL 3.0
- Keywords: pdf, table, json, converter, cv, opencv
- Requirements: No requirements were recorded.

# pdf-table2json
Extract table data from PDF files to JSON.

- Locates tables with OpenCV and reads their contents with a text reader, so each table must be enclosed by a border.
- (If a table has no border, add one by adjusting the values listed under Current Status.)
- The PDF must contain text that a text reader can extract. Drag over the PDF to check whether the text can be selected.
- Check the version before installing:
    - [1.0.1](#version-101): only basic tables are supported
    - [2.0.1](#version-201-or-higher): handles tables with split headers or cells [(example)](#version-201-or-higher)

## Current Status (change the values if adjustments are needed)
- Finds tables wider than 1000 and taller than 100.
- Excludes cells whose width or height equals that of the table, or whose width or height is less than 10.
- Adds border lines to areas with a color of (230, 230, 230) and a width of 1000 or more so that they are recognized as table regions.
- Removes watermark images with a color of (213, 213, 213), i.e. #D5D5D5.
- Removes the strings in a specific list from the PDF text in order to strip text watermarks (the list is currently empty).
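
The size rules above can be sketched as two small predicates. This is only an illustration of the stated thresholds, not the package's own code: the names `MIN_TABLE_WIDTH`, `MIN_TABLE_HEIGHT`, `MIN_CELL_SIZE`, `is_table` and `keep_cell` are hypothetical, and the values are assumed to be pixels of the rendered page image.

```py
# Hypothetical constants mirroring the thresholds listed above (assumed to be pixels).
MIN_TABLE_WIDTH = 1000
MIN_TABLE_HEIGHT = 100
MIN_CELL_SIZE = 10


def is_table(width: int, height: int) -> bool:
    """A region counts as a table if it is wider than 1000 and taller than 100."""
    return width > MIN_TABLE_WIDTH and height > MIN_TABLE_HEIGHT


def keep_cell(cell_w: int, cell_h: int, table_w: int, table_h: int) -> bool:
    """Drop cells that span the whole table in one dimension or are smaller than 10."""
    spans_table = cell_w == table_w or cell_h == table_h
    too_small = cell_w < MIN_CELL_SIZE or cell_h < MIN_CELL_SIZE
    return not (spans_table or too_small)
```

If your PDFs render at a different resolution, these are the kind of values you would likely need to change.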

## Installation
- Requires Python >= 3.8
- Install with pip
```sh
pip install pdf-table2json
```

## Example
#### import
```py
import pdf_table2json.converter as converter

path = "PATH/PDF_NAME.pdf"
result = converter.main(path, json_file_out=True, image_file_out=True)
print(result)
```

#### CLI
```sh
python converter.py -i "pdf_path/pdf_name.pdf" [-j] [-o]
```
- `-i`, `--input`: [Required] Input PDF file path
- `-j`, `--json_file`: [Optional] Flag; create the JSON data file
- `-o`, `--image_file`: [Optional] Flag; save the image data file
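
The three options can be reproduced with a plain `argparse` parser. The sketch below is a reconstruction from the option list only: `build_parser` is a hypothetical name, and how `converter.py` actually wires the parsed values into `converter.main` is not shown here.

```py
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="PDF Table to JSON Converter")
    parser.add_argument("-i", "--input", required=True,
                        help="[Required] Input PDF file path")
    parser.add_argument("-j", "--json_file", action="store_true",
                        help="[Optional] Create JSON Data file")
    parser.add_argument("-o", "--image_file", action="store_true",
                        help="[Optional] Save Image Data file")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["-i", "pdf_path/pdf_name.pdf", "-j"])
print(args.input, args.json_file, args.image_file)
# prints: pdf_path/pdf_name.pdf True False
```

Because `-j` and `-o` are `store_true` flags, they default to `False` when omitted.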



## version-1.0.1
- Only basic tables are supported, that is, tables with horizontal headers.
- The number of headers must equal the number of cells in each row.
- Example Table
    | Header 1 | Header 2 | Header 3 |
    |:--------:|:--------:|:--------:|
    |   cel1   |   cel2   |   cel3   |
    |   cel1   |   cel2   |   cel3   |
    |   cel1   |   cel2   |   cel3   |

- Example
    - `converter.py`
    
    ```py
    import pdf_table2json.converter as converter

    path = "PATH/PDF_NAME.pdf"
    result = converter.main(path, json_file_out=True, image_file_out=True)
    print(result)
    ```
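
To illustrate why the header and cell counts must match, here is a minimal sketch of turning a basic table into JSON records. The function `rows_to_records` and the list-of-objects output shape are assumptions made for illustration; the package's actual JSON layout may differ.

```py
import json


def rows_to_records(headers, rows):
    """Pair each row's cells with the headers; the counts must match."""
    records = []
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("each row needs exactly one cell per header")
        records.append(dict(zip(headers, row)))
    return records


headers = ["Header 1", "Header 2", "Header 3"]
rows = [["cel1", "cel2", "cel3"] for _ in range(3)]
print(json.dumps(rows_to_records(headers, rows), indent=2))
```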

## version-2.0.1 or Higher
1. Basic tables that version 1.0.1 can process are still supported.
    - Example Table
        | Header 1 | Header 2 | Header 3 |
        |:--------:|:--------:|:--------:|
        |   cel1   |   cel2   |   cel3   |
        |   cel1   |   cel2   |   cel3   |
        |   cel1   |   cel2   |   cel3   |

2. Tables with a header split into a parent header and subheaders
    - Example Table
        <table>
        <tr>
        <th rowspan="2">Header 1</th>
        <th style="text-align:center" colspan="2">Header 2</th>
        </tr>
        <tr>
        <th>Sub Header 1</th>
        <th>Sub Header 2</th>
        </tr>
        <tr>
        <td>cel1</td>
        <td>cel2</td>
        <td>cel3</td>
        </tr>
        <tr>
        <td>cel1</td>
        <td>cel2</td>
        <td>cel3</td>
        </tr>
        <tr>
        <td>cel1</td>
        <td>cel2</td>
        <td>cel3</td>
        </tr>
        </table>

    - Output
        - The split parent header is dropped and the child headers are used

            ```
            Header 1 : cel1
            Sub Header 1 : cel2
            Sub Header 2 : cel3
            ```

3. Tables with rows split in every column except the first cell
    - Example Table
        <table>
        <tr>
        <th>Header 1</th>
        <th>Header 2</th>
        <th>Header 3</th>
        </tr>
        <tr>
        <td>cel1</td>
        <td>cel2</td>
        <td>cel3</td>
        </tr>
        <tr>
        <td rowspan=2>cel1</td>
        <td>cel2-1</td>
        <td>cel3-1</td>
        </tr>
        <tr>
        <td>cel2-2</td>
        <td>cel3-2</td>
        </tr>
        </table>

    - Output
        - The split values are appended to the data in the top row, joined with "@"
            
            ```
            Header 1 : cel1
            Header 2 : cel2
            Header 3 : cel3
            Header 1 : cel1
            Header 2 : cel2-1@cel2-2
            Header 3 : cel3-1@cel3-2
            ```
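
The two behaviors in cases 2 and 3 can be sketched in plain Python. This is a hedged illustration of the documented output only: `flatten_headers` and `merge_split_rows` are hypothetical helpers, and the `(text, colspan)` input format is an assumption, not the package's internal representation.

```py
from itertools import islice


def flatten_headers(parent_row, child_row):
    """Replace each split parent header with its child headers.

    parent_row is a list of (text, colspan) pairs; a colspan of 1 means the
    header is not split and is kept as-is.
    """
    children = iter(child_row)
    keys = []
    for text, colspan in parent_row:
        if colspan == 1:
            keys.append(text)
        else:
            # Drop the parent and take the next `colspan` child headers instead.
            keys.extend(islice(children, colspan))
    return keys


def merge_split_rows(top_row, sub_rows, sep="@"):
    """Join the cells of split sub-rows onto the top row, column by column.

    Sub-rows omit the first column because that cell spans all of them.
    """
    merged = list(top_row)
    for sub in sub_rows:
        for col, value in enumerate(sub, start=1):
            merged[col] = f"{merged[col]}{sep}{value}"
    return merged


keys = flatten_headers([("Header 1", 1), ("Header 2", 2)],
                       ["Sub Header 1", "Sub Header 2"])
print(keys)  # ['Header 1', 'Sub Header 1', 'Sub Header 2']

row = merge_split_rows(["cel1", "cel2-1", "cel3-1"], [["cel2-2", "cel3-2"]])
print(row)  # ['cel1', 'cel2-1@cel2-2', 'cel3-1@cel3-2']
```

Splitting a merged value back into its parts is then a matter of `value.split("@")`, provided the cell text itself contains no "@".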

- Usage example
    - `converter_2.py`

    ```py
    import pdf_table2json.converter_2 as converter_2

    path = "PATH/PDF_NAME.pdf"
    result = converter_2.main(path, json_file_out=True, image_file_out=True)
    print(result)
    ```



## License
- GPL-3.0 license

## Contact
- [Reporting a bug](https://github.com/yousojeong/pdf-table-extract/issues)
- [@yousojeong](https://github.com/yousojeong)

## Library used to read text from PDFs
- PyMuPDF [GitHub](https://github.com/pymupdf/PyMuPDF)

            
