| Field | Value |
|---|---|
| Name | img2table |
| Version | 1.4.0 |
| Summary | img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing |
| home_page | https://github.com/xavctn/img2table |
| upload_time | 2024-11-11 19:07:31 |
| maintainer | None |
| docs_url | None |
| author | Xavier Canton |
| requires_python | <3.13,>=3.8 |
| license | MIT |
| keywords | None |
| VCS | |
| bugtrack_url | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# img2table
`img2table` is a simple, easy to use, table identification and extraction Python Library based on [OpenCV](https://opencv.org/) image
processing that supports most common image file formats as well as PDF files.
Thanks to its design, it provides a practical and lighter alternative to neural network based solutions, especially for usage on CPU.
## Table of contents
* [Installation](#installation)
* [Features](#features)
* [Supported file formats](#supported-file-formats)
* [Usage](#usage)
* [Documents](#documents)
* [Images](#images-doc)
* [PDF](#pdf-doc)
* [Supported OCRs](#ocr)
* [Table extraction](#table-extract)
* [Excel export](#xlsx)
* [Examples](#examples)
* [Caveats / FYI](#fyi)
## Installation <a name="installation"></a>
The library can be installed via pip:
> <code>pip install img2table</code>: Standard installation, supporting Tesseract<br>
> <code>pip install img2table[paddle]</code>: For usage with Paddle OCR<br>
> <code>pip install img2table[easyocr]</code>: For usage with EasyOCR<br>
> <code>pip install img2table[surya]</code>: For usage with Surya OCR<br>
> <code>pip install img2table[gcp]</code>: For usage with Google Vision OCR<br>
> <code>pip install img2table[aws]</code>: For usage with AWS Textract OCR<br>
> <code>pip install img2table[azure]</code>: For usage with Azure Cognitive Services OCR
## Features <a name="features"></a>
* Table identification for images and PDF files, including bounding boxes at the table cell level
* Handling of complex table structures such as merged cells
* Handling of implicit content - see [example](/examples/Implicit.ipynb)
* Table content extraction by providing support for OCR services / tools
* Extracted tables are returned as a simple object, including a Pandas DataFrame representation
* Export extracted tables to an Excel file, preserving their original structure
## Supported file formats <a name="supported-file-formats"></a>
### Images <a name="images-formats"></a>
Images are loaded using the `opencv-python` library; supported formats are listed below.
<details>
<summary>Supported image formats</summary>
<br>
<blockquote>
<ul>
<li>Windows bitmaps - *.bmp, *.dib</li>
<li>JPEG files - *.jpeg, *.jpg, *.jpe</li>
<li>JPEG 2000 files - *.jp2</li>
<li>Portable Network Graphics - *.png</li>
<li>WebP - *.webp</li>
<li>Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm</li>
<li>PFM files - *.pfm</li>
<li>Sun rasters - *.sr, *.ras</li>
<li>TIFF files - *.tiff, *.tif</li>
<li>OpenEXR Image files - *.exr</li>
<li>Radiance HDR - *.hdr, *.pic</li>
<li>Raster and Vector geospatial data supported by GDAL<br>
<cite><a href="https://docs.opencv.org/4.x/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56">OpenCV: Image file reading and writing</a></cite></li>
</ul>
</blockquote>
</details>
Multi-page images are not supported.
---
### PDF <a name="pdf-formats"></a>
Both native and scanned PDF files are supported.
## Usage <a name="usage"></a>
### Documents <a name="documents"></a>
#### Images <a name="images-doc"></a>
Images are instantiated as follows:
```python
from img2table.document import Image

image = Image(src,
              detect_rotation=False)
```
> <h4>Parameters</h4>
><dl>
> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt>
> <dd style="font-style: italic;">Image source</dd>
> <dt>detect_rotation : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Detect and correct skew/rotation of the image</dd>
></dl>
<br>
The implemented method for handling skewed/rotated images supports skew angles up to 45° and is
based on the publication by <a href="https://www.mdpi.com/2079-9292/9/1/55">Huang, 2020</a>.<br>
When the <code>detect_rotation</code> parameter is set to <code>True</code>, image coordinates and bounding boxes returned by other
methods might not correspond to the original image.
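As an illustration, here is a minimal sketch of skew handling (the file name is hypothetical):
```python
from img2table.document import Image

# Hypothetical skewed scan; skew/rotation is corrected before table detection
image = Image("skewed_scan.png", detect_rotation=True)

# Without an OCR instance, only table structure is identified (no cell content).
# Returned bounding boxes may refer to the rotation-corrected image rather than
# the original file.
tables = image.extract_tables()
```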
#### PDF <a name="pdf-doc"></a>
PDF files are instantiated as follows:
```python
from img2table.document import PDF

pdf = PDF(src,
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)
```
> <h4>Parameters</h4>
><dl>
> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt>
> <dd style="font-style: italic;">PDF source</dd>
> <dt>pages : list, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">List of PDF page indexes to be processed. If None, all pages are processed</dd>
> <dt>detect_rotation : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Detect and correct skew/rotation of extracted images from the PDF</dd>
> <dt>pdf_text_extraction : bool, optional, default <code>True</code></dt>
> <dd style="font-style: italic;">Extract text from the PDF file for native PDFs</dd>
></dl>
PDF pages are converted to images at 200 DPI for table identification.
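For instance, a minimal sketch of loading a PDF from an in-memory buffer and restricting processing to its first two pages (the file name is hypothetical):
```python
import io

from img2table.document import PDF

# PDF also accepts a path, raw bytes or an io.BytesIO buffer as source
with open("report.pdf", "rb") as f:
    pdf = PDF(io.BytesIO(f.read()),
              pages=[0, 1],            # only the first two pages are processed
              pdf_text_extraction=True)
```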
---
### OCR <a name="ocr"></a>
`img2table` provides an interface for several OCR services and tools in order to parse table content.<br>
If possible (i.e. for native PDFs), PDF text will be extracted directly from the file and the OCR service/tool will not be called.
<details>
<summary>Tesseract<a name="tesseract"></a></summary>
<br>
```python
from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1,
                   lang="eng",
                   psm=11,
                   tessdata_dir="...")
```
> <h4>Parameters</h4>
><dl>
> <dt>n_threads : int, optional, default <code>1</code></dt>
> <dd style="font-style: italic;">Number of concurrent threads used to call Tesseract</dd>
> <dt>lang : str, optional, default <code>"eng"</code></dt>
> <dd style="font-style: italic;">Lang parameter used in Tesseract for text extraction</dd>
> <dt>psm : int, optional, default <code>11</code></dt>
> <dd style="font-style: italic;">PSM parameter used in Tesseract, run <code>tesseract --help-psm</code> for details</dd>
> <dt>tessdata_dir : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Directory containing Tesseract traineddata files. If None, the <code>TESSDATA_PREFIX</code> env variable is used.</dd>
></dl>
*Usage of [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) requires prior installation.
Check [documentation](https://tesseract-ocr.github.io/tessdoc/) for instructions.*
<br>
*For Windows users getting environment variable errors, you can check this [tutorial](https://linuxhint.com/install-tesseract-windows/)*
<br>
</details>
<details>
<summary>PaddleOCR<a name="paddle"></a></summary>
<br>
<a href="https://github.com/PaddlePaddle/PaddleOCR">PaddleOCR</a> is an open-source OCR based on Deep Learning models.<br>
At first use, the relevant language models will be downloaded.
```python
from img2table.ocr import PaddleOCR

ocr = PaddleOCR(lang="en",
                kw={"kwarg": kw_value, ...})
```
> <h4>Parameters</h4>
><dl>
> <dt>lang : str, optional, default <code>"en"</code></dt>
> <dd style="font-style: italic;">Lang parameter used in Paddle for text extraction, check <a href="https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations">documentation for available languages</a></dd>
> <dt>kw : dict, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the PaddleOCR constructor.</dd>
></dl>
<br>
<b>NB:</b> For usage of PaddleOCR with GPU, the CUDA-specific version of paddlepaddle-gpu has to be installed manually by the user,
as stated in this <a href="https://github.com/PaddlePaddle/PaddleOCR/issues/7993">issue</a>.
```bash
# Example of installation with CUDA 11.8
pip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install paddleocr img2table
```
If you get an error trying to run PaddleOCR on Ubuntu,
please check this <a href="https://github.com/PaddlePaddle/PaddleOCR/discussions/9989#discussioncomment-6642037">issue</a> for a working solution.
<br>
</details>
<details>
<summary>EasyOCR<a name="easyocr"></a></summary>
<br>
<a href="https://github.com/JaidedAI/EasyOCR">EasyOCR</a> is an open-source OCR based on Deep Learning models.<br>
At first use, the relevant language models will be downloaded.
```python
from img2table.ocr import EasyOCR

ocr = EasyOCR(lang=["en"],
              kw={"kwarg": kw_value, ...})
```
> <h4>Parameters</h4>
><dl>
> <dt>lang : list, optional, default <code>["en"]</code></dt>
> <dd style="font-style: italic;">Lang parameter used in EasyOCR for text extraction, check <a href="https://www.jaided.ai/easyocr">documentation for available languages</a></dd>
> <dt>kw : dict, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the EasyOCR <code>Reader</code> constructor.</dd>
></dl>
<br>
</details>
<details>
<summary>docTR<a name="docTR"></a></summary>
<br>
<a href="https://github.com/mindee/doctr">docTR</a> is an open-source OCR based on Deep Learning models.<br>
*In order to be used, docTR has to be installed by the user beforehand. Installation procedures are detailed in
the package documentation*
```python
from img2table.ocr import DocTR

ocr = DocTR(detect_language=False,
            kw={"kwarg": kw_value, ...})
```
> <h4>Parameters</h4>
><dl>
> <dt>detect_language : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Parameter indicating if language prediction is run on the document</dd>
> <dt>kw : dict, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the docTR <code>ocr_predictor</code> method.</dd>
></dl>
<br>
</details>
<details>
<summary>Surya OCR<a name="surya"></a></summary>
<br>
<b><i>Only available for <code>python >= 3.10</code></i></b><br>
<a href="https://github.com/VikParuchuri/surya">Surya</a> is an open-source OCR based on Deep Learning models.<br>
At first use, relevant models will be downloaded.
```python
from img2table.ocr import SuryaOCR
ocr = SuryaOCR(langs=["en"])
```
> <h4>Parameters</h4>
><dl>
> <dt>langs : list, optional, default <code>["en"]</code></dt>
> <dd style="font-style: italic;">Lang parameter used in Surya OCR for text extraction</dd>
></dl>
<br>
</details>
<details>
<summary>Google Vision<a name="vision"></a></summary>
<br>
Authentication to GCP can be done by setting the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable.<br>
If this variable is missing, an API key should be provided via the `api_key` parameter.
```python
from img2table.ocr import VisionOCR
ocr = VisionOCR(api_key="api_key", timeout=15)
```
> <h4>Parameters</h4>
><dl>
> <dt>api_key : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Google Vision API key</dd>
> <dt>timeout : int, optional, default <code>15</code></dt>
> <dd style="font-style: italic;">API requests timeout, in seconds</dd>
></dl>
<br>
</details>
<details>
<summary>AWS Textract<a name="textract"></a></summary>
<br>
When using AWS Textract, only the DetectDocumentText API is called.
Authentication to AWS can be done by passing credentials to the `TextractOCR` class.<br>
If credentials are not provided, authentication is done using environment variables or configuration files.
Check `boto3` [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for more details.
```python
from img2table.ocr import TextractOCR

ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")
```
> <h4>Parameters</h4>
><dl>
> <dt>aws_access_key_id : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">AWS access key id</dd>
> <dt>aws_secret_access_key : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">AWS secret access key</dd>
> <dt>aws_session_token : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">AWS temporary session token</dd>
> <dt>region : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">AWS server region</dd>
></dl>
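As a sketch, when relying on the default `boto3` credential chain (environment variables, shared credentials file or instance role), only the region needs to be passed explicitly:
```python
from img2table.ocr import TextractOCR

# Credentials are resolved by boto3 (environment variables, config files, IAM role)
ocr = TextractOCR(region="eu-west-1")
```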
<br>
</details>
<details>
<summary>Azure Cognitive Services<a name="azure"></a></summary>
<br>
```python
from img2table.ocr import AzureOCR

ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")
```
> <h4>Parameters</h4>
><dl>
> <dt>endpoint : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Azure Cognitive Services endpoint. If None, inferred from the <code>COMPUTER_VISION_ENDPOINT</code> environment variable.</dd>
> <dt>subscription_key : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Azure Cognitive Services subscription key. If None, inferred from the <code>COMPUTER_VISION_SUBSCRIPTION_KEY</code> environment variable.</dd>
></dl>
<br>
</details>
---
### Table extraction <a name="table-extract"></a>
Multiple tables can be extracted at once from a PDF page or an image using the `extract_tables` method of a document.
```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=False,
                                      implicit_columns=False,
                                      borderless_tables=False,
                                      min_confidence=50)
```
> <h4>Parameters</h4>
><dl>
> <dt>ocr : OCRInstance, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">OCR instance used to parse document text. If None, cells content will not be extracted</dd>
> <dt>implicit_rows : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if implicit rows should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd>
> <dt>implicit_columns : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if implicit columns should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd>
> <dt>borderless_tables : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if <a href="/examples/borderless.ipynb" target="_self">borderless tables</a> are extracted <b>on top of</b> bordered tables.</dd>
> <dt>min_confidence : int, optional, default <code>50</code></dt>
> <dd style="font-style: italic;">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd>
></dl>
<b>NB</b>: Borderless table extraction can, by design, only extract tables with 3 or more columns.
#### Method return
The [`ExtractedTable`](/src/img2table/tables/objects/extraction.py#L35) class is used to model extracted tables from documents.
> <h4>Attributes</h4>
><dl>
> <dt>bbox : <code>BBox</code></dt>
> <dd style="font-style: italic;">Table bounding box</dd>
> <dt>title : str</dt>
> <dd style="font-style: italic;">Extracted title of the table</dd>
> <dt>content : <code>OrderedDict</code></dt>
> <dd style="font-style: italic;">Dict with row indexes as keys and list of <code>TableCell</code> objects as values</dd>
> <dt>df : <code>pd.DataFrame</code></dt>
> <dd style="font-style: italic;">Pandas DataFrame representation of the table</dd>
> <dt>html : <code>str</code></dt>
> <dd style="font-style: italic;">HTML representation of the table</dd>
></dl>
<br>
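As an example, assuming `table` is one of the extracted tables, its attributes can be used as follows (a minimal sketch):
```python
# Title and bounding box of the table within the document
print(table.title)
print(table.bbox.x1, table.bbox.y1, table.bbox.x2, table.bbox.y2)

# Pandas DataFrame and HTML representations of the table content
df = table.df
html_string = table.html
```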
In order to access bounding boxes at the cell level, you can use the following code snippet:
```python
for id_row, row in enumerate(table.content.values()):
    for id_col, cell in enumerate(row):
        x1 = cell.bbox.x1
        y1 = cell.bbox.y1
        x2 = cell.bbox.x2
        y2 = cell.bbox.y2
        value = cell.value
```
<h5 style="color:grey">Images</h5>
The `extract_tables` method of the `Image` class returns a list of `ExtractedTable` objects.
```Python
output = [ExtractedTable(...), ExtractedTable(...), ...]
```
<h5 style="color:grey">PDF</h5>
The `extract_tables` method of the `PDF` class returns an `OrderedDict` with page indexes as keys and lists of `ExtractedTable` objects as values.
```Python
output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
```
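For example, a minimal sketch of iterating over the result of a PDF extraction, page by page:
```python
# extracted_tables is the OrderedDict returned by PDF.extract_tables
for page_idx, page_tables in extracted_tables.items():
    for table in page_tables:
        print(f"Page {page_idx} - {table.title}")
        print(table.df)
```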
### Excel export <a name="xlsx"></a>
Tables extracted from a document can be exported to an xlsx file. The resulting file is composed of one worksheet per extracted table.<br>
Method arguments are mostly shared with the `extract_tables` method.
```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Extraction of tables and creation of a xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=False,
            implicit_columns=False,
            borderless_tables=False,
            min_confidence=50)
```
> <h4>Parameters</h4>
><dl>
> <dt>dest : str, <code>pathlib.Path</code> or <code>io.BytesIO</code>, required</dt>
> <dd style="font-style: italic;">Destination for xlsx file</dd>
> <dt>ocr : OCRInstance, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">OCR instance used to parse document text. If None, cells content will not be extracted</dd>
> <dt>implicit_rows : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if implicit rows should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd>
> <dt>implicit_columns : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if implicit columns should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd>
> <dt>borderless_tables : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if <a href="/examples/borderless.ipynb" target="_self">borderless tables</a> are extracted. An OCR instance must be provided to the method for this to be performed - <b>feature in alpha version</b></dd>
> <dt>min_confidence : int, optional, default <code>50</code></dt>
> <dd style="font-style: italic;">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd>
></dl>
> <h4>Returns</h4>
> If an <code>io.BytesIO</code> buffer is passed as the <code>dest</code> argument, it is returned containing the xlsx data
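For instance, a minimal sketch of exporting to an in-memory buffer instead of a file on disk (reusing the `doc` and `ocr` objects from the snippet above):
```python
import io

# The same BytesIO buffer is returned, filled with the xlsx data
buffer = doc.to_xlsx(dest=io.BytesIO(), ocr=ocr)

with open("extracted_tables.xlsx", "wb") as f:
    f.write(buffer.getvalue())
```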
## Examples <a name="examples"></a>
Several Jupyter notebooks with examples are available:
<ul>
<li>
<a href="/examples/Basic_usage.ipynb" target="_self">Basic usage</a>: generic library usage, including examples with images, PDF and OCRs
</li>
<li>
<a href="/examples/borderless.ipynb" target="_self">Borderless tables</a>: specific examples dedicated to the extraction of borderless tables
</li>
<li>
<a href="/examples/Implicit.ipynb" target="_self">Implicit content</a>: illustrated effect
of the parameter <code>implicit_rows</code>/<code>implicit_columns</code> of the <code>extract_tables</code> method
</li>
</ul>
## Caveats / FYI <a name="fyi"></a>
<ul>
<li>
For table extraction, results are highly dependent on OCR quality. By design, tables where no OCR data
can be found are not returned.
</li>
<li>
The library is tailored for usage on documents with white/light backgrounds.
Effectiveness cannot be guaranteed on other types of documents.
</li>
<li>
Table detection using only OpenCV processing can have some limitations. If the library fails to detect tables,
you may want to check CNN-based solutions.
</li>
</ul>
Raw data
{
"_id": null,
"home_page": "https://github.com/xavctn/img2table",
"name": "img2table",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Xavier Canton",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/72/84/0b81a76b61dcaff06624d0924b50ccf33623988136bf6fd02ea9d677de72/img2table-1.4.0.tar.gz",
"platform": null,
"description": "# img2table\n\n`img2table` is a simple, easy to use, table identification and extraction Python Library based on [OpenCV](https://opencv.org/) image \nprocessing that supports most common image file formats as well as PDF files.\n\nThanks to its design, it provides a practical and lighter alternative to Neural Networks based solutions, especially for usage on CPU.\n\n## Table of contents\n* [Installation](#installation)\n* [Features](#features)\n* [Supported file formats](#supported-file-formats)\n* [Usage](#usage)\n * [Documents](#documents)\n * [Images](#images-doc)\n * [PDF](#pdf-doc)\n * [Supported OCRs](#ocr)\n * [Table extraction](#table-extract)\n * [Excel export](#xlsx)\n* [Examples](#examples)\n* [Caveats / FYI](#fyi)\n\n\n## Installation <a name=\"installation\"></a>\nThe library can be installed via pip:\n\n> <code>pip install img2table</code>: Standard installation, supporting Tesseract<br>\n> <code>pip install img2table[paddle]</code>: For usage with Paddle OCR<br>\n> <code>pip install img2table[easyocr]</code>: For usage with EasyOCR<br>\n> <code>pip install img2table[surya]</code>: For usage with Surya OCR<br>\n> <code>pip install img2table[gcp]</code>: For usage with Google Vision OCR<br>\n> <code>pip install img2table[aws]</code>: For usage with AWS Textract OCR<br>\n> <code>pip install img2table[azure]</code>: For usage with Azure Cognitive Services OCR\n\n## Features <a name=\"features\"></a>\n\n* Table identification for images and PDF files, including bounding boxes at the table cell level\n* Handling of complex table structures such as merged cells\n* Handling of implicit content - see [example](/examples/Implicit.ipynb)\n* Table content extraction by providing support for OCR services / tools\n* Extracted tables are returned as a simple object, including a Pandas DataFrame representation\n* Export extracted tables to an Excel file, preserving their original structure\n\n## Supported file formats <a name=\"supported-file-formats\"></a>\n\n### Images <a name=\"images-formats\"></a>\n\nImages are loaded using the `opencv-python` library, supported formats are listed below.\n\n<details>\n<summary>Supported image formats</summary>\n<br>\n\n<blockquote>\n<ul>\n<li>Windows bitmaps - <em>.bmp, </em>.dib</li>\n<li>JPEG files - <em>.jpeg, </em>.jpg, *.jpe</li>\n<li>JPEG 2000 files - *.jp2</li>\n<li>Portable Network Graphics - *.png</li>\n<li>WebP - *.webp</li>\n<li>Portable image format - <em>.pbm, </em>.pgm, <em>.ppm </em>.pxm, *.pnm</li>\n<li>PFM files - *.pfm</li>\n<li>Sun rasters - <em>.sr, </em>.ras</li>\n<li>TIFF files - <em>.tiff, </em>.tif</li>\n<li>OpenEXR Image files - *.exr</li>\n<li>Radiance HDR - <em>.hdr, </em>.pic</li>\n<li>Raster and Vector geospatial data supported by GDAL<br>\n<cite><a href=\"https://docs.opencv.org/4.x/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56\">OpenCV: Image file reading and writing</a></cite></li>\n</ul>\n</blockquote>\n</details>\nMulti-page images are not supported.\n\n---\n\n### PDF <a name=\"pdf-formats\"></a>\n\nBoth native and scanned PDF files are supported.\n\n## Usage <a name=\"usage\"></a>\n\n### Documents <a name=\"documents\"></a>\n\n#### Images <a name=\"images-doc\"></a>\nImages are instantiated as follows :\n```python\nfrom img2table.document import Image\n\nimage = Image(src, \n detect_rotation=False)\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt>\n> <dd style=\"font-style: italic;\">Image source</dd>\n> 
<dt>detect_rotation : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Detect and correct skew/rotation of the image</dd>\n></dl>\n<br>\nThe implemented method to handle skewed/rotated images supports skew angles up to 45\u00b0 and is\nbased on the publication by <a href=\"https://www.mdpi.com/2079-9292/9/1/55\">Huang, 2020</a>.<br>\nSetting the <code>detect_rotation</code> parameter to <code>True</code>, image coordinates and bounding boxes returned by other \nmethods might not correspond to the original image.\n\n#### PDF <a name=\"pdf-doc\"></a>\nPDF files are instantiated as follows :\n```python\nfrom img2table.document import PDF\n\npdf = PDF(src, \n pages=[0, 2],\n detect_rotation=False,\n pdf_text_extraction=True)\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt>\n> <dd style=\"font-style: italic;\">PDF source</dd>\n> <dt>pages : list, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">List of PDF page indexes to be processed. If None, all pages are processed</dd>\n> <dt>detect_rotation : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Detect and correct skew/rotation of extracted images from the PDF</dd>\n> <dt>pdf_text_extraction : bool, optional, default <code>True</code></dt>\n> <dd style=\"font-style: italic;\">Extract text from the PDF file for native PDFs</dd>\n></dl>\n\nPDF pages are converted to images with a 200 DPI for table identification.\n\n---\n\n### OCR <a name=\"ocr\"></a>\n\n`img2table` provides an interface for several OCR services and tools in order to parse table content.<br>\nIf possible (i.e for native PDF), PDF text will be extracted directly from the file and the OCR service/tool will not be called.\n\n<details>\n<summary>Tesseract<a name=\"tesseract\"></a></summary>\n<br>\n\n```python\nfrom img2table.ocr import TesseractOCR\n\nocr = TesseractOCR(n_threads=1, \n lang=\"eng\", \n psm=11,\n tessdata_dir=\"...\")\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>n_threads : int, optional, default <code>1</code></dt>\n> <dd style=\"font-style: italic;\">Number of concurrent threads used to call Tesseract</dd>\n> <dt>lang : str, optional, default <code>\"eng\"</code></dt>\n> <dd style=\"font-style: italic;\">Lang parameter used in Tesseract for text extraction</dd>\n> <dt>psm : int, optional, default <code>11</code></dt>\n> <dd style=\"font-style: italic;\">PSM parameter used in Tesseract, run <code>tesseract --help-psm</code> for details</dd>\n> <dt>tessdata_dir : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Directory containing Tesseract traineddata files. If None, the <code>TESSDATA_PREFIX</code> env variable is used.</dd>\n></dl>\n\n\n*Usage of [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) requires prior installation. 
\nCheck [documentation](https://tesseract-ocr.github.io/tessdoc/) for instructions.*\n<br>\n*For Windows users getting environment variable errors, you can check this [tutorial](https://linuxhint.com/install-tesseract-windows/)*\n<br>\n</details>\n\n<details>\n<summary>PaddleOCR<a name=\"paddle\"></a></summary>\n<br>\n\n<a href=\"https://github.com/PaddlePaddle/PaddleOCR\">PaddleOCR</a> is an open-source OCR based on Deep Learning models.<br>\nAt first use, relevant languages models will be downloaded.\n\n```python\nfrom img2table.ocr import PaddleOCR\n\nocr = PaddleOCR(lang=\"en\",\n kw={\"kwarg\": kw_value, ...})\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>lang : str, optional, default <code>\"en\"</code></dt>\n> <dd style=\"font-style: italic;\">Lang parameter used in Paddle for text extraction, check <a href=\"https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations\">documentation for available languages</a></dd>\n> <dt>kw : dict, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Dictionary containing additional keyword arguments passed to the PaddleOCR constructor.</dd>\n></dl>\n\n<br>\n<b>NB:</b> For usage of PaddleOCR with GPU, the CUDA specific version of paddlepaddle-gpu has to be installed by the user manually \nas stated in this <a href=\"https://github.com/PaddlePaddle/PaddleOCR/issues/7993\">issue</a>.\n\n```bash\n# Example of installation with CUDA 11.8\npip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html\npip install paddleocr img2table\n```\n\nIf you get an error trying to run PaddleOCR on Ubuntu,\nplease check this <a href=\"https://github.com/PaddlePaddle/PaddleOCR/discussions/9989#discussioncomment-6642037\">issue</a> for a working solution.\n\n<br>\n</details>\n\n\n<details>\n<summary>EasyOCR<a name=\"easyocr\"></a></summary>\n<br>\n\n<a href=\"https://github.com/JaidedAI/EasyOCR\">EasyOCR</a> is an open-source OCR based on Deep Learning models.<br>\nAt first use, relevant languages models will be downloaded.\n\n```python\nfrom img2table.ocr import EasyOCR\n\nocr = EasyOCR(lang=[\"en\"],\n kw={\"kwarg\": kw_value, ...})\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>lang : list, optional, default <code>[\"en\"]</code></dt>\n> <dd style=\"font-style: italic;\">Lang parameter used in EasyOCR for text extraction, check <a href=\"https://www.jaided.ai/easyocr\">documentation for available languages</a></dd>\n> <dt>kw : dict, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Dictionary containing additional keyword arguments passed to the EasyOCR <code>Reader</code> constructor.</dd>\n></dl>\n\n<br>\n</details>\n\n<details>\n<summary>docTR<a name=\"docTR\"></a></summary>\n<br>\n\n<a href=\"https://github.com/mindee/doctr\">docTR</a> is an open-source OCR based on Deep Learning models.<br>\n*In order to be used, docTR has to be installed by the user beforehand. 
Installation procedures are detailed in\nthe package documentation*\n\n```python\nfrom img2table.ocr import DocTR\n\nocr = DocTR(detect_language=False,\n kw={\"kwarg\": kw_value, ...})\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>detect_language : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Parameter indicating if language prediction is run on the document</dd>\n> <dt>kw : dict, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Dictionary containing additional keyword arguments passed to the docTR <code>ocr_predictor</code> method.</dd>\n></dl>\n\n<br>\n</details>\n\n\n<details>\n<summary>Surya OCR<a name=\"surya\"></a></summary>\n<br>\n\n<b><i>Only available for <code>python >= 3.10</code></i></b><br>\n<a href=\"https://github.com/VikParuchuri/surya\">Surya</a> is an open-source OCR based on Deep Learning models.<br>\nAt first use, relevant models will be downloaded.\n\n```python\nfrom img2table.ocr import SuryaOCR\n\nocr = SuryaOCR(langs=[\"en\"])\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>langs : list, optional, default <code>[\"en\"]</code></dt>\n> <dd style=\"font-style: italic;\">Lang parameter used in Surya OCR for text extraction</dd>\n></dl>\n\n<br>\n</details>\n\n\n<details>\n<summary>Google Vision<a name=\"vision\"></a></summary>\n<br>\n\nAuthentication to GCP can be done by setting the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable.<br>\nIf this variable is missing, an API key should be provided via the `api_key` parameter.\n\n```python\nfrom img2table.ocr import VisionOCR\n\nocr = VisionOCR(api_key=\"api_key\", timeout=15)\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>api_key : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Google Vision API key</dd>\n> <dt>timeout : int, optional, default <code>15</code></dt>\n> <dd style=\"font-style: italic;\">API requests timeout, in seconds</dd>\n></dl>\n<br>\n</details>\n\n<details>\n<summary>AWS Textract<a name=\"textract\"></a></summary>\n<br>\n\nWhen using AWS Textract, the DetectDocumentText API is exclusively called.\n\nAuthentication to AWS can be done by passing credentials to the `TextractOCR` class.<br>\nIf credentials are not provided, authentication is done using environment variables or configuration files. 
\nCheck `boto3` [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for more details.\n\n```python\nfrom img2table.ocr import TextractOCR\n\nocr = TextractOCR(aws_access_key_id=\"***\",\n aws_secret_access_key=\"***\",\n aws_session_token=\"***\",\n region=\"eu-west-1\")\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>aws_access_key_id : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">AWS access key id</dd>\n> <dt>aws_secret_access_key : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">AWS secret access key</dd>\n> <dt>aws_session_token : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">AWS temporary session token</dd>\n> <dt>region : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">AWS server region</dd>\n></dl>\n<br>\n</details>\n\n<details>\n<summary>Azure Cognitive Services<a name=\"azure\"></a></summary>\n<br>\n\n```python\nfrom img2table.ocr import AzureOCR\n\nocr = AzureOCR(endpoint=\"abc.azure.com\",\n subscription_key=\"***\")\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>endpoint : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Azure Cognitive Services endpoint. If None, inferred from the <code>COMPUTER_VISION_ENDPOINT</code> environment variable.</dd>\n> <dt>subscription_key : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Azure Cognitive Services subscription key. If None, inferred from the <code>COMPUTER_VISION_SUBSCRIPTION_KEY</code> environment variable.</dd>\n></dl>\n<br>\n</details>\n\n---\n\n### Table extraction <a name=\"table-extract\"></a>\n\nMultiple tables can be extracted at once from a PDF page/ an image using the `extract_tables` method of a document.\n\n```python\nfrom img2table.ocr import TesseractOCR\nfrom img2table.document import Image\n\n# Instantiation of OCR\nocr = TesseractOCR(n_threads=1, lang=\"eng\")\n\n# Instantiation of document, either an image or a PDF\ndoc = Image(src)\n\n# Table extraction\nextracted_tables = doc.extract_tables(ocr=ocr,\n implicit_rows=False,\n implicit_columns=False,\n borderless_tables=False,\n min_confidence=50)\n```\n> <h4>Parameters</h4>\n><dl>\n> <dt>ocr : OCRInstance, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">OCR instance used to parse document text. 
If None, cells content will not be extracted</dd>\n> <dt>implicit_rows : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if implicit rows should be identified - check related <a href=\"/examples/Implicit.ipynb\" target=\"_self\">example</a></dd>\n> <dt>implicit_columns : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if implicit columns should be identified - check related <a href=\"/examples/Implicit.ipynb\" target=\"_self\">example</a></dd>\n> <dt>borderless_tables : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if <a href=\"/examples/borderless.ipynb\" target=\"_self\">borderless tables</a> are extracted <b>on top of</b> bordered tables.</dd>\n> <dt>min_confidence : int, optional, default <code>50</code></dt>\n> <dd style=\"font-style: italic;\">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd>\n></dl>\n\n<b>NB</b>: Borderless table extraction can, by design, only extract tables with 3 or more columns.\n\n#### Method return\n\nThe [`ExtractedTable`](/src/img2table/tables/objects/extraction.py#L35) class is used to model extracted tables from documents.\n\n> <h4>Attributes</h4>\n><dl>\n> <dt>bbox : <code>BBox</code></dt>\n> <dd style=\"font-style: italic;\">Table bounding box</dd>\n> <dt>title : str</dt>\n> <dd style=\"font-style: italic;\">Extracted title of the table</dd>\n> <dt>content : <code>OrderedDict</code></dt>\n> <dd style=\"font-style: italic;\">Dict with row indexes as keys and list of <code>TableCell</code> objects as values</dd>\n> <dt>df : <code>pd.DataFrame</code></dt>\n> <dd style=\"font-style: italic;\">Pandas DataFrame representation of the table</dd>\n> <dt>html : <code>str</code></dt>\n> <dd style=\"font-style: italic;\">HTML representation of the table</dd>\n></dl>\n\n<br>\n\nIn order to access bounding boxes at the cell level, you can use the following code snippet :\n```python\nfor id_row, row in enumerate(table.content.values()):\n for id_col, cell in enumerate(row):\n x1 = cell.bbox.x1\n y1 = cell.bbox.y1\n x2 = cell.bbox.x2\n y2 = cell.bbox.y2\n value = cell.value\n```\n\n<h5 style=\"color:grey\">Images</h5>\n\n`extract_tables` method from the `Image` class returns a list of `ExtractedTable` objects. \n```Python\noutput = [ExtractedTable(...), ExtractedTable(...), ...]\n```\n\n<h5 style=\"color:grey\">PDF</h5>\n\n`extract_tables` method from the `PDF` class returns an `OrderedDict` object with page indexes as keys and lists of `ExtractedTable` objects. \n```Python\noutput = {\n 0: [ExtractedTable(...), ...],\n 1: [],\n ...\n last_page: [ExtractedTable(...), ...]\n}\n```\n\n\n### Excel export <a name=\"xlsx\"></a>\n\nTables extracted from a document can be exported to a xlsx file. 
The resulting file is composed of one worksheet per extracted table.<br>\nMethod arguments are mostly common with the `extract_tables` method.\n\n```python\nfrom img2table.ocr import TesseractOCR\nfrom img2table.document import Image\n\n# Instantiation of OCR\nocr = TesseractOCR(n_threads=1, lang=\"eng\")\n\n# Instantiation of document, either an image or a PDF\ndoc = Image(src)\n\n# Extraction of tables and creation of a xlsx file containing tables\ndoc.to_xlsx(dest=dest,\n ocr=ocr,\n implicit_rows=False,\n implicit_columns=False,\n borderless_tables=False,\n min_confidence=50)\n```\n> <h4>Parameters</h4>\n><dl>\n> <dt>dest : str, <code>pathlib.Path</code> or <code>io.BytesIO</code>, required</dt>\n> <dd style=\"font-style: italic;\">Destination for xlsx file</dd>\n> <dt>ocr : OCRInstance, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">OCR instance used to parse document text. If None, cells content will not be extracted</dd>\n> <dt>implicit_rows : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if implicit rows should be identified - check related <a href=\"/examples/Implicit.ipynb\" target=\"_self\">example</a></dd>\n> <dt>implicit_rows : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if implicit columns should be identified - check related <a href=\"/examples/Implicit.ipynb\" target=\"_self\">example</a></dd>\n> <dt>borderless_tables : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if <a href=\"/examples/borderless.ipynb\" target=\"_self\">borderless tables</a> are extracted. It requires to provide an OCR to the method in order to be performed - <b>feature in alpha version</b></dd>\n> <dt>min_confidence : int, optional, default <code>50</code></dt>\n> <dd style=\"font-style: italic;\">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd>\n></dl>\n> <h4>Returns</h4>\n> If a <code>io.BytesIO</code> buffer is passed as dest arg, it is returned containing xlsx data\n\n\n\n## Examples <a name=\"examples\"></a>\n\nSeveral Jupyter notebooks with examples are available :\n<ul>\n<li>\n<a href=\"/examples/Basic_usage.ipynb\" target=\"_self\">Basic usage</a>: generic library usage, including examples with images, PDF and OCRs\n</li>\n<li>\n<a href=\"/examples/borderless.ipynb\" target=\"_self\">Borderless tables</a>: specific examples dedicated to the extraction of borderless tables\n</li>\n<li>\n<a href=\"/examples/Implicit.ipynb\" target=\"_self\">Implicit content</a>: illustrated effect \nof the parameter <code>implicit_rows</code>/<code>implicit_columns</code> of the <code>extract_tables</code> method\n</li>\n</ul>\n\n## Caveats / FYI <a name=\"fyi\"></a>\n\n<ul>\n<li>\nFor table extraction, results are highly dependent on OCR quality. By design, tables where no OCR data \ncan be found are not returned.\n</li>\n<li>\nThe library is tailored for usage on documents with white/light background. \nEffectiveness can not be guaranteed on other type of documents. \n</li>\n<li>\nTable detection using only OpenCV processing can have some limitations. If the library fails to detect tables, \nyou may check CNN based solutions.\n</li>\n</ul>\n\n\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing",
"version": "1.4.0",
"project_urls": {
"Homepage": "https://github.com/xavctn/img2table"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "de9c98fc3e8b40261bcba1f3c32bad10af557b408f3a39616ccafb7ea2006604",
"md5": "02a0ee0425b4f9bffee543b2e6323b88",
"sha256": "1a1e27b4e8a92aa7bb1eafeaafe41db9106394a124f82c92db1fb600d67331d5"
},
"downloads": -1,
"filename": "img2table-1.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "02a0ee0425b4f9bffee543b2e6323b88",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.8",
"size": 92062,
"upload_time": "2024-11-11T19:07:29",
"upload_time_iso_8601": "2024-11-11T19:07:29.598394Z",
"url": "https://files.pythonhosted.org/packages/de/9c/98fc3e8b40261bcba1f3c32bad10af557b408f3a39616ccafb7ea2006604/img2table-1.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "72840b81a76b61dcaff06624d0924b50ccf33623988136bf6fd02ea9d677de72",
"md5": "5b9cb8944c09680b63f510d4ead99537",
"sha256": "cccf7dea4dd8503f3f1665837baf5c86bb81735ad7daaec5dc5835f36951e79d"
},
"downloads": -1,
"filename": "img2table-1.4.0.tar.gz",
"has_sig": false,
"md5_digest": "5b9cb8944c09680b63f510d4ead99537",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.8",
"size": 3999544,
"upload_time": "2024-11-11T19:07:31",
"upload_time_iso_8601": "2024-11-11T19:07:31.271512Z",
"url": "https://files.pythonhosted.org/packages/72/84/0b81a76b61dcaff06624d0924b50ccf33623988136bf6fd02ea9d677de72/img2table-1.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-11 19:07:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "xavctn",
"github_project": "img2table",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "img2table"
}