| Field | Value |
|---|---|
| Name | img2table |
| Version | 1.4.0 |
| Summary | img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing |
| home_page | https://github.com/xavctn/img2table |
| upload_time | 2024-11-11 19:07:31 |
| maintainer | None |
| docs_url | None |
| author | Xavier Canton |
| requires_python | <3.13,>=3.8 |
| license | MIT |
| keywords | None |
| VCS | |
| bugtrack_url | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# img2table
`img2table` is a simple, easy to use, table identification and extraction Python Library based on [OpenCV](https://opencv.org/) image
processing that supports most common image file formats as well as PDF files.
Thanks to its design, it provides a practical and lighter alternative to neural network based solutions, especially for usage on CPU.
## Table of contents
* [Installation](#installation)
* [Features](#features)
* [Supported file formats](#supported-file-formats)
* [Usage](#usage)
* [Documents](#documents)
* [Images](#images-doc)
* [PDF](#pdf-doc)
* [Supported OCRs](#ocr)
* [Table extraction](#table-extract)
* [Excel export](#xlsx)
* [Examples](#examples)
* [Caveats / FYI](#fyi)
## Installation <a name="installation"></a>
The library can be installed via pip:
> <code>pip install img2table</code>: Standard installation, supporting Tesseract<br>
> <code>pip install img2table[paddle]</code>: For usage with Paddle OCR<br>
> <code>pip install img2table[easyocr]</code>: For usage with EasyOCR<br>
> <code>pip install img2table[surya]</code>: For usage with Surya OCR<br>
> <code>pip install img2table[gcp]</code>: For usage with Google Vision OCR<br>
> <code>pip install img2table[aws]</code>: For usage with AWS Textract OCR<br>
> <code>pip install img2table[azure]</code>: For usage with Azure Cognitive Services OCR
## Features <a name="features"></a>
* Table identification for images and PDF files, including bounding boxes at the table cell level
* Handling of complex table structures such as merged cells
* Handling of implicit content - see [example](/examples/Implicit.ipynb)
* Table content extraction by providing support for OCR services / tools
* Extracted tables are returned as a simple object, including a Pandas DataFrame representation
* Export extracted tables to an Excel file, preserving their original structure
## Supported file formats <a name="supported-file-formats"></a>
### Images <a name="images-formats"></a>
Images are loaded using the `opencv-python` library; supported formats are listed below.
<details>
<summary>Supported image formats</summary>
<br>
<blockquote>
<ul>
<li>Windows bitmaps - *.bmp, *.dib</li>
<li>JPEG files - *.jpeg, *.jpg, *.jpe</li>
<li>JPEG 2000 files - *.jp2</li>
<li>Portable Network Graphics - *.png</li>
<li>WebP - *.webp</li>
<li>Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm</li>
<li>PFM files - *.pfm</li>
<li>Sun rasters - *.sr, *.ras</li>
<li>TIFF files - *.tiff, *.tif</li>
<li>OpenEXR Image files - *.exr</li>
<li>Radiance HDR - *.hdr, *.pic</li>
<li>Raster and Vector geospatial data supported by GDAL<br>
<cite><a href="https://docs.opencv.org/4.x/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56">OpenCV: Image file reading and writing</a></cite></li>
</ul>
</blockquote>
</details>
Multi-page images are not supported.
---
### PDF <a name="pdf-formats"></a>
Both native and scanned PDF files are supported.
## Usage <a name="usage"></a>
### Documents <a name="documents"></a>
#### Images <a name="images-doc"></a>
Images are instantiated as follows:
```python
from img2table.document import Image

image = Image(src,
              detect_rotation=False)
```
> <h4>Parameters</h4>
><dl>
> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt>
> <dd style="font-style: italic;">Image source</dd>
> <dt>detect_rotation : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Detect and correct skew/rotation of the image</dd>
></dl>
<br>
The implemented method for handling skewed/rotated images supports skew angles up to 45° and is
based on the publication by <a href="https://www.mdpi.com/2079-9292/9/1/55">Huang, 2020</a>.<br>
When the <code>detect_rotation</code> parameter is set to <code>True</code>, image coordinates and bounding boxes returned by other
methods might not correspond to the original image.
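As an illustration, here is a minimal sketch of skew handling (the file name is hypothetical):
```python
from img2table.document import Image

# Hypothetical skewed scan; skew/rotation is corrected before table detection
image = Image("skewed_scan.png", detect_rotation=True)

# Without an OCR instance, only table structure is identified (no cell content).
# Returned bounding boxes may refer to the rotation-corrected image rather than
# the original file.
tables = image.extract_tables()
```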
#### PDF <a name="pdf-doc"></a>
PDF files are instantiated as follows:
```python
from img2table.document import PDF

pdf = PDF(src,
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)
```
> <h4>Parameters</h4>
><dl>
> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt>
> <dd style="font-style: italic;">PDF source</dd>
> <dt>pages : list, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">List of PDF page indexes to be processed. If None, all pages are processed</dd>
> <dt>detect_rotation : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Detect and correct skew/rotation of extracted images from the PDF</dd>
> <dt>pdf_text_extraction : bool, optional, default <code>True</code></dt>
> <dd style="font-style: italic;">Extract text from the PDF file for native PDFs</dd>
></dl>
PDF pages are converted to images at 200 DPI for table identification.
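For instance, a minimal sketch of loading a PDF from an in-memory buffer and restricting processing to its first two pages (the file name is hypothetical):
```python
import io

from img2table.document import PDF

# PDF also accepts a path, raw bytes or an io.BytesIO buffer as source
with open("report.pdf", "rb") as f:
    pdf = PDF(io.BytesIO(f.read()),
              pages=[0, 1],            # only the first two pages are processed
              pdf_text_extraction=True)
```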
---
### OCR <a name="ocr"></a>
`img2table` provides an interface for several OCR services and tools in order to parse table content.<br>
If possible (i.e. for native PDFs), PDF text will be extracted directly from the file and the OCR service/tool will not be called.
<details>
<summary>Tesseract<a name="tesseract"></a></summary>
<br>
```python
from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1,
                   lang="eng",
                   psm=11,
                   tessdata_dir="...")
```
> <h4>Parameters</h4>
><dl>
> <dt>n_threads : int, optional, default <code>1</code></dt>
> <dd style="font-style: italic;">Number of concurrent threads used to call Tesseract</dd>
> <dt>lang : str, optional, default <code>"eng"</code></dt>
> <dd style="font-style: italic;">Lang parameter used in Tesseract for text extraction</dd>
> <dt>psm : int, optional, default <code>11</code></dt>
> <dd style="font-style: italic;">PSM parameter used in Tesseract, run <code>tesseract --help-psm</code> for details</dd>
> <dt>tessdata_dir : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Directory containing Tesseract traineddata files. If None, the <code>TESSDATA_PREFIX</code> env variable is used.</dd>
></dl>
*Usage of [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) requires prior installation.
Check [documentation](https://tesseract-ocr.github.io/tessdoc/) for instructions.*
<br>
*For Windows users getting environment variable errors, you can check this [tutorial](https://linuxhint.com/install-tesseract-windows/)*
<br>
</details>
<details>
<summary>PaddleOCR<a name="paddle"></a></summary>
<br>
<a href="https://github.com/PaddlePaddle/PaddleOCR">PaddleOCR</a> is an open-source OCR based on Deep Learning models.<br>
At first use, the relevant language models will be downloaded.
```python
from img2table.ocr import PaddleOCR

ocr = PaddleOCR(lang="en",
                kw={"kwarg": kw_value, ...})
```
> <h4>Parameters</h4>
><dl>
> <dt>lang : str, optional, default <code>"en"</code></dt>
> <dd style="font-style: italic;">Lang parameter used in Paddle for text extraction, check <a href="https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations">documentation for available languages</a></dd>
> <dt>kw : dict, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the PaddleOCR constructor.</dd>
></dl>
<br>
<b>NB:</b> For usage of PaddleOCR with GPU, the CUDA-specific version of paddlepaddle-gpu has to be installed manually by the user,
as stated in this <a href="https://github.com/PaddlePaddle/PaddleOCR/issues/7993">issue</a>.
```bash
# Example of installation with CUDA 11.8
pip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install paddleocr img2table
```
If you get an error trying to run PaddleOCR on Ubuntu,
please check this <a href="https://github.com/PaddlePaddle/PaddleOCR/discussions/9989#discussioncomment-6642037">issue</a> for a working solution.
<br>
</details>
<details>
<summary>EasyOCR<a name="easyocr"></a></summary>
<br>
<a href="https://github.com/JaidedAI/EasyOCR">EasyOCR</a> is an open-source OCR based on Deep Learning models.<br>
At first use, the relevant language models will be downloaded.
```python
from img2table.ocr import EasyOCR

ocr = EasyOCR(lang=["en"],
              kw={"kwarg": kw_value, ...})
```
> <h4>Parameters</h4>
><dl>
> <dt>lang : list, optional, default <code>["en"]</code></dt>
> <dd style="font-style: italic;">Lang parameter used in EasyOCR for text extraction, check <a href="https://www.jaided.ai/easyocr">documentation for available languages</a></dd>
> <dt>kw : dict, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the EasyOCR <code>Reader</code> constructor.</dd>
></dl>
<br>
</details>
<details>
<summary>docTR<a name="docTR"></a></summary>
<br>
<a href="https://github.com/mindee/doctr">docTR</a> is an open-source OCR based on Deep Learning models.<br>
*In order to be used, docTR has to be installed by the user beforehand. Installation procedures are detailed in
the package documentation*
```python
from img2table.ocr import DocTR

ocr = DocTR(detect_language=False,
            kw={"kwarg": kw_value, ...})
```
> <h4>Parameters</h4>
><dl>
> <dt>detect_language : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Parameter indicating if language prediction is run on the document</dd>
> <dt>kw : dict, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the docTR <code>ocr_predictor</code> method.</dd>
></dl>
<br>
</details>
<details>
<summary>Surya OCR<a name="surya"></a></summary>
<br>
<b><i>Only available for <code>python >= 3.10</code></i></b><br>
<a href="https://github.com/VikParuchuri/surya">Surya</a> is an open-source OCR based on Deep Learning models.<br>
At first use, relevant models will be downloaded.
```python
from img2table.ocr import SuryaOCR
ocr = SuryaOCR(langs=["en"])
```
> <h4>Parameters</h4>
><dl>
> <dt>langs : list, optional, default <code>["en"]</code></dt>
> <dd style="font-style: italic;">Lang parameter used in Surya OCR for text extraction</dd>
></dl>
<br>
</details>
<details>
<summary>Google Vision<a name="vision"></a></summary>
<br>
Authentication to GCP can be done by setting the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable.<br>
If this variable is missing, an API key should be provided via the `api_key` parameter.
```python
from img2table.ocr import VisionOCR
ocr = VisionOCR(api_key="api_key", timeout=15)
```
> <h4>Parameters</h4>
><dl>
> <dt>api_key : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Google Vision API key</dd>
> <dt>timeout : int, optional, default <code>15</code></dt>
> <dd style="font-style: italic;">API requests timeout, in seconds</dd>
></dl>
<br>
</details>
<details>
<summary>AWS Textract<a name="textract"></a></summary>
<br>
When using AWS Textract, only the DetectDocumentText API is called.
Authentication to AWS can be done by passing credentials to the `TextractOCR` class.<br>
If credentials are not provided, authentication is done using environment variables or configuration files.
Check `boto3` [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for more details.
```python
from img2table.ocr import TextractOCR

ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")
```
> <h4>Parameters</h4>
><dl>
> <dt>aws_access_key_id : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">AWS access key id</dd>
> <dt>aws_secret_access_key : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">AWS secret access key</dd>
> <dt>aws_session_token : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">AWS temporary session token</dd>
> <dt>region : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">AWS server region</dd>
></dl>
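As a sketch, when relying on the default `boto3` credential chain (environment variables, shared credentials file or instance role), only the region needs to be passed explicitly:
```python
from img2table.ocr import TextractOCR

# Credentials are resolved by boto3 (environment variables, config files, IAM role)
ocr = TextractOCR(region="eu-west-1")
```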
<br>
</details>
<details>
<summary>Azure Cognitive Services<a name="azure"></a></summary>
<br>
```python
from img2table.ocr import AzureOCR

ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")
```
> <h4>Parameters</h4>
><dl>
> <dt>endpoint : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Azure Cognitive Services endpoint. If None, inferred from the <code>COMPUTER_VISION_ENDPOINT</code> environment variable.</dd>
> <dt>subscription_key : str, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">Azure Cognitive Services subscription key. If None, inferred from the <code>COMPUTER_VISION_SUBSCRIPTION_KEY</code> environment variable.</dd>
></dl>
<br>
</details>
---
### Table extraction <a name="table-extract"></a>
Multiple tables can be extracted at once from a PDF page or an image using the `extract_tables` method of a document.
```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=False,
                                      implicit_columns=False,
                                      borderless_tables=False,
                                      min_confidence=50)
```
> <h4>Parameters</h4>
><dl>
> <dt>ocr : OCRInstance, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">OCR instance used to parse document text. If None, cells content will not be extracted</dd>
> <dt>implicit_rows : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if implicit rows should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd>
> <dt>implicit_columns : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if implicit columns should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd>
> <dt>borderless_tables : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if <a href="/examples/borderless.ipynb" target="_self">borderless tables</a> are extracted <b>on top of</b> bordered tables.</dd>
> <dt>min_confidence : int, optional, default <code>50</code></dt>
> <dd style="font-style: italic;">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd>
></dl>
<b>NB</b>: Borderless table extraction can, by design, only extract tables with 3 or more columns.
#### Method return
The [`ExtractedTable`](/src/img2table/tables/objects/extraction.py#L35) class is used to model extracted tables from documents.
> <h4>Attributes</h4>
><dl>
> <dt>bbox : <code>BBox</code></dt>
> <dd style="font-style: italic;">Table bounding box</dd>
> <dt>title : str</dt>
> <dd style="font-style: italic;">Extracted title of the table</dd>
> <dt>content : <code>OrderedDict</code></dt>
> <dd style="font-style: italic;">Dict with row indexes as keys and list of <code>TableCell</code> objects as values</dd>
> <dt>df : <code>pd.DataFrame</code></dt>
> <dd style="font-style: italic;">Pandas DataFrame representation of the table</dd>
> <dt>html : <code>str</code></dt>
> <dd style="font-style: italic;">HTML representation of the table</dd>
></dl>
<br>
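As an example, assuming `table` is one of the extracted tables, its attributes can be used as follows (a minimal sketch):
```python
# Title and bounding box of the table within the document
print(table.title)
print(table.bbox.x1, table.bbox.y1, table.bbox.x2, table.bbox.y2)

# Pandas DataFrame and HTML representations of the table content
df = table.df
html_string = table.html
```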
In order to access bounding boxes at the cell level, you can use the following code snippet:
```python
for id_row, row in enumerate(table.content.values()):
    for id_col, cell in enumerate(row):
        x1 = cell.bbox.x1
        y1 = cell.bbox.y1
        x2 = cell.bbox.x2
        y2 = cell.bbox.y2
        value = cell.value
```
<h5 style="color:grey">Images</h5>
The `extract_tables` method of the `Image` class returns a list of `ExtractedTable` objects.
```Python
output = [ExtractedTable(...), ExtractedTable(...), ...]
```
<h5 style="color:grey">PDF</h5>
The `extract_tables` method of the `PDF` class returns an `OrderedDict` with page indexes as keys and lists of `ExtractedTable` objects as values.
```Python
output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
```
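For example, a minimal sketch of iterating over the result of a PDF extraction, page by page:
```python
# extracted_tables is the OrderedDict returned by PDF.extract_tables
for page_idx, page_tables in extracted_tables.items():
    for table in page_tables:
        print(f"Page {page_idx} - {table.title}")
        print(table.df)
```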
### Excel export <a name="xlsx"></a>
Tables extracted from a document can be exported to an xlsx file. The resulting file is composed of one worksheet per extracted table.<br>
Method arguments are mostly shared with the `extract_tables` method.
```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Extraction of tables and creation of a xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=False,
            implicit_columns=False,
            borderless_tables=False,
            min_confidence=50)
```
> <h4>Parameters</h4>
><dl>
> <dt>dest : str, <code>pathlib.Path</code> or <code>io.BytesIO</code>, required</dt>
> <dd style="font-style: italic;">Destination for xlsx file</dd>
> <dt>ocr : OCRInstance, optional, default <code>None</code></dt>
> <dd style="font-style: italic;">OCR instance used to parse document text. If None, cells content will not be extracted</dd>
> <dt>implicit_rows : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if implicit rows should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd>
> <dt>implicit_columns : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if implicit columns should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd>
> <dt>borderless_tables : bool, optional, default <code>False</code></dt>
> <dd style="font-style: italic;">Boolean indicating if <a href="/examples/borderless.ipynb" target="_self">borderless tables</a> are extracted. An OCR instance must be provided to the method for this to be performed - <b>feature in alpha version</b></dd>
> <dt>min_confidence : int, optional, default <code>50</code></dt>
> <dd style="font-style: italic;">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd>
></dl>
> <h4>Returns</h4>
> If an <code>io.BytesIO</code> buffer is passed as the <code>dest</code> argument, it is returned containing the xlsx data
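For instance, a minimal sketch of exporting to an in-memory buffer instead of a file on disk (reusing the `doc` and `ocr` objects from the snippet above):
```python
import io

# The same BytesIO buffer is returned, filled with the xlsx data
buffer = doc.to_xlsx(dest=io.BytesIO(), ocr=ocr)

with open("extracted_tables.xlsx", "wb") as f:
    f.write(buffer.getvalue())
```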
## Examples <a name="examples"></a>
Several Jupyter notebooks with examples are available:
<ul>
<li>
<a href="/examples/Basic_usage.ipynb" target="_self">Basic usage</a>: generic library usage, including examples with images, PDF and OCRs
</li>
<li>
<a href="/examples/borderless.ipynb" target="_self">Borderless tables</a>: specific examples dedicated to the extraction of borderless tables
</li>
<li>
<a href="/examples/Implicit.ipynb" target="_self">Implicit content</a>: illustrated effect
of the parameter <code>implicit_rows</code>/<code>implicit_columns</code> of the <code>extract_tables</code> method
</li>
</ul>
## Caveats / FYI <a name="fyi"></a>
<ul>
<li>
For table extraction, results are highly dependent on OCR quality. By design, tables where no OCR data
can be found are not returned.
</li>
<li>
The library is tailored for usage on documents with white/light backgrounds.
Effectiveness cannot be guaranteed on other types of documents.
</li>
<li>
Table detection using only OpenCV processing can have some limitations. If the library fails to detect tables,
you may want to check CNN-based solutions.
</li>
</ul>
Raw data
{
"_id": null,
"home_page": "https://github.com/xavctn/img2table",
"name": "img2table",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Xavier Canton",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/72/84/0b81a76b61dcaff06624d0924b50ccf33623988136bf6fd02ea9d677de72/img2table-1.4.0.tar.gz",
"platform": null,
"description": "# img2table\n\n`img2table` is a simple, easy to use, table identification and extraction Python Library based on [OpenCV](https://opencv.org/) image \nprocessing that supports most common image file formats as well as PDF files.\n\nThanks to its design, it provides a practical and lighter alternative to Neural Networks based solutions, especially for usage on CPU.\n\n## Table of contents\n* [Installation](#installation)\n* [Features](#features)\n* [Supported file formats](#supported-file-formats)\n* [Usage](#usage)\n * [Documents](#documents)\n * [Images](#images-doc)\n * [PDF](#pdf-doc)\n * [Supported OCRs](#ocr)\n * [Table extraction](#table-extract)\n * [Excel export](#xlsx)\n* [Examples](#examples)\n* [Caveats / FYI](#fyi)\n\n\n## Installation <a name=\"installation\"></a>\nThe library can be installed via pip:\n\n> <code>pip install img2table</code>: Standard installation, supporting Tesseract<br>\n> <code>pip install img2table[paddle]</code>: For usage with Paddle OCR<br>\n> <code>pip install img2table[easyocr]</code>: For usage with EasyOCR<br>\n> <code>pip install img2table[surya]</code>: For usage with Surya OCR<br>\n> <code>pip install img2table[gcp]</code>: For usage with Google Vision OCR<br>\n> <code>pip install img2table[aws]</code>: For usage with AWS Textract OCR<br>\n> <code>pip install img2table[azure]</code>: For usage with Azure Cognitive Services OCR\n\n## Features <a name=\"features\"></a>\n\n* Table identification for images and PDF files, including bounding boxes at the table cell level\n* Handling of complex table structures such as merged cells\n* Handling of implicit content - see [example](/examples/Implicit.ipynb)\n* Table content extraction by providing support for OCR services / tools\n* Extracted tables are returned as a simple object, including a Pandas DataFrame representation\n* Export extracted tables to an Excel file, preserving their original structure\n\n## Supported file formats <a name=\"supported-file-formats\"></a>\n\n### Images <a name=\"images-formats\"></a>\n\nImages are loaded using the `opencv-python` library, supported formats are listed below.\n\n<details>\n<summary>Supported image formats</summary>\n<br>\n\n<blockquote>\n<ul>\n<li>Windows bitmaps - <em>.bmp, </em>.dib</li>\n<li>JPEG files - <em>.jpeg, </em>.jpg, *.jpe</li>\n<li>JPEG 2000 files - *.jp2</li>\n<li>Portable Network Graphics - *.png</li>\n<li>WebP - *.webp</li>\n<li>Portable image format - <em>.pbm, </em>.pgm, <em>.ppm </em>.pxm, *.pnm</li>\n<li>PFM files - *.pfm</li>\n<li>Sun rasters - <em>.sr, </em>.ras</li>\n<li>TIFF files - <em>.tiff, </em>.tif</li>\n<li>OpenEXR Image files - *.exr</li>\n<li>Radiance HDR - <em>.hdr, </em>.pic</li>\n<li>Raster and Vector geospatial data supported by GDAL<br>\n<cite><a href=\"https://docs.opencv.org/4.x/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56\">OpenCV: Image file reading and writing</a></cite></li>\n</ul>\n</blockquote>\n</details>\nMulti-page images are not supported.\n\n---\n\n### PDF <a name=\"pdf-formats\"></a>\n\nBoth native and scanned PDF files are supported.\n\n## Usage <a name=\"usage\"></a>\n\n### Documents <a name=\"documents\"></a>\n\n#### Images <a name=\"images-doc\"></a>\nImages are instantiated as follows :\n```python\nfrom img2table.document import Image\n\nimage = Image(src, \n detect_rotation=False)\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt>\n> <dd style=\"font-style: italic;\">Image source</dd>\n> 
<dt>detect_rotation : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Detect and correct skew/rotation of the image</dd>\n></dl>\n<br>\nThe implemented method to handle skewed/rotated images supports skew angles up to 45\u00b0 and is\nbased on the publication by <a href=\"https://www.mdpi.com/2079-9292/9/1/55\">Huang, 2020</a>.<br>\nSetting the <code>detect_rotation</code> parameter to <code>True</code>, image coordinates and bounding boxes returned by other \nmethods might not correspond to the original image.\n\n#### PDF <a name=\"pdf-doc\"></a>\nPDF files are instantiated as follows :\n```python\nfrom img2table.document import PDF\n\npdf = PDF(src, \n pages=[0, 2],\n detect_rotation=False,\n pdf_text_extraction=True)\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt>\n> <dd style=\"font-style: italic;\">PDF source</dd>\n> <dt>pages : list, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">List of PDF page indexes to be processed. If None, all pages are processed</dd>\n> <dt>detect_rotation : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Detect and correct skew/rotation of extracted images from the PDF</dd>\n> <dt>pdf_text_extraction : bool, optional, default <code>True</code></dt>\n> <dd style=\"font-style: italic;\">Extract text from the PDF file for native PDFs</dd>\n></dl>\n\nPDF pages are converted to images with a 200 DPI for table identification.\n\n---\n\n### OCR <a name=\"ocr\"></a>\n\n`img2table` provides an interface for several OCR services and tools in order to parse table content.<br>\nIf possible (i.e for native PDF), PDF text will be extracted directly from the file and the OCR service/tool will not be called.\n\n<details>\n<summary>Tesseract<a name=\"tesseract\"></a></summary>\n<br>\n\n```python\nfrom img2table.ocr import TesseractOCR\n\nocr = TesseractOCR(n_threads=1, \n lang=\"eng\", \n psm=11,\n tessdata_dir=\"...\")\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>n_threads : int, optional, default <code>1</code></dt>\n> <dd style=\"font-style: italic;\">Number of concurrent threads used to call Tesseract</dd>\n> <dt>lang : str, optional, default <code>\"eng\"</code></dt>\n> <dd style=\"font-style: italic;\">Lang parameter used in Tesseract for text extraction</dd>\n> <dt>psm : int, optional, default <code>11</code></dt>\n> <dd style=\"font-style: italic;\">PSM parameter used in Tesseract, run <code>tesseract --help-psm</code> for details</dd>\n> <dt>tessdata_dir : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Directory containing Tesseract traineddata files. If None, the <code>TESSDATA_PREFIX</code> env variable is used.</dd>\n></dl>\n\n\n*Usage of [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) requires prior installation. 
\nCheck [documentation](https://tesseract-ocr.github.io/tessdoc/) for instructions.*\n<br>\n*For Windows users getting environment variable errors, you can check this [tutorial](https://linuxhint.com/install-tesseract-windows/)*\n<br>\n</details>\n\n<details>\n<summary>PaddleOCR<a name=\"paddle\"></a></summary>\n<br>\n\n<a href=\"https://github.com/PaddlePaddle/PaddleOCR\">PaddleOCR</a> is an open-source OCR based on Deep Learning models.<br>\nAt first use, relevant languages models will be downloaded.\n\n```python\nfrom img2table.ocr import PaddleOCR\n\nocr = PaddleOCR(lang=\"en\",\n kw={\"kwarg\": kw_value, ...})\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>lang : str, optional, default <code>\"en\"</code></dt>\n> <dd style=\"font-style: italic;\">Lang parameter used in Paddle for text extraction, check <a href=\"https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations\">documentation for available languages</a></dd>\n> <dt>kw : dict, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Dictionary containing additional keyword arguments passed to the PaddleOCR constructor.</dd>\n></dl>\n\n<br>\n<b>NB:</b> For usage of PaddleOCR with GPU, the CUDA specific version of paddlepaddle-gpu has to be installed by the user manually \nas stated in this <a href=\"https://github.com/PaddlePaddle/PaddleOCR/issues/7993\">issue</a>.\n\n```bash\n# Example of installation with CUDA 11.8\npip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html\npip install paddleocr img2table\n```\n\nIf you get an error trying to run PaddleOCR on Ubuntu,\nplease check this <a href=\"https://github.com/PaddlePaddle/PaddleOCR/discussions/9989#discussioncomment-6642037\">issue</a> for a working solution.\n\n<br>\n</details>\n\n\n<details>\n<summary>EasyOCR<a name=\"easyocr\"></a></summary>\n<br>\n\n<a href=\"https://github.com/JaidedAI/EasyOCR\">EasyOCR</a> is an open-source OCR based on Deep Learning models.<br>\nAt first use, relevant languages models will be downloaded.\n\n```python\nfrom img2table.ocr import EasyOCR\n\nocr = EasyOCR(lang=[\"en\"],\n kw={\"kwarg\": kw_value, ...})\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>lang : list, optional, default <code>[\"en\"]</code></dt>\n> <dd style=\"font-style: italic;\">Lang parameter used in EasyOCR for text extraction, check <a href=\"https://www.jaided.ai/easyocr\">documentation for available languages</a></dd>\n> <dt>kw : dict, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Dictionary containing additional keyword arguments passed to the EasyOCR <code>Reader</code> constructor.</dd>\n></dl>\n\n<br>\n</details>\n\n<details>\n<summary>docTR<a name=\"docTR\"></a></summary>\n<br>\n\n<a href=\"https://github.com/mindee/doctr\">docTR</a> is an open-source OCR based on Deep Learning models.<br>\n*In order to be used, docTR has to be installed by the user beforehand. 
Installation procedures are detailed in\nthe package documentation*\n\n```python\nfrom img2table.ocr import DocTR\n\nocr = DocTR(detect_language=False,\n kw={\"kwarg\": kw_value, ...})\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>detect_language : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Parameter indicating if language prediction is run on the document</dd>\n> <dt>kw : dict, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Dictionary containing additional keyword arguments passed to the docTR <code>ocr_predictor</code> method.</dd>\n></dl>\n\n<br>\n</details>\n\n\n<details>\n<summary>Surya OCR<a name=\"surya\"></a></summary>\n<br>\n\n<b><i>Only available for <code>python >= 3.10</code></i></b><br>\n<a href=\"https://github.com/VikParuchuri/surya\">Surya</a> is an open-source OCR based on Deep Learning models.<br>\nAt first use, relevant models will be downloaded.\n\n```python\nfrom img2table.ocr import SuryaOCR\n\nocr = SuryaOCR(langs=[\"en\"])\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>langs : list, optional, default <code>[\"en\"]</code></dt>\n> <dd style=\"font-style: italic;\">Lang parameter used in Surya OCR for text extraction</dd>\n></dl>\n\n<br>\n</details>\n\n\n<details>\n<summary>Google Vision<a name=\"vision\"></a></summary>\n<br>\n\nAuthentication to GCP can be done by setting the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable.<br>\nIf this variable is missing, an API key should be provided via the `api_key` parameter.\n\n```python\nfrom img2table.ocr import VisionOCR\n\nocr = VisionOCR(api_key=\"api_key\", timeout=15)\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>api_key : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Google Vision API key</dd>\n> <dt>timeout : int, optional, default <code>15</code></dt>\n> <dd style=\"font-style: italic;\">API requests timeout, in seconds</dd>\n></dl>\n<br>\n</details>\n\n<details>\n<summary>AWS Textract<a name=\"textract\"></a></summary>\n<br>\n\nWhen using AWS Textract, the DetectDocumentText API is exclusively called.\n\nAuthentication to AWS can be done by passing credentials to the `TextractOCR` class.<br>\nIf credentials are not provided, authentication is done using environment variables or configuration files. 
\nCheck `boto3` [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for more details.\n\n```python\nfrom img2table.ocr import TextractOCR\n\nocr = TextractOCR(aws_access_key_id=\"***\",\n aws_secret_access_key=\"***\",\n aws_session_token=\"***\",\n region=\"eu-west-1\")\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>aws_access_key_id : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">AWS access key id</dd>\n> <dt>aws_secret_access_key : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">AWS secret access key</dd>\n> <dt>aws_session_token : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">AWS temporary session token</dd>\n> <dt>region : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">AWS server region</dd>\n></dl>\n<br>\n</details>\n\n<details>\n<summary>Azure Cognitive Services<a name=\"azure\"></a></summary>\n<br>\n\n```python\nfrom img2table.ocr import AzureOCR\n\nocr = AzureOCR(endpoint=\"abc.azure.com\",\n subscription_key=\"***\")\n```\n\n> <h4>Parameters</h4>\n><dl>\n> <dt>endpoint : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Azure Cognitive Services endpoint. If None, inferred from the <code>COMPUTER_VISION_ENDPOINT</code> environment variable.</dd>\n> <dt>subscription_key : str, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">Azure Cognitive Services subscription key. If None, inferred from the <code>COMPUTER_VISION_SUBSCRIPTION_KEY</code> environment variable.</dd>\n></dl>\n<br>\n</details>\n\n---\n\n### Table extraction <a name=\"table-extract\"></a>\n\nMultiple tables can be extracted at once from a PDF page/ an image using the `extract_tables` method of a document.\n\n```python\nfrom img2table.ocr import TesseractOCR\nfrom img2table.document import Image\n\n# Instantiation of OCR\nocr = TesseractOCR(n_threads=1, lang=\"eng\")\n\n# Instantiation of document, either an image or a PDF\ndoc = Image(src)\n\n# Table extraction\nextracted_tables = doc.extract_tables(ocr=ocr,\n implicit_rows=False,\n implicit_columns=False,\n borderless_tables=False,\n min_confidence=50)\n```\n> <h4>Parameters</h4>\n><dl>\n> <dt>ocr : OCRInstance, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">OCR instance used to parse document text. 
If None, cells content will not be extracted</dd>\n> <dt>implicit_rows : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if implicit rows should be identified - check related <a href=\"/examples/Implicit.ipynb\" target=\"_self\">example</a></dd>\n> <dt>implicit_columns : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if implicit columns should be identified - check related <a href=\"/examples/Implicit.ipynb\" target=\"_self\">example</a></dd>\n> <dt>borderless_tables : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if <a href=\"/examples/borderless.ipynb\" target=\"_self\">borderless tables</a> are extracted <b>on top of</b> bordered tables.</dd>\n> <dt>min_confidence : int, optional, default <code>50</code></dt>\n> <dd style=\"font-style: italic;\">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd>\n></dl>\n\n<b>NB</b>: Borderless table extraction can, by design, only extract tables with 3 or more columns.\n\n#### Method return\n\nThe [`ExtractedTable`](/src/img2table/tables/objects/extraction.py#L35) class is used to model extracted tables from documents.\n\n> <h4>Attributes</h4>\n><dl>\n> <dt>bbox : <code>BBox</code></dt>\n> <dd style=\"font-style: italic;\">Table bounding box</dd>\n> <dt>title : str</dt>\n> <dd style=\"font-style: italic;\">Extracted title of the table</dd>\n> <dt>content : <code>OrderedDict</code></dt>\n> <dd style=\"font-style: italic;\">Dict with row indexes as keys and list of <code>TableCell</code> objects as values</dd>\n> <dt>df : <code>pd.DataFrame</code></dt>\n> <dd style=\"font-style: italic;\">Pandas DataFrame representation of the table</dd>\n> <dt>html : <code>str</code></dt>\n> <dd style=\"font-style: italic;\">HTML representation of the table</dd>\n></dl>\n\n<br>\n\nIn order to access bounding boxes at the cell level, you can use the following code snippet :\n```python\nfor id_row, row in enumerate(table.content.values()):\n for id_col, cell in enumerate(row):\n x1 = cell.bbox.x1\n y1 = cell.bbox.y1\n x2 = cell.bbox.x2\n y2 = cell.bbox.y2\n value = cell.value\n```\n\n<h5 style=\"color:grey\">Images</h5>\n\n`extract_tables` method from the `Image` class returns a list of `ExtractedTable` objects. \n```Python\noutput = [ExtractedTable(...), ExtractedTable(...), ...]\n```\n\n<h5 style=\"color:grey\">PDF</h5>\n\n`extract_tables` method from the `PDF` class returns an `OrderedDict` object with page indexes as keys and lists of `ExtractedTable` objects. \n```Python\noutput = {\n 0: [ExtractedTable(...), ...],\n 1: [],\n ...\n last_page: [ExtractedTable(...), ...]\n}\n```\n\n\n### Excel export <a name=\"xlsx\"></a>\n\nTables extracted from a document can be exported to a xlsx file. 
The resulting file is composed of one worksheet per extracted table.<br>\nMethod arguments are mostly common with the `extract_tables` method.\n\n```python\nfrom img2table.ocr import TesseractOCR\nfrom img2table.document import Image\n\n# Instantiation of OCR\nocr = TesseractOCR(n_threads=1, lang=\"eng\")\n\n# Instantiation of document, either an image or a PDF\ndoc = Image(src)\n\n# Extraction of tables and creation of a xlsx file containing tables\ndoc.to_xlsx(dest=dest,\n ocr=ocr,\n implicit_rows=False,\n implicit_columns=False,\n borderless_tables=False,\n min_confidence=50)\n```\n> <h4>Parameters</h4>\n><dl>\n> <dt>dest : str, <code>pathlib.Path</code> or <code>io.BytesIO</code>, required</dt>\n> <dd style=\"font-style: italic;\">Destination for xlsx file</dd>\n> <dt>ocr : OCRInstance, optional, default <code>None</code></dt>\n> <dd style=\"font-style: italic;\">OCR instance used to parse document text. If None, cells content will not be extracted</dd>\n> <dt>implicit_rows : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if implicit rows should be identified - check related <a href=\"/examples/Implicit.ipynb\" target=\"_self\">example</a></dd>\n> <dt>implicit_rows : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if implicit columns should be identified - check related <a href=\"/examples/Implicit.ipynb\" target=\"_self\">example</a></dd>\n> <dt>borderless_tables : bool, optional, default <code>False</code></dt>\n> <dd style=\"font-style: italic;\">Boolean indicating if <a href=\"/examples/borderless.ipynb\" target=\"_self\">borderless tables</a> are extracted. It requires to provide an OCR to the method in order to be performed - <b>feature in alpha version</b></dd>\n> <dt>min_confidence : int, optional, default <code>50</code></dt>\n> <dd style=\"font-style: italic;\">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd>\n></dl>\n> <h4>Returns</h4>\n> If a <code>io.BytesIO</code> buffer is passed as dest arg, it is returned containing xlsx data\n\n\n\n## Examples <a name=\"examples\"></a>\n\nSeveral Jupyter notebooks with examples are available :\n<ul>\n<li>\n<a href=\"/examples/Basic_usage.ipynb\" target=\"_self\">Basic usage</a>: generic library usage, including examples with images, PDF and OCRs\n</li>\n<li>\n<a href=\"/examples/borderless.ipynb\" target=\"_self\">Borderless tables</a>: specific examples dedicated to the extraction of borderless tables\n</li>\n<li>\n<a href=\"/examples/Implicit.ipynb\" target=\"_self\">Implicit content</a>: illustrated effect \nof the parameter <code>implicit_rows</code>/<code>implicit_columns</code> of the <code>extract_tables</code> method\n</li>\n</ul>\n\n## Caveats / FYI <a name=\"fyi\"></a>\n\n<ul>\n<li>\nFor table extraction, results are highly dependent on OCR quality. By design, tables where no OCR data \ncan be found are not returned.\n</li>\n<li>\nThe library is tailored for usage on documents with white/light background. \nEffectiveness can not be guaranteed on other type of documents. \n</li>\n<li>\nTable detection using only OpenCV processing can have some limitations. If the library fails to detect tables, \nyou may check CNN based solutions.\n</li>\n</ul>\n\n\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing",
"version": "1.4.0",
"project_urls": {
"Homepage": "https://github.com/xavctn/img2table"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "de9c98fc3e8b40261bcba1f3c32bad10af557b408f3a39616ccafb7ea2006604",
"md5": "02a0ee0425b4f9bffee543b2e6323b88",
"sha256": "1a1e27b4e8a92aa7bb1eafeaafe41db9106394a124f82c92db1fb600d67331d5"
},
"downloads": -1,
"filename": "img2table-1.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "02a0ee0425b4f9bffee543b2e6323b88",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.8",
"size": 92062,
"upload_time": "2024-11-11T19:07:29",
"upload_time_iso_8601": "2024-11-11T19:07:29.598394Z",
"url": "https://files.pythonhosted.org/packages/de/9c/98fc3e8b40261bcba1f3c32bad10af557b408f3a39616ccafb7ea2006604/img2table-1.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "72840b81a76b61dcaff06624d0924b50ccf33623988136bf6fd02ea9d677de72",
"md5": "5b9cb8944c09680b63f510d4ead99537",
"sha256": "cccf7dea4dd8503f3f1665837baf5c86bb81735ad7daaec5dc5835f36951e79d"
},
"downloads": -1,
"filename": "img2table-1.4.0.tar.gz",
"has_sig": false,
"md5_digest": "5b9cb8944c09680b63f510d4ead99537",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.8",
"size": 3999544,
"upload_time": "2024-11-11T19:07:31",
"upload_time_iso_8601": "2024-11-11T19:07:31.271512Z",
"url": "https://files.pythonhosted.org/packages/72/84/0b81a76b61dcaff06624d0924b50ccf33623988136bf6fd02ea9d677de72/img2table-1.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-11 19:07:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "xavctn",
"github_project": "img2table",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "img2table"
}