# Image Processing and Text Extraction with Tesseract - multiprocessing
## pip install tesserparsing
## Tested against Python 3.11 / Windows 10
This module provides functions for image processing and text extraction using Tesseract OCR. It includes the following functionality:

1. **get_short_path_name(long_name):**
   - Retrieves the short path name for a given long file name, primarily on Windows.
   - Uses the `ctypes` library to call the `GetShortPathNameW` function.
2. **parse_tesseract:**
   - Uses Tesseract OCR to extract text from a list of images concurrently.
   - Supports multiprocessing with the `start_multiprocessing` and `MultiProcExecution` classes.
   - Handles caching, subprocess execution, and result formatting.
   - Returns a pandas DataFrame containing structured OCR results.
3. **_parse_tesseract:**
   - Internal function that runs Tesseract OCR on a single image as part of the parallel execution.
   - Converts the image data to PNG format and invokes the Tesseract subprocess.
   - Returns the standard output and standard error of the subprocess.

Usage:

The example below uses `parse_tesseract` to extract text from a folder of PNG images and prints the resulting pandas DataFrame of structured OCR results.

```python
from tesserparsing import parse_tesseract
from list_all_files_recursively import get_folder_file_complete_path  # optional

folder = r"C:\testfolderall"
piclist = [
    x.path for x in get_folder_file_complete_path(folder) if x.ext.lower() == ".png"
]
language = "por"
tesser_path = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
df = parse_tesseract(
    piclist,
    language,
    tesser_path,
    tesser_args=(),
    usecache=True,
    processes=5,
    chunks=1,
    print_stdout=False,
    print_stderr=True,
)

# df
# Out[3]:
#       aa_level  aa_page_num  aa_block_num  ...  aa_end_x  aa_end_y  aa_area
# 0            1            1             0  ...      1600       720  1152000
# 1            2            1             1  ...      1570        43    27684
# 2            3            1             1  ...      1570        43    27684
# 3            4            1             1  ...      1570        43    27684
# 4            5            1             1  ...       100        43     1156
# ...        ...          ...           ...  ...       ...       ...      ...
# 5685         4            1             1  ...       130        44     1515
# 5686         5            1             1  ...       130        44     1515
# 5687         1            1             0  ...       115        27     3105
# 5688         1            1             0  ...       112        21     2352
# 5689         1            1             0  ...        81       105     8505
#
# [5690 rows x 20 columns]
```

Note:

This module requires the Tesseract OCR executable to be installed on the system. Ensure the necessary dependencies are installed before using these functions.
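To make the `get_short_path_name` description more concrete, here is a minimal sketch of how `GetShortPathNameW` can be called through `ctypes`. It is not the package's own code: the function name `get_short_path_name_sketch`, the initial buffer size, and the fallback on non-Windows systems are assumptions made for illustration.

```python
import ctypes
import os


def get_short_path_name_sketch(long_name: str) -> str:
    """Return the Windows short form of a path (hypothetical helper).

    On other platforms the input is returned unchanged, because
    GetShortPathNameW only exists in the Windows API.
    """
    if os.name != "nt":
        return long_name

    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint32]
    get_short.restype = ctypes.c_uint32

    size = 260  # assumed starting size (MAX_PATH characters)
    while True:
        buffer = ctypes.create_unicode_buffer(size)
        needed = get_short(long_name, buffer, size)
        if needed == 0:
            # The call failed, for example because the path does not exist.
            raise ctypes.WinError()
        if needed < size:
            # Success: the return value is the length without the terminator.
            return buffer.value
        # Buffer too small: the return value is the required size, so retry.
        size = needed + 1
```

Short path names contain no spaces, which is one reason they can be convenient when a file path has to be passed to an external executable.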
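The description says that each image is handed to a Tesseract subprocess and that many images are processed concurrently. The sketch below shows that pattern in a simplified form. It deliberately swaps in a thread pool from the standard library instead of the package's `start_multiprocessing` and `MultiProcExecution` classes, and it passes file paths straight to Tesseract rather than converting image data to PNG first. The names `ocr_one`, `ocr_many`, `workers`, and `run_one` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ocr_one(image_path, language, tesser_path):
    """Run Tesseract on one image and return (path, stdout, stderr) as bytes."""
    completed = subprocess.run(
        # "stdout" as the output base prints the result instead of writing a file;
        # the trailing "tsv" config asks for tab-separated, structured output.
        [tesser_path, image_path, "stdout", "-l", language, "tsv"],
        capture_output=True,
    )
    return image_path, completed.stdout, completed.stderr


def ocr_many(image_paths, language, tesser_path, workers=5, run_one=ocr_one):
    """Process several images concurrently, keeping the input order.

    Threads are enough for this sketch because the heavy work happens
    in the external Tesseract processes, not in Python.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_one, path, language, tesser_path)
            for path in image_paths
        ]
        return [future.result() for future in futures]
```

The `run_one` parameter exists only so the fan-out logic can be tried without Tesseract installed, for example by passing a stub that returns fixed bytes.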
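The sample output above ends with the columns `aa_end_x`, `aa_end_y`, and `aa_area`, which are not part of Tesseract's own TSV output. Row 0 of the sample (1600, 720, 1152000) is consistent with `end_x = left + width`, `end_y = top + height`, and `area = width * height`, but that reading is an inference from the numbers, not documented behavior. The sketch below parses Tesseract TSV text under that assumption. It uses plain dictionaries instead of a pandas DataFrame to stay dependency-free, and `parse_tsv_sketch` is a hypothetical name.

```python
import csv
import io

# Integer columns that Tesseract writes in its TSV output.
INT_COLUMNS = (
    "level", "page_num", "block_num", "par_num", "line_num", "word_num",
    "left", "top", "width", "height",
)


def parse_tsv_sketch(tsv_text):
    """Turn Tesseract TSV text into a list of dicts with derived box columns."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # recognized text may contain quote characters
    )
    rows = []
    for record in reader:
        row = {name: int(record[name]) for name in INT_COLUMNS}
        row["conf"] = float(record["conf"])
        row["text"] = record.get("text") or ""
        # Assumed derivations, inferred from the sample output only.
        row["end_x"] = row["left"] + row["width"]
        row["end_y"] = row["top"] + row["height"]
        row["area"] = row["width"] * row["height"]
        rows.append(row)
    return rows
```

If pandas is available, `pandas.DataFrame(rows)` turns the result into a table similar in shape to the one shown in the usage example.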
Raw data
{
"_id": null,
"home_page": "https://github.com/hansalemaos/tesserparsing",
"name": "tesserparsing",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "multiprocessing,tesseract",
"author": "Johannes Fischer",
"author_email": "aulasparticularesdealemaosp@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/8b/43/8ce584dfff3f5926df22093265a34f3c61c12f649928bf4ec9e2c0e15553/tesserparsing-0.10.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Image Processing and Text Extraction with Tesseract - multiprocessing",
"version": "0.10",
"project_urls": {
"Homepage": "https://github.com/hansalemaos/tesserparsing"
},
"split_keywords": [
"multiprocessing",
"tesseract"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "60b63c9782c67c978983d8d47c119432ef9b2f8df23c1e25a3d302208eda0aa7",
"md5": "6cb60d8038387905e8e71942a9fd553e",
"sha256": "5d1f23b86537c6741cb84dfad462ad63306f9b9efedb51e1185c939b21d25872"
},
"downloads": -1,
"filename": "tesserparsing-0.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6cb60d8038387905e8e71942a9fd553e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 44767,
"upload_time": "2023-11-12T01:15:02",
"upload_time_iso_8601": "2023-11-12T01:15:02.999942Z",
"url": "https://files.pythonhosted.org/packages/60/b6/3c9782c67c978983d8d47c119432ef9b2f8df23c1e25a3d302208eda0aa7/tesserparsing-0.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8b438ce584dfff3f5926df22093265a34f3c61c12f649928bf4ec9e2c0e15553",
"md5": "2b3fbd6c51653d7960b1619334f8539d",
"sha256": "4d8bdee022f559201f34e509ead10d7d5bd576237fb406ef238bdc72eae379e4"
},
"downloads": -1,
"filename": "tesserparsing-0.10.tar.gz",
"has_sig": false,
"md5_digest": "2b3fbd6c51653d7960b1619334f8539d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 43913,
"upload_time": "2023-11-12T01:15:05",
"upload_time_iso_8601": "2023-11-12T01:15:05.087831Z",
"url": "https://files.pythonhosted.org/packages/8b/43/8ce584dfff3f5926df22093265a34f3c61c12f649928bf4ec9e2c0e15553/tesserparsing-0.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-12 01:15:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "hansalemaos",
"github_project": "tesserparsing",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "tesserparsing"
}