tesserparsing


| Field | Value |
| --- | --- |
| Name | tesserparsing |
| Version | 0.10 |
| Home page | https://github.com/hansalemaos/tesserparsing |
| Summary | Image Processing and Text Extraction with Tesseract - multiprocessing |
| Upload time | 2023-11-12 01:15:05 |
| Author | Johannes Fischer |
| License | MIT |
| Keywords | multiprocessing, tesseract |
| Requirements | No requirements were recorded. |
            
# Image Processing and Text Extraction with Tesseract - multiprocessing

## pip install tesserparsing


## Tested against Python 3.11 / Windows 10


This module provides functions for image processing and text extraction using Tesseract OCR. It includes the following functionality:

1. **get_short_path_name(long_name):**
	- Retrieves the short path name for a given long path name. This is primarily relevant on Windows.
	- Uses the `ctypes` library to call the Windows API function `GetShortPathNameW`.

2. **parse_tesseract:**
	- Runs Tesseract OCR on a list of images concurrently.
	- Distributes the work across processes using `start_multiprocessing` and `MultiProcExecution`.
	- Handles caching, subprocess execution, and result formatting.
	- Returns a pandas DataFrame containing the structured OCR results.

3. **_parse_tesseract:**
	- Internal worker that runs Tesseract OCR on a single image; it is the unit that is executed in parallel.
	- Converts the image data to PNG format and invokes Tesseract as a subprocess.
	- Returns the standard output and standard error of the subprocess.

Example usage: the following extracts text from a folder of PNG images with `parse_tesseract` and shows the resulting pandas DataFrame.

```python
from tesserparsing import parse_tesseract
from list_all_files_recursively import get_folder_file_complete_path  # optional

folder = r"C:\testfolderall"
# Keep only the PNG files.
piclist = [
    x.path for x in get_folder_file_complete_path(folder) if x.ext.lower() == ".png"
]
language = "por"  # Tesseract language code (Portuguese)
tesser_path = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
df = parse_tesseract(
    piclist,
    language,
    tesser_path,
    tesser_args=(),
    usecache=True,
    processes=5,
    chunks=1,
    print_stdout=False,
    print_stderr=True,
)

# df
# Out[3]:
#       aa_level  aa_page_num  aa_block_num  ...  aa_end_x  aa_end_y  aa_area
# 0            1            1             0  ...      1600       720  1152000
# 1            2            1             1  ...      1570        43    27684
# 2            3            1             1  ...      1570        43    27684
# 3            4            1             1  ...      1570        43    27684
# 4            5            1             1  ...       100        43     1156
#         ...          ...           ...  ...       ...       ...      ...
# 5685         4            1             1  ...       130        44     1515
# 5686         5            1             1  ...       130        44     1515
# 5687         1            1             0  ...       115        27     3105
# 5688         1            1             0  ...       112        21     2352
# 5689         1            1             0  ...        81       105     8505
# [5690 rows x 20 columns]
```

Note: This module requires the Tesseract OCR executable to be installed on the system, along with the trained data for the language you pass (`por` in the example above). Ensure the necessary dependencies, such as pandas, are installed before using these functions.
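
The short-path lookup described for `get_short_path_name` can be sketched as follows. This is a minimal illustration, not the package's own code: the name `short_path_name` is hypothetical, and the sketch assumes the usual two-call pattern for `GetShortPathNameW` (the first call only asks for the required buffer size). On non-Windows systems it simply returns its input.

```python
import ctypes
import os


def short_path_name(long_name: str) -> str:
    """Return the Windows short form of long_name, or long_name itself when
    no short form can be obtained. Hypothetical helper for illustration."""
    if os.name != "nt":
        # GetShortPathNameW is part of the Windows API only.
        return long_name

    from ctypes import wintypes

    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    get_short.restype = wintypes.DWORD

    # With an empty buffer, the call returns the required size in characters,
    # including the terminating null.
    needed = get_short(long_name, None, 0)
    if needed == 0:
        # The call failed, for example because the path does not exist.
        return long_name

    buffer = ctypes.create_unicode_buffer(needed)
    written = get_short(long_name, buffer, needed)
    return buffer.value if written else long_name
```

A short path contains no spaces, which can make it easier to pass to external programs.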

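
The single-image step described for `_parse_tesseract` (hand PNG data to a Tesseract subprocess and collect its standard output and standard error) might look roughly like this. The sketch relies on Tesseract's command-line conventions of `stdin` and `stdout` as input and output names and the `tsv` config file; the function names `build_tesseract_command` and `ocr_png_bytes` are hypothetical, and the image is assumed to be PNG-encoded already, so the conversion step is left out.

```python
import subprocess


def build_tesseract_command(tesser_path, language, tesser_args=()):
    """Build an argument list that reads an image from stdin and writes TSV to stdout."""
    return [tesser_path, "stdin", "stdout", "-l", language, *tesser_args, "tsv"]


def ocr_png_bytes(png_bytes, tesser_path, language, tesser_args=()):
    """Run Tesseract on in-memory PNG data and return (stdout, stderr) as bytes."""
    completed = subprocess.run(
        build_tesseract_command(tesser_path, language, tesser_args),
        input=png_bytes,
        capture_output=True,
        check=False,  # do not raise on failure; the caller inspects stderr instead
    )
    return completed.stdout, completed.stderr
```

Passing the arguments as a list avoids shell quoting problems with paths such as `C:\Program Files\Tesseract-OCR\tesseract.exe`.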
            
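
Processing a list of images concurrently, as `parse_tesseract` is described as doing, can be approximated with the standard library. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` for the package's `start_multiprocessing` and `MultiProcExecution`, whose interfaces are not shown here; `run_concurrently` is a hypothetical name. Threads are a reasonable stand-in under the assumption that each task mostly waits on an external Tesseract process.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(worker, items, max_workers=5):
    """Apply worker to every item using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of items, not in completion order.
        return list(pool.map(worker, items))
```

Any callable that takes one item works as `worker`, for example a function that reads an image file and runs OCR on it.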

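
The structured result can be illustrated by parsing Tesseract's TSV output, whose columns are `level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, and `text`. The sketch below uses only the standard library and adds three derived columns. Their definitions are an assumption inferred from the sample output above (for instance, 1600 × 720 = 1152000 in the first row), not taken from the package's source; `parse_tsv` is a hypothetical name.

```python
import csv
import io

INT_FIELDS = (
    "level", "page_num", "block_num", "par_num", "line_num",
    "word_num", "left", "top", "width", "height",
)


def parse_tsv(tsv_text):
    """Parse Tesseract TSV output into a list of dicts with derived box columns."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # recognized text may contain quote characters
    )
    rows = []
    for raw in reader:
        row = {name: int(raw[name]) for name in INT_FIELDS}
        row["conf"] = float(raw["conf"])
        row["text"] = raw["text"] or ""  # structural rows carry no text
        # Derived columns (assumed definitions): right edge, bottom edge, box area.
        row["end_x"] = row["left"] + row["width"]
        row["end_y"] = row["top"] + row["height"]
        row["area"] = row["width"] * row["height"]
        rows.append(row)
    return rows
```

The resulting list can be passed to `pandas.DataFrame(rows)` to obtain a table similar in spirit to the one shown in the usage example.
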
Raw data

            {
    "_id": null,
    "home_page": "https://github.com/hansalemaos/tesserparsing",
    "name": "tesserparsing",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "multiprocessing,tesseract",
    "author": "Johannes Fischer",
    "author_email": "aulasparticularesdealemaosp@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/8b/43/8ce584dfff3f5926df22093265a34f3c61c12f649928bf4ec9e2c0e15553/tesserparsing-0.10.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Image Processing and Text Extraction with Tesseract - multiprocessing",
    "version": "0.10",
    "project_urls": {
        "Homepage": "https://github.com/hansalemaos/tesserparsing"
    },
    "split_keywords": [
        "multiprocessing",
        "tesseract"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "60b63c9782c67c978983d8d47c119432ef9b2f8df23c1e25a3d302208eda0aa7",
                "md5": "6cb60d8038387905e8e71942a9fd553e",
                "sha256": "5d1f23b86537c6741cb84dfad462ad63306f9b9efedb51e1185c939b21d25872"
            },
            "downloads": -1,
            "filename": "tesserparsing-0.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6cb60d8038387905e8e71942a9fd553e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 44767,
            "upload_time": "2023-11-12T01:15:02",
            "upload_time_iso_8601": "2023-11-12T01:15:02.999942Z",
            "url": "https://files.pythonhosted.org/packages/60/b6/3c9782c67c978983d8d47c119432ef9b2f8df23c1e25a3d302208eda0aa7/tesserparsing-0.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8b438ce584dfff3f5926df22093265a34f3c61c12f649928bf4ec9e2c0e15553",
                "md5": "2b3fbd6c51653d7960b1619334f8539d",
                "sha256": "4d8bdee022f559201f34e509ead10d7d5bd576237fb406ef238bdc72eae379e4"
            },
            "downloads": -1,
            "filename": "tesserparsing-0.10.tar.gz",
            "has_sig": false,
            "md5_digest": "2b3fbd6c51653d7960b1619334f8539d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 43913,
            "upload_time": "2023-11-12T01:15:05",
            "upload_time_iso_8601": "2023-11-12T01:15:05.087831Z",
            "url": "https://files.pythonhosted.org/packages/8b/43/8ce584dfff3f5926df22093265a34f3c61c12f649928bf4ec9e2c0e15553/tesserparsing-0.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-12 01:15:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hansalemaos",
    "github_project": "tesserparsing",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "tesserparsing"
}
        