# Runs fast, parallel Tesseract OCR on a list of images (file path, URL, base64, bytes, NumPy, PIL ...) and returns the recognized text, its coordinates, and line-based word grouping in a DataFrame.
## Tested against Windows 10 / Python 3.11 / Anaconda
### pip install multitessiocr
```python
from multitessiocr import tesser_ocr

# Images to process (file paths in this example).
piclist = [
    r"C:\screeeni\35.png",
    r"C:\Users\hansc\Downloads\2023-11-12 00_48_43-.png",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 12.03.41 PM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.46 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.33 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.22 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 12.03.42 PM.jpeg",
]

df = tesser_ocr(
    piclist=piclist,
    tesser_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    add_after_tesseract_path="",
    # Tesseract options: English + Portuguese, automatic page segmentation.
    add_at_the_end="-l eng+por --psm 3",
    processes=5,
    chunks=5,
    print_stdout=False,
    print_stderr=True,
)

# Sample of the resulting DataFrame:
# aa_text aa_start_x aa_start_y aa_end_x aa_end_y aa_object aa_type aa_element_index aa_page aa_index aa_language aa_parents aa_all_children aa_direct_children aa_tag aa_x_size aa_x_descenders aa_x_ascenders aa_x_wconf aa_baseline_1 aa_baseline_2 aa_document_index aa_width aa_height aa_area aa_center_x aa_center_y
# 7 Ameta-markup 802 318 922 335 word word 1 1 3 <NA> (4, 5) () () span <NA> <NA> <NA> 77.0 <NA> <NA> 0 120 17 2040 862 326
# 8 language, 933 321 1001 335 word word 1 1 4 <NA> (4, 5) () () span <NA> <NA> <NA> 96.0 <NA> <NA> 0 68 14 952 967 328
# 9 used 1014 318 1050 331 word word 1 1 5 <NA> (4, 5) () () span <NA> <NA> <NA> 96.0 <NA> <NA> 0 36 13 468 1032 324
```

`tesser_ocr` performs OCR on a list of images using Tesseract.

Parameters:
- piclist (list): Images to process. The example above passes file paths; according to the package summary, URLs, base64 strings, bytes, NumPy arrays, and PIL images are accepted as well.
- tesser_path (str): Path to the Tesseract executable.
- add_after_tesseract_path (str): Additional parameters to insert directly after the Tesseract path.
- add_at_the_end (str): Additional parameters to append to the end of the Tesseract command (for example `-l eng+por --psm 3`).
- processes (int): Number of parallel processes used for image processing.
- chunks (int): Number of chunks the image list is divided into for parallel processing.
- print_stdout (bool): Whether to print standard output during execution.
- print_stderr (bool): Whether to print standard error during execution.

Returns:
- pd.DataFrame: A DataFrame containing the parsed OCR results.
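The returned DataFrame can be post-processed with plain pandas. The sketch below is an illustration only: it rebuilds a few columns of the three sample rows shown above as a stand-in for a real `tesser_ocr` result, keeps words whose `aa_x_wconf` is at least 90, and joins words into lines. It assumes that words sharing the same `aa_parents` value within one `aa_document_index` belong to the same line, which the sample output suggests but the README does not state. The threshold and the helper column `line_key` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by tesser_ocr,
# copied from the three sample rows above (subset of columns).
df = pd.DataFrame(
    {
        "aa_text": ["Ameta-markup", "language,", "used"],
        "aa_start_x": [802, 933, 1014],
        "aa_start_y": [318, 321, 318],
        "aa_type": ["word", "word", "word"],
        "aa_parents": [(4, 5), (4, 5), (4, 5)],
        "aa_x_wconf": [77.0, 96.0, 96.0],
        "aa_document_index": [0, 0, 0],
    }
)

# Keep only word rows.
words = df[df["aa_type"] == "word"]

# Words recognized with a confidence of at least 90 (arbitrary threshold).
confident = words[words["aa_x_wconf"] >= 90]

# Assumption: words with identical aa_parents in the same document form one line.
# The tuple is converted to a string so it can be used as a simple grouping key.
words = words.assign(line_key=words["aa_parents"].astype(str))
lines = (
    words.sort_values("aa_start_x")  # left-to-right order within each line
    .groupby(["aa_document_index", "line_key"])["aa_text"]
    .agg(" ".join)
)

print(confident["aa_text"].tolist())
print(lines.tolist())
```

With a real result, the first block would simply be replaced by the `df` returned from `tesser_ocr`.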
Raw data
{
"_id": null,
"home_page": "https://github.com/hansalemaos/multitessiocr",
"name": "multitessiocr",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "tesseract,ocr,grouping",
"author": "Johannes Fischer",
"author_email": "aulasparticularesdealemaosp@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/db/20/1e750eebcadb3a80646a33832a9ad2773b5cb30a7b726a98b41123ce46a0/multitessiocr-0.13.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Performs a very fast OCR on a list of images (file path, url, base64, bytes, numpy, PIL ...) using Tesseract and returns the recognized text, its coordinates, and line-based word grouping in a DataFrame.",
"version": "0.13",
"project_urls": {
"Homepage": "https://github.com/hansalemaos/multitessiocr"
},
"split_keywords": [
"tesseract",
"ocr",
"grouping"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "942e11515e8a705c386ef4635badbbcf850dac9089da4a708ca6eea1463045d2",
"md5": "f6b99018ab62d98f35d22b639f7ad291",
"sha256": "a1284aa512dfef8d66bd73b5ce2ba9ea37c35f82eb55302de550d86192d3aa5e"
},
"downloads": -1,
"filename": "multitessiocr-0.13-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f6b99018ab62d98f35d22b639f7ad291",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 61183,
"upload_time": "2023-11-14T16:19:03",
"upload_time_iso_8601": "2023-11-14T16:19:03.149770Z",
"url": "https://files.pythonhosted.org/packages/94/2e/11515e8a705c386ef4635badbbcf850dac9089da4a708ca6eea1463045d2/multitessiocr-0.13-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "db201e750eebcadb3a80646a33832a9ad2773b5cb30a7b726a98b41123ce46a0",
"md5": "220bbba3cc12d3736bc2e7bbdf1da0d3",
"sha256": "0fe74557a31497fd257383e5f8b62a8aef0f231b9b3293db25b8eecd729c7cc2"
},
"downloads": -1,
"filename": "multitessiocr-0.13.tar.gz",
"has_sig": false,
"md5_digest": "220bbba3cc12d3736bc2e7bbdf1da0d3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 60750,
"upload_time": "2023-11-14T16:19:05",
"upload_time_iso_8601": "2023-11-14T16:19:05.215473Z",
"url": "https://files.pythonhosted.org/packages/db/20/1e750eebcadb3a80646a33832a9ad2773b5cb30a7b726a98b41123ce46a0/multitessiocr-0.13.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-14 16:19:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "hansalemaos",
"github_project": "multitessiocr",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "a_cv_imwrite_imread_plus",
"specs": []
},
{
"name": "a_pandas_ex_apply_ignore_exceptions",
"specs": []
},
{
"name": "lxml",
"specs": []
},
{
"name": "multiprocca",
"specs": []
},
{
"name": "numpy",
"specs": []
},
{
"name": "opencv_python",
"specs": []
},
{
"name": "pandas",
"specs": []
},
{
"name": "touchtouch",
"specs": []
}
],
"lcname": "multitessiocr"
}