multitessiocr


Name: multitessiocr
Version: 0.13
Home page: https://github.com/hansalemaos/multitessiocr
Summary: Performs a very fast OCR on a list of images (file path, url, base64, bytes, numpy, PIL ...) using Tesseract and returns the recognized text, its coordinates, and line-based word grouping in a DataFrame.
Upload time: 2023-11-14 16:19:05
Author: Johannes Fischer
License: MIT
Keywords: tesseract, ocr, grouping
Requirements: a_cv_imwrite_imread_plus, a_pandas_ex_apply_ignore_exceptions, lxml, multiprocca, numpy, opencv_python, pandas, touchtouch

# Performs very fast OCR on a list of images (file path, URL, base64, bytes, NumPy array, PIL image, ...) using Tesseract and returns the recognized text, its coordinates, and line-based word grouping in a DataFrame.

## Tested against Windows 10 / Python 3.11 / Anaconda

### pip install multitessiocr


```python
from multitessiocr import tesser_ocr

piclist = [
    r"C:\screeeni\35.png",
    r"C:\Users\hansc\Downloads\2023-11-12 00_48_43-.png",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 12.03.41 PM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.46 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.33 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 11.06.22 AM.jpeg",
    r"C:\Users\hansc\Downloads\WhatsApp Image 2023-09-19 at 12.03.42 PM.jpeg",
]

df = tesser_ocr(
    piclist=piclist,
    tesser_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    add_after_tesseract_path="",
    add_at_the_end="-l eng+por --psm 3",
    processes=5,
    chunks=5,
    print_stdout=False,
    print_stderr=True,
)

# Excerpt of the returned DataFrame (three word rows):
#         aa_text  aa_start_x  aa_start_y  aa_end_x  aa_end_y aa_object aa_type  aa_element_index  aa_page  aa_index aa_language aa_parents aa_all_children aa_direct_children aa_tag  aa_x_size  aa_x_descenders  aa_x_ascenders  aa_x_wconf  aa_baseline_1  aa_baseline_2  aa_document_index  aa_width  aa_height  aa_area  aa_center_x  aa_center_y
# 7  Ameta-markup         802         318       922       335      word    word                 1        1         3        <NA>     (4, 5)              ()                 ()   span       <NA>             <NA>            <NA>        77.0           <NA>           <NA>                  0       120         17     2040          862          326
# 8     language,         933         321      1001       335      word    word                 1        1         4        <NA>     (4, 5)              ()                 ()   span       <NA>             <NA>            <NA>        96.0           <NA>           <NA>                  0        68         14      952          967          328
# 9          used        1014         318      1050       331      word    word                 1        1         5        <NA>     (4, 5)              ()                 ()   span       <NA>             <NA>            <NA>        96.0           <NA>           <NA>                  0        36         13      468         1032          324


# tesser_ocr: perform OCR on a list of images using Tesseract.
#
# Parameters:
# - piclist (list): Images to process (file paths, URLs, base64 strings, bytes, NumPy arrays, PIL images, ...).
# - tesser_path (str): Path to the Tesseract executable.
# - add_after_tesseract_path (str): Additional arguments placed directly after the Tesseract path.
# - add_at_the_end (str): Additional arguments appended to the end of the Tesseract command
#   (in the call above: the languages via -l and the page segmentation mode via --psm).
# - processes (int): Number of parallel processes used for image processing.
# - chunks (int): Number of chunks the image list is divided into for parallel processing.
# - print_stdout (bool): Whether to print standard output during execution.
# - print_stderr (bool): Whether to print standard error during execution.
#
# Returns:
# - pd.DataFrame: A DataFrame containing the parsed OCR results.

```
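The summary mentions line-based word grouping, so here is a minimal sketch of how the word rows might be joined back into lines of text. It works on plain dictionaries (what pandas' `df.to_dict("records")` would produce) so it runs without Tesseract installed. The helper `words_to_lines` and its `min_conf` parameter are hypothetical and not part of multitessiocr. It also assumes that words on the same line share the same `aa_parents` value, which is inferred only from the three sample rows above.

```python
from collections import defaultdict

# Three word rows copied from the sample output above (only the columns used here).
rows = [
    {"aa_text": "Ameta-markup", "aa_start_x": 802, "aa_type": "word",
     "aa_parents": (4, 5), "aa_x_wconf": 77.0},
    {"aa_text": "language,", "aa_start_x": 933, "aa_type": "word",
     "aa_parents": (4, 5), "aa_x_wconf": 96.0},
    {"aa_text": "used", "aa_start_x": 1014, "aa_type": "word",
     "aa_parents": (4, 5), "aa_x_wconf": 96.0},
]


def words_to_lines(records, min_conf=0.0):
    """Join word texts that share an aa_parents value, ordered left to right.

    Hypothetical helper: assumes aa_parents identifies the enclosing line.
    """
    grouped = defaultdict(list)
    for rec in records:
        if rec["aa_type"] != "word":
            continue  # skip non-word elements such as lines or paragraphs
        if rec["aa_x_wconf"] < min_conf:
            continue  # drop words below the confidence threshold
        grouped[rec["aa_parents"]].append(rec)
    return {
        parents: " ".join(
            r["aa_text"] for r in sorted(words, key=lambda r: r["aa_start_x"])
        )
        for parents, words in grouped.items()
    }


all_lines = words_to_lines(rows)
confident_lines = words_to_lines(rows, min_conf=80)
print(all_lines)        # {(4, 5): 'Ameta-markup language, used'}
print(confident_lines)  # {(4, 5): 'language, used'}
```

On a real result, rows whose `aa_x_wconf` is `<NA>` would need to be filtered out or filled before the comparison.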

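The geometry columns in the sample output (`aa_width`, `aa_height`, `aa_area`, `aa_center_x`, `aa_center_y`) appear to be derived from the four bounding-box coordinates. The sketch below reproduces them for the three sample rows. The function `box_metrics` is hypothetical, and the use of integer floor division for the centre is inferred from those rows rather than taken from the library's source.

```python
def box_metrics(start_x, start_y, end_x, end_y):
    """Recompute the derived geometry columns from a bounding box.

    Hypothetical helper; the formulas are inferred from the sample output.
    """
    width = end_x - start_x
    height = end_y - start_y
    return {
        "aa_width": width,
        "aa_height": height,
        "aa_area": width * height,
        # Integer centre: start coordinate plus half the size, rounded down.
        "aa_center_x": start_x + width // 2,
        "aa_center_y": start_y + height // 2,
    }


# Bounding boxes (start_x, start_y, end_x, end_y) of the three sample rows.
sample_boxes = [
    (802, 318, 922, 335),
    (933, 321, 1001, 335),
    (1014, 318, 1050, 331),
]

for box in sample_boxes:
    print(box, box_metrics(*box))
```

For the first row this yields a width of 120, a height of 17, an area of 2040 and a centre of (862, 326), matching the values printed in the sample output.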
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/hansalemaos/multitessiocr",
    "name": "multitessiocr",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "tesseract,ocr,grouping",
    "author": "Johannes Fischer",
    "author_email": "aulasparticularesdealemaosp@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/db/20/1e750eebcadb3a80646a33832a9ad2773b5cb30a7b726a98b41123ce46a0/multitessiocr-0.13.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Performs a very fast OCR on a list of images (file path, url, base64, bytes, numpy, PIL ...) using Tesseract and returns the recognized text, its coordinates, and line-based word grouping in a DataFrame.",
    "version": "0.13",
    "project_urls": {
        "Homepage": "https://github.com/hansalemaos/multitessiocr"
    },
    "split_keywords": [
        "tesseract",
        "ocr",
        "grouping"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "942e11515e8a705c386ef4635badbbcf850dac9089da4a708ca6eea1463045d2",
                "md5": "f6b99018ab62d98f35d22b639f7ad291",
                "sha256": "a1284aa512dfef8d66bd73b5ce2ba9ea37c35f82eb55302de550d86192d3aa5e"
            },
            "downloads": -1,
            "filename": "multitessiocr-0.13-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f6b99018ab62d98f35d22b639f7ad291",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 61183,
            "upload_time": "2023-11-14T16:19:03",
            "upload_time_iso_8601": "2023-11-14T16:19:03.149770Z",
            "url": "https://files.pythonhosted.org/packages/94/2e/11515e8a705c386ef4635badbbcf850dac9089da4a708ca6eea1463045d2/multitessiocr-0.13-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "db201e750eebcadb3a80646a33832a9ad2773b5cb30a7b726a98b41123ce46a0",
                "md5": "220bbba3cc12d3736bc2e7bbdf1da0d3",
                "sha256": "0fe74557a31497fd257383e5f8b62a8aef0f231b9b3293db25b8eecd729c7cc2"
            },
            "downloads": -1,
            "filename": "multitessiocr-0.13.tar.gz",
            "has_sig": false,
            "md5_digest": "220bbba3cc12d3736bc2e7bbdf1da0d3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 60750,
            "upload_time": "2023-11-14T16:19:05",
            "upload_time_iso_8601": "2023-11-14T16:19:05.215473Z",
            "url": "https://files.pythonhosted.org/packages/db/20/1e750eebcadb3a80646a33832a9ad2773b5cb30a7b726a98b41123ce46a0/multitessiocr-0.13.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-14 16:19:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hansalemaos",
    "github_project": "multitessiocr",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "a_cv_imwrite_imread_plus",
            "specs": []
        },
        {
            "name": "a_pandas_ex_apply_ignore_exceptions",
            "specs": []
        },
        {
            "name": "lxml",
            "specs": []
        },
        {
            "name": "multiprocca",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "opencv_python",
            "specs": []
        },
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "touchtouch",
            "specs": []
        }
    ],
    "lcname": "multitessiocr"
}
        