t-ocr

:Name: t-ocr
:Version: 1.5.3
:Home page: https://www.thoughtful.ai/
:Summary: Thoughtful OCR Package
:Upload time: 2024-09-18 12:40:01
:Author: Thoughtful
:Requires Python: >=3.8
:Keywords: thoughtful-ocr, t-ocr, t_ocr

T-OCR Library
--------------

T-OCR is a Python library for fast, simple OCR parsing with several OCR tools. It reads scanned PDF documents and images that contain text.

|

Key Features
------------

1. Files are parsed in the AWS cloud, so no additional dependencies are needed on your machine.
2. Pages are parsed in parallel in AWS, which makes parsing significantly faster.
3. Using the library requires access to the AWS side of T-OCR.
4. Several OCR tools are combined behind a consistent interface, making it easy to switch between them.

    .. code-block:: python

         # Parsing with Tesseract (Free OCR)
         free_ocr = FreeOCR()
         pages = free_ocr.read_pdf("1.pdf")
         for page in pages:
             print(page.full_text)

    .. code-block:: python

         # Parsing with Textract
         textract = Textract()
         pages = textract.read_pdf("1.pdf")
         for page in pages:
             print(page.full_text)

5. The package includes helper functions for working with parsed text, such as extracting dates and fuzzy-matching expected strings against OCR output.

    .. code-block:: python

         # Extract dates from text
         date_matches = get_valid_dates("Text - February 17, 2009, text...")  # 02/17/2009

         # Compare the real text with what the OCR read
         does_strings_match_by_fuzzy("Apple", "Appla", percentage=0.8)  # True
         does_text_contains_string_by_fuzzy("Apple", "Appla watch", percentage=0.8) # True

         text = "Long article about Appla watch"
         find_string_in_text_by_fuzzy("Apple", text, percentage=0.8)  # 19

6. OCR results can be cached locally in the ``t_ocr_cache`` folder. Use this during development to avoid waiting for parsing on every run.

    .. code-block:: python

         pdf_file = "1.pdf"
         # First run
         free_ocr.read_pdf(pdf_file, cache_data=True) # Takes 15 seconds
         # Every subsequent run
         free_ocr.read_pdf(pdf_file, cache_data=True) # Takes 0.01 seconds

         # A change to the file or to the arguments gets its own cache entry
         free_ocr.read_pdf(pdf_file, dpi=500, cache_data=True) # Takes 15 seconds
         free_ocr.read_pdf(pdf_file, dpi=500, cache_data=True) # Takes 0.01 seconds
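
The caching rule in item 6 (a changed file or changed arguments produces a separate cache entry) can be pictured as a lookup keyed on both. The block below is a minimal sketch of that idea and not the library's actual implementation: the ``cache_key`` helper, the use of SHA-256 over the file bytes, and the JSON encoding of the arguments are all assumptions made for illustration.

.. code-block:: python

    import hashlib
    import json
    import tempfile
    from pathlib import Path


    def cache_key(path, **kwargs):
        """Hypothetical helper: key built from file bytes plus keyword arguments."""
        digest = hashlib.sha256()
        digest.update(Path(path).read_bytes())
        # sort_keys makes the key independent of keyword-argument order.
        digest.update(json.dumps(kwargs, sort_keys=True).encode("utf-8"))
        return digest.hexdigest()


    with tempfile.TemporaryDirectory() as folder:
        pdf = Path(folder) / "1.pdf"
        pdf.write_bytes(b"first version")
        plain = cache_key(pdf)
        assert plain == cache_key(pdf)           # same file, same arguments
        assert plain != cache_key(pdf, dpi=500)  # new argument, separate entry
        pdf.write_bytes(b"second version")
        assert plain != cache_key(pdf)           # changed file, separate entry

Hashing the file contents rather than the file name means that replacing ``1.pdf`` with a different document under the same name would not return stale results in this sketch.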

|
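
The fuzzy helpers shown in item 5 compare an expected string with imperfect OCR output using a similarity threshold. How the library computes that similarity is not documented here, so the following is only a sketch under assumptions: it uses ``difflib.SequenceMatcher`` from the standard library, compares case-insensitively, and searches word by word. The names ``similarity``, ``strings_match`` and ``find_word`` are hypothetical and are not part of T-OCR.

.. code-block:: python

    import re
    from difflib import SequenceMatcher


    def similarity(a, b):
        """Case-insensitive similarity ratio between 0.0 and 1.0."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()


    def strings_match(expected, actual, threshold=0.8):
        """True when the two strings are at least ``threshold`` similar."""
        return similarity(expected, actual) >= threshold


    def find_word(expected, text, threshold=0.8):
        """Start index of the first sufficiently similar word, or -1."""
        for match in re.finditer(r"\S+", text):
            if similarity(expected, match.group()) >= threshold:
                return match.start()
        return -1


    print(strings_match("Apple", "Appla"))                        # True
    print(find_word("Apple", "Long article about Appla watch"))   # 19

Searching word by word keeps the sketch short, but it only handles single-word targets; a multi-word phrase would need a sliding window over the text instead.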

Supported OCR Tools
-------------------

**1. Tesseract (Free OCR)**

    - 🆓 Free
    - ✅ Read text
    - ❌ Read tables, vertical field values
    - ⬛ Simple form design preferred

**2. AWS Textract**

    - 💸 Paid
    - ✅ Read text
    - ✅ Read tables, vertical field values
    - 🌃 Flexible for different file designs
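
Because both tools expose the same ``read_pdf`` call and return pages with a ``full_text`` attribute, calling code can pick a tool based on the trade-offs listed above. The sketch below illustrates that pattern with stand-in classes so it runs without AWS access. ``Page``, ``OCRTool``, ``StubFreeOCR``, ``StubTextract`` and ``choose_tool`` are hypothetical names introduced for illustration; in real code the library's ``FreeOCR`` and ``Textract`` classes would take the place of the stubs.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List, Protocol


    @dataclass
    class Page:
        """Hypothetical page result; only ``full_text`` appears in the README."""
        full_text: str


    class OCRTool(Protocol):
        """Anything with a ``read_pdf`` method that returns pages."""

        def read_pdf(self, path: str) -> List[Page]: ...


    class StubFreeOCR:
        """Stand-in for the free, text-only tool; performs no real OCR."""

        def read_pdf(self, path: str) -> List[Page]:
            return [Page(full_text=f"plain text from {path}")]


    class StubTextract:
        """Stand-in for the paid tool that also reads tables; performs no real OCR."""

        def read_pdf(self, path: str) -> List[Page]:
            return [Page(full_text=f"text and tables from {path}")]


    def choose_tool(needs_tables: bool) -> OCRTool:
        """Pick the paid tool only when tables or vertical fields are needed."""
        return StubTextract() if needs_tables else StubFreeOCR()


    tool = choose_tool(needs_tables=True)
    for page in tool.read_pdf("1.pdf"):
        print(page.full_text)

The loop over ``pages`` is identical for either tool, which is the practical benefit of the shared interface.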

            

Raw data

{
    "_id": null,
    "home_page": "https://www.thoughtful.ai/",
    "name": "t-ocr",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "thoughtful-ocr, t-ocr, t_ocr",
    "author": "Thoughtful",
    "author_email": "support@thoughtfulautomation.com",
    "download_url": "https://files.pythonhosted.org/packages/91/e0/d9aec0b22d60bebd5ae0513a2862401d1a5f731a0ba371d882bdbbf81c82/t_ocr-1.5.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Thoughtful OCR Package",
    "version": "1.5.3",
    "project_urls": {
        "Homepage": "https://www.thoughtful.ai/"
    },
    "split_keywords": [
        "thoughtful-ocr",
        " t-ocr",
        " t_ocr"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "91e0d9aec0b22d60bebd5ae0513a2862401d1a5f731a0ba371d882bdbbf81c82",
                "md5": "f1e3661c56da94fcd4c855a0c0fa4bd0",
                "sha256": "1a816ecc3c1c78e7be47a2268528d4d8f8acd711495ff39ed60155ba12e398e1"
            },
            "downloads": -1,
            "filename": "t_ocr-1.5.3.tar.gz",
            "has_sig": false,
            "md5_digest": "f1e3661c56da94fcd4c855a0c0fa4bd0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 27629,
            "upload_time": "2024-09-18T12:40:01",
            "upload_time_iso_8601": "2024-09-18T12:40:01.602912Z",
            "url": "https://files.pythonhosted.org/packages/91/e0/d9aec0b22d60bebd5ae0513a2862401d1a5f731a0ba371d882bdbbf81c82/t_ocr-1.5.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-18 12:40:01",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "t-ocr"
}
        