T-OCR Library
--------------
T-OCR is a Python library for fast, simple OCR parsing with several OCR tools. Use it to read scanned PDF documents and images that contain text.

|

Key Features
------------
1. File parsing is done in the AWS cloud, eliminating the need for additional dependencies on your machine.
2. Pages are parsed in parallel in AWS, which significantly speeds up parsing.
3. Using the library requires access to the T-OCR AWS component.
4. Combines the best OCR tools behind a consistent interface, making it easy to use them and to switch between them.

   .. code-block:: python

      # Import path assumed from the package name; adjust if it differs
      from t_ocr import FreeOCR

      # Parsing with Tesseract (Free OCR)
      free_ocr = FreeOCR()
      pages = free_ocr.read_pdf("1.pdf")
      for page in pages:
          print(page.full_text)

   .. code-block:: python

      # Import path assumed from the package name; adjust if it differs
      from t_ocr import Textract

      # Parsing with Textract
      textract = Textract()
      pages = textract.read_pdf("1.pdf")
      for page in pages:
          print(page.full_text)


5. The package includes helper functions for working with parsed text, such as date extraction and fuzzy string matching, which save time when reading OCR output.

   .. code-block:: python

      # Get dates etc. from text
      date_matches = get_valid_dates("Text - February 17, 2009, text...")  # 02/17/2009

      # Compare the real text with what the OCR read
      does_strings_match_by_fuzzy("Apple", "Appla", percentage=0.8)  # True
      does_text_contains_string_by_fuzzy("Apple", "Appla watch", percentage=0.8)  # True

      text = "Long article about Appla watch"
      find_string_in_text_by_fuzzy("Apple", text, percentage=0.8)  # 19


6. OCR results can be cached locally in the ``t_ocr_cache`` folder. Use caching during development to avoid waiting for repeated parsing.

   .. code-block:: python

      pdf_file = "1.pdf"

      # First run
      free_ocr.read_pdf(pdf_file, cache_data=True)  # Takes 15 seconds
      # Every subsequent run
      free_ocr.read_pdf(pdf_file, cache_data=True)  # Takes 0.01 seconds

      # If the file or the arguments change, that call gets its own cache entry
      free_ocr.read_pdf(pdf_file, dpi=500, cache_data=True)  # Takes 15 seconds
      free_ocr.read_pdf(pdf_file, dpi=500, cache_data=True)  # Takes 0.01 seconds

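
The fuzzy helpers in item 5 tolerate small OCR misreads. How T-OCR implements them is not documented here, so the following is only a minimal sketch of the idea using the standard library's ``difflib``. The function names ``strings_match_fuzzy`` and ``find_word_fuzzy`` are hypothetical, and the search is simplified to compare whole whitespace-separated words.

.. code-block:: python

   import re
   from difflib import SequenceMatcher


   def strings_match_fuzzy(expected: str, actual: str, threshold: float = 0.8) -> bool:
       """Return True if the two strings are at least `threshold` similar."""
       ratio = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
       return ratio >= threshold


   def find_word_fuzzy(needle: str, text: str, threshold: float = 0.8) -> int:
       """Return the start index of the first word similar to `needle`, or -1."""
       for token in re.finditer(r"\S+", text):
           if strings_match_fuzzy(needle, token.group(), threshold):
               return token.start()
       return -1


   print(strings_match_fuzzy("Apple", "Appla"))
   print(find_word_fuzzy("Apple", "Long article about Appla watch"))

A similarity threshold trades recall for precision: lowering it accepts more OCR noise but also more false matches.
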

|

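
The caching behaviour in item 6 (a new cache entry whenever the file or the arguments change) can be pictured as a cache keyed by a hash of the file contents plus the call arguments. This is a sketch under that assumption, not the library's actual implementation. ``cache_key`` and ``cached_call`` are hypothetical names, and the on-disk layout inside ``t_ocr_cache`` is assumed.

.. code-block:: python

   import hashlib
   import json
   import pickle
   from pathlib import Path

   CACHE_DIR = Path("t_ocr_cache")  # folder name from the README; layout is assumed


   def cache_key(file_path: str, **kwargs) -> str:
       """Hash the file contents together with the call arguments."""
       digest = hashlib.sha256(Path(file_path).read_bytes())
       digest.update(json.dumps(kwargs, sort_keys=True, default=str).encode("utf-8"))
       return digest.hexdigest()


   def cached_call(func, file_path: str, cache_dir: Path = CACHE_DIR, **kwargs):
       """Run func(file_path, **kwargs) once per unique (contents, arguments) pair."""
       cache_dir.mkdir(parents=True, exist_ok=True)
       cache_file = cache_dir / f"{cache_key(file_path, **kwargs)}.pkl"
       if cache_file.exists():
           # Cache hit: skip the slow OCR call entirely
           return pickle.loads(cache_file.read_bytes())
       result = func(file_path, **kwargs)
       cache_file.write_bytes(pickle.dumps(result))
       return result

Hashing the contents rather than the file name means an edited file with the same name is parsed again, which matches the "state of the file" wording above.
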
Supported OCR Tools
-------------------
**1. Tesseract (Free OCR)**

- 🆓 Free
- ✅ Read text
- ❌ Read tables, vertical field values
- ⬛ Simple form design preferred

**2. AWS Textract**

- 💸 Paid
- ✅ Read text
- ✅ Read tables, vertical field values
- 🌃 Flexible for different file designs

Raw data
{
"_id": null,
"home_page": "https://www.thoughtful.ai/",
"name": "t-ocr",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "thoughtful-ocr, t-ocr, t_ocr",
"author": "Thoughtful",
"author_email": "support@thoughtfulautomation.com",
"download_url": "https://files.pythonhosted.org/packages/91/e0/d9aec0b22d60bebd5ae0513a2862401d1a5f731a0ba371d882bdbbf81c82/t_ocr-1.5.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Thoughtful OCR Package",
"version": "1.5.3",
"project_urls": {
"Homepage": "https://www.thoughtful.ai/"
},
"split_keywords": [
"thoughtful-ocr",
" t-ocr",
" t_ocr"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "91e0d9aec0b22d60bebd5ae0513a2862401d1a5f731a0ba371d882bdbbf81c82",
"md5": "f1e3661c56da94fcd4c855a0c0fa4bd0",
"sha256": "1a816ecc3c1c78e7be47a2268528d4d8f8acd711495ff39ed60155ba12e398e1"
},
"downloads": -1,
"filename": "t_ocr-1.5.3.tar.gz",
"has_sig": false,
"md5_digest": "f1e3661c56da94fcd4c855a0c0fa4bd0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 27629,
"upload_time": "2024-09-18T12:40:01",
"upload_time_iso_8601": "2024-09-18T12:40:01.602912Z",
"url": "https://files.pythonhosted.org/packages/91/e0/d9aec0b22d60bebd5ae0513a2862401d1a5f731a0ba371d882bdbbf81c82/t_ocr-1.5.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-18 12:40:01",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "t-ocr"
}