<img src=https://img.shields.io/badge/Python-3.6%2B-blue alt="python versions supported">
# OCRfixr
## OVERVIEW
This project aims to help automate the challenging work of manually correcting OCR output from Distributed Proofreaders' book digitization projects.
## Correcting OCR Misreads
OCR software sometimes mistakes similar-looking characters for one another when scanning a book. For example, "l" and "1" are easily confused, which can cause the word "learn" to be misread as "1earn".
As written in book:
> _"The birds flevv south"_
Corrected text:
> _"The birds flew south"_
### How OCRfixr Works:
OCRfixr fixes misreads by checking __1) possible spell corrections__ against the __2) local context__ of the word. For example, here's how OCRfixr would evaluate the following OCR mistake:
As written in book:
> _"Days there were when small trade came to the __stoie__. Then the young clerk read._"
| Method | Plausible Replacements |
| --------------- | --------------- |
| Spellcheck (symspellpy) | stone, __store__, stoke, stove, stowe, stole, soie |
| Context (BERT) | market, shop, town, city, __store__, table, village, door, light, markets, surface, place, window, docks, area |
Since __store__ is both a plausible spellcheck replacement and a reasonable fit for the context of the sentence, OCRfixr updates the word.
Corrected text:
> _"Days there were when small trade came to the __store__. Then the young clerk read._"
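The matching step in the example above can be sketched as a simple list intersection. This is only an illustration using the candidate lists from the table, not OCRfixr's actual internals; the variable names are assumptions.

```python
# Candidate lists copied from the table above (illustrative only).
spell_candidates = ["stone", "store", "stoke", "stove", "stowe", "stole", "soie"]
context_candidates = [
    "market", "shop", "town", "city", "store", "table", "village", "door",
    "light", "markets", "surface", "place", "window", "docks", "area",
]

# Keep only the spellcheck candidates that the context model also proposed.
context_set = set(context_candidates)
matches = [word for word in spell_candidates if word in context_set]

print(matches)  # ['store']
```

Only one word survives both checks, which is why the replacement is considered safe to apply.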
For very common scanning errors where it is clear what the word should have been (ex: 'onlv' --> 'only'), OCRfixr skips the context check and relies solely on a static mapping of common corrections. This helps to maximize the number of successful edits and decrease compute time. (You can disable this by setting `common_scannos` to "F".)
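As a rough sketch of the shortcut described above, a static mapping can be consulted before any context check is attempted. The dictionary and function names here are hypothetical, and only the 'onlv' entry comes from this README; the mapping shipped with the package is assumed to be much larger.

```python
# Hypothetical stand-in for the packaged list of common scanning errors.
COMMON_SCANNOS = {"onlv": "only"}


def quick_fix(word, use_common_scannos=True):
    """Return a known correction for `word`, or None if none applies.

    A None result means the word would still need the full
    spellcheck-plus-context evaluation.
    """
    if not use_common_scannos:
        return None
    return COMMON_SCANNOS.get(word)


print(quick_fix("onlv"))
print(quick_fix("onlv", use_common_scannos=False))
```

A dictionary lookup avoids running the BERT model at all for these words, which is where the compute savings would come from.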
### Using OCRfixr
The package can be installed using [pip](https://pypi.org/project/OCRfixr/).
```bash
pip install OCRfixr
```
By default, OCRfixr returns only the corrected string, with all changes incorporated:
```python
>>> from ocrfixr import spellcheck
>>> text = "The birds flevv south"
>>> spellcheck(text).fix()
'The birds flew south'
```
Use __return_fixes__ to also include all corrections made to the text, with associated counts for each:
```python
>>> spellcheck(text, return_fixes = "T").fix()
['The birds flew south', {("flevv","flew"):1}]
```
_(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debug feature)_
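One way to act on the note above is to run the fixer one paragraph at a time and report only the paragraphs that changed. The helper below is a hypothetical sketch, not part of OCRfixr; it accepts any function that maps a string to a corrected string, and it assumes paragraphs are separated by '\n'.

```python
def changed_paragraphs(text, fixer):
    """Return (paragraph_number, original, fixed) for each changed paragraph.

    `fixer` is any callable that takes a string and returns the corrected
    string. Paragraphs are assumed to be separated by a newline.
    """
    results = []
    for number, paragraph in enumerate(text.split("\n"), start=1):
        fixed = fixer(paragraph)
        if fixed != paragraph:
            results.append((number, paragraph, fixed))
    return results


# Stand-in fixer so the sketch runs without loading any model.
def stub_fixer(paragraph):
    return paragraph.replace("flevv", "flew")


for row in changed_paragraphs("The birds flevv south\nAll is well", stub_fixer):
    print(row)
```

With OCRfixr installed, the stub could be swapped for `lambda p: spellcheck(p).fix()`.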
### Interactive Mode
OCRfixr also has an option for the user to interactively accept/reject suggested changes to the text:
```python
>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
>>> spellcheck(text, interactive = "T").fix()
```
<img width="723" alt="Suggestion 1" src="https://user-images.githubusercontent.com/67446041/107133270-7918c300-68b4-11eb-9de5-5b6282510c16.png">
Each suggestion provides the local context around the garbled text, so that the user can determine if the suggestion fits.
<img width="723" alt="Suggestion 2" src="https://user-images.githubusercontent.com/67446041/115068768-af7c4b00-9ec0-11eb-9c7a-65b518718ec4.png">
```python
>>> ### User accepts change to "flevv", but rejects change to "wefe" in GUI
'The birds flew down\n south, but wefe quickly apprehended\n by border patrol agents'
```
This returns the text with all accepted changes reflected. All rejected suggestions are left as-is in the text.
### Command-Line
OCRfixr can also be called from the command line (intended for Guiguts use):
```bash
ocrfixr input_text.txt output_filename.txt
```
The output file will list the line number and position of all suggested changes.
### Avoiding "Damn You, Autocorrect!"
By design, OCRfixr is change-averse:
- If spellcheck/context do not line up, no update is made.
- Likewise, if more than one word lines up for spellcheck/context, no update is made.
- Only the top 15 context suggestions are considered, to limit low-probability matches.
- If the suggestion is a homophone of the original word, it is ignored (original: coupla --> suggestion: couple). These are assumed to be 'stylistic' or phonetic misspellings.
- Proper nouns (anything starting with a capital letter) are not evaluated for spelling.
Word context is drawn from all sentences in the current paragraph (paragraphs are delimited by '\n'), which maximizes the available information without bogging down the BERT model.
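The rules listed above can be read as a single decision function. What follows is a minimal sketch under stated assumptions, not the package's implementation: `choose_fix` and `sounds_alike` are hypothetical names, and the phonetic comparison is passed in as a callable rather than tied to any particular library.

```python
from typing import Callable, Optional, Sequence

TOP_K = 15  # the README states that only the top 15 context suggestions count


def choose_fix(
    word: str,
    spell_candidates: Sequence[str],
    context_candidates: Sequence[str],
    sounds_alike: Callable[[str, str], bool] = lambda a, b: False,
) -> Optional[str]:
    """Return a replacement for `word`, or None to leave it unchanged."""
    # Proper nouns (anything starting with a capital letter) are skipped.
    if word[:1].isupper():
        return None

    # Only the highest-ranked context suggestions are considered.
    top_context = set(context_candidates[:TOP_K])
    matches = [c for c in spell_candidates if c in top_context]

    # No overlap, or more than one overlapping word: make no change.
    if len(matches) != 1:
        return None

    # Homophones are assumed to be stylistic spellings and are left alone.
    if sounds_alike(word, matches[0]):
        return None

    return matches[0]


print(choose_fix("stoie", ["stone", "store", "stove"], ["shop", "store", "town"]))
```

Every branch except the last returns `None`, which reflects the change-averse design: a word is replaced only when nothing argues against it.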
## Credits
- __symspellpy__ powers spellcheck suggestions
- __transformers__ does the heavy lifting for BERT context modelling
- __DataMunging__ provided a very useful list of common scanning errors
- __SCOWL__ word list is Copyright 2000-2019 by Kevin Atkinson.
- This project was created to help __Distributed Proofreaders__. Support them here: <https://www.pgdp.net/c/>
Raw data
{
"_id": null,
"home_page": "https://github.com/ja-mcm/ocrfixr",
"name": "OCRfixr",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "ocrfixr,spellcheck,OCR,contextual,BERT",
"author": "Jack McMahon",
"author_email": "OCRfixr@mcmahon.work",
"download_url": "https://files.pythonhosted.org/packages/5a/ee/40f5fcb864530ebecb7393e59db95aea92ac187730ff478cf7b23a35f390/OCRfixr-1.5.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GNU General Public License v3",
"summary": "A contextual spellchecker for OCR output",
"version": "1.5.1",
"split_keywords": [
"ocrfixr",
"spellcheck",
"ocr",
"contextual",
"bert"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0da80cca3e33942db80a5d5d68090b69fa68dd7ebf3ce5b7cac5c49a6ca7f747",
"md5": "670f8424c85d351cfd702e5b2917e741",
"sha256": "94681cfb363910a0703a12601c0c6c21ffaabe32c6f4edd887993d5966003a51"
},
"downloads": -1,
"filename": "OCRfixr-1.5.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "670f8424c85d351cfd702e5b2917e741",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 437364,
"upload_time": "2023-02-03T19:41:12",
"upload_time_iso_8601": "2023-02-03T19:41:12.565995Z",
"url": "https://files.pythonhosted.org/packages/0d/a8/0cca3e33942db80a5d5d68090b69fa68dd7ebf3ce5b7cac5c49a6ca7f747/OCRfixr-1.5.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5aee40f5fcb864530ebecb7393e59db95aea92ac187730ff478cf7b23a35f390",
"md5": "cc06df89a3dc64689057818e394491b1",
"sha256": "acb0a2ded5c837bc26be5ab7b20438cf0e188155ba8b167d25a39664875e1131"
},
"downloads": -1,
"filename": "OCRfixr-1.5.1.tar.gz",
"has_sig": false,
"md5_digest": "cc06df89a3dc64689057818e394491b1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 438375,
"upload_time": "2023-02-03T19:41:14",
"upload_time_iso_8601": "2023-02-03T19:41:14.921702Z",
"url": "https://files.pythonhosted.org/packages/5a/ee/40f5fcb864530ebecb7393e59db95aea92ac187730ff478cf7b23a35f390/OCRfixr-1.5.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-02-03 19:41:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "ja-mcm",
"github_project": "ocrfixr",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "pip",
"specs": [
[
"==",
"19.2.3"
]
]
},
{
"name": "flake8",
"specs": [
[
"==",
"3.7.8"
]
]
},
{
"name": "numpy",
"specs": [
[
"~=",
"1.19.2"
]
]
},
{
"name": "transformers",
"specs": []
},
{
"name": "Tensorflow",
"specs": [
[
">=",
"2.0"
]
]
},
{
"name": "symspellpy",
"specs": []
},
{
"name": "importlib_resources",
"specs": []
},
{
"name": "metaphone",
"specs": []
},
{
"name": "tqdm",
"specs": []
}
],
"lcname": "ocrfixr"
}