OCRfixr

Name	OCRfixr
Version	1.5.1
home_page	https://github.com/ja-mcm/ocrfixr
Summary	A contextual spellchecker for OCR output
upload_time	2023-02-03 19:41:14
maintainer
docs_url	None
author	Jack McMahon
requires_python	>=3.6
license	GNU General Public License v3
keywords	ocrfixr spellcheck ocr contextual bert
VCS
bugtrack_url
requirements	pip flake8 numpy transformers Tensorflow symspellpy importlib_resources metaphone tqdm
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

<img src="https://img.shields.io/badge/Python-3.6%2B-blue" alt="python versions supported">

# OCRfixr

## OVERVIEW 
This project aims to help automate the challenging work of manually correcting OCR output from Distributed Proofreaders' book digitization projects.


## Correcting OCR Misreads
OCR software can sometimes confuse similar-looking characters when scanning a book. For example, "l" and "1" are easily confused, which can cause the word "learn" to be misread as "1earn".

As written in book: 
> _"The birds flevv south"_

Corrected text:
> _"The birds flew south"_

### How OCRfixr Works:
OCRfixr fixes misreads by checking __1) possible spell corrections__ against the __2) local context__ of the word. For example, here's how OCRfixr would evaluate the following OCR mistake:

As written in book: 
> _"Days there were when small trade came to the __stoie__. Then the young clerk read."_

| Method | Plausible Replacements |
| --------------- | --------------- | 
| Spellcheck (symspellpy) | stone, __store__, stoke, stove, stowe, stole, soie |
| Context (BERT) | market, shop, town, city, __store__, table, village, door, light, markets, surface, place, window, docks, area |

Since exactly one word is both a plausible spellcheck replacement and a reasonable fit for the context of the sentence, OCRfixr updates the word.

Corrected text:
> _"Days there were when small trade came to the __store__. Then the young clerk read."_

For very common scanning errors where the intended word is clear (e.g., 'onlv' --> 'only'), OCRfixr skips the context check and relies solely on a static mapping of common corrections. This maximizes the number of successful edits and reduces compute time. (You can disable this by setting __common_scannos__ to "F".)
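
The matching step described above can be sketched in a few lines. This is only an illustration, not OCRfixr's internal code: the function name `plausible_replacements`, the one-entry `COMMON_SCANNOS` table, and the boolean flag are assumptions made for the example (the real package takes "T"/"F" strings), and the candidate lists are copied from the table above rather than produced by symspellpy or BERT.

```python
# Illustrative sketch only; names and data here are hypothetical, not OCRfixr internals.
COMMON_SCANNOS = {"onlv": "only"}  # stand-in for the static mapping of common corrections


def plausible_replacements(word, spell_candidates, context_candidates, common_scannos=True):
    """Return the candidates that pass both the spellcheck and the context check."""
    # Shortcut: a known scanning error bypasses the context check entirely.
    if common_scannos and word in COMMON_SCANNOS:
        return [COMMON_SCANNOS[word]]
    # Otherwise keep only spellcheck candidates that also appear in the context list.
    context = set(context_candidates)
    return [candidate for candidate in spell_candidates if candidate in context]


spell = ["stone", "store", "stoke", "stove", "stowe", "stole", "soie"]
context = ["market", "shop", "town", "city", "store", "table", "village", "door",
           "light", "markets", "surface", "place", "window", "docks", "area"]

matches = plausible_replacements("stoie", spell, context)
scanno = plausible_replacements("onlv", [], [])
print(matches, scanno)
```

With the lists from the table, the only word present in both is "store", which is why the example sentence is updated.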

### Using OCRfixr

The package can be installed using [pip](https://pypi.org/project/OCRfixr/). 

```bash
pip install OCRfixr
```

By default, OCRfixr returns only the corrected string, with all changes incorporated:
```python
>>> from ocrfixr import spellcheck

>>> text = "The birds flevv south"
>>> spellcheck(text).fix()
'The birds flew south'
```

Use __return_fixes__ to also include all corrections made to the text, with associated counts for each:
```python
>>> spellcheck(text, return_fixes = "T").fix()
['The birds flew south', {("flevv","flew"):1}]
```
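
As a small follow-up, the returned list can be unpacked into the corrected text and the dictionary of fixes. The sketch below does not call OCRfixr; it starts from the literal return value shown above, and the variable names are chosen only for illustration.

```python
# The value below is copied from the example output above, not generated here.
result = ['The birds flew south', {("flevv", "flew"): 1}]

# First element: corrected text. Second element: {(original, replacement): count}.
fixed_text, fixes = result

summary = []
for (original, replacement), count in fixes.items():
    summary.append(f"{original} -> {replacement} (x{count})")

print(fixed_text)
print("\n".join(summary))
```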

_(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debug feature)_
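
One way to act on this note is to run the fixer one paragraph at a time and record which paragraphs changed. The helper below is a hypothetical sketch: `fix_by_paragraph` is not part of OCRfixr, and it takes the fixer as a parameter so it can be tried with a stand-in function. In real use you would presumably pass something like `lambda p: spellcheck(p).fix()`.

```python
def fix_by_paragraph(text, fixer):
    """Apply `fixer` to each newline-delimited paragraph and report what changed.

    `fixer` is any callable mapping a string to a corrected string.
    Returns a list of (index, original, fixed, changed) tuples.
    """
    report = []
    for index, paragraph in enumerate(text.split("\n")):
        fixed = fixer(paragraph)
        report.append((index, paragraph, fixed, fixed != paragraph))
    return report


# Stand-in fixer for demonstration only; it knows a single correction.
def demo_fixer(paragraph):
    return paragraph.replace("flevv", "flew")


report = fix_by_paragraph("The birds flevv south\nThey landed safely", demo_fixer)
for index, original, fixed, changed in report:
    if changed:
        print(f"paragraph {index}: {original!r} -> {fixed!r}")
```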


### Interactive Mode
OCRfixr also has an option for the user to interactively accept/reject suggested changes to the text:

```python
>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
>>> spellcheck(text, interactive = "T").fix()
```

<img width="723" alt="Suggestion 1" src="https://user-images.githubusercontent.com/67446041/107133270-7918c300-68b4-11eb-9de5-5b6282510c16.png">

Each suggestion provides the local context around the garbled text, so that the user can determine if the suggestion fits.

<img width="723" alt="Suggestion 2" src="https://user-images.githubusercontent.com/67446041/115068768-af7c4b00-9ec0-11eb-9c7a-65b518718ec4.png">

```python
>>> ### User accepts change to "flevv", but rejects change to "wefe" in GUI
'The birds flew down\n south, but wefe quickly apprehended\n by border patrol agents'
```

This returns the text with all accepted changes reflected. All rejected suggestions are left as-is in the text.

### Command-Line 
OCRfixr can also be called from the command line (intended for Guiguts use):

```bash
ocrfixr input_text.txt output_filename.txt
```

The output file will list the line number and position of all suggested changes.


### Avoiding "Damn You, Autocorrect!"
By design, OCRfixr is change-averse:
- If spellcheck/context do not line up, no update is made.
- Likewise, if more than one word lines up for spellcheck/context, no update is made.
- Only the top 15 context suggestions are considered, to limit low-probability matches.
- If the suggestion is a homophone of the original word, it is ignored (original: coupla --> suggestion: couple). These are assumed to be 'stylistic' or phonetic misspellings.
- Proper nouns (anything starting with a capital letter) are not evaluated for spelling.

Word context is drawn from all sentences in the current paragraph (paragraphs are delimited by '\n'). This maximizes the available information without bogging down the BERT model.
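
Several of the rules listed above can be expressed as simple guards. The sketch below is an assumption-laden illustration, not the package's implementation: `guarded_fix` and its parameters are hypothetical names, the candidate lists are supplied by hand, and the homophone rule is left out because it depends on a phonetic-encoding step not shown here.

```python
def guarded_fix(word, spell_candidates, ranked_context, top_n=15):
    """Return a replacement only when the change-averse rules all allow it."""
    # Proper nouns (anything starting with a capital letter) are not evaluated.
    if word[:1].isupper():
        return word
    # Only the top-ranked context suggestions are considered.
    context = set(ranked_context[:top_n])
    matches = [candidate for candidate in spell_candidates if candidate in context]
    # Zero matches or more than one match: leave the word untouched.
    if len(matches) != 1:
        return word
    return matches[0]


spell = ["stone", "store", "stoke", "stove", "stowe", "stole", "soie"]
context = ["market", "shop", "town", "city", "store", "table", "village", "door",
           "light", "markets", "surface", "place", "window", "docks", "area"]

print(guarded_fix("stoie", spell, context))                # single match
print(guarded_fix("Stoie", spell, context))                # capitalized, skipped
print(guarded_fix("stoie", spell, ["store", "stone"]))     # ambiguous, skipped
```

Returning the original word in every doubtful case mirrors the design goal stated above: a missed correction is preferred over a wrong one.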



## Credits

- __symspellpy__ powers spellcheck suggestions
- __transformers__ does the heavy lifting for BERT context modelling
- __DataMunging__ provided a very useful list of common scanning errors 
- __SCOWL__ word list is Copyright 2000-2019 by Kevin Atkinson.
- This project was created to help __Distributed Proofreaders__. Support them here: <https://www.pgdp.net/c/>

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ja-mcm/ocrfixr",
    "name": "OCRfixr",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "ocrfixr,spellcheck,OCR,contextual,BERT",
    "author": "Jack McMahon",
    "author_email": "OCRfixr@mcmahon.work",
    "download_url": "https://files.pythonhosted.org/packages/5a/ee/40f5fcb864530ebecb7393e59db95aea92ac187730ff478cf7b23a35f390/OCRfixr-1.5.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GNU General Public License v3",
    "summary": "A contextual spellchecker for OCR output",
    "version": "1.5.1",
    "split_keywords": [
        "ocrfixr",
        "spellcheck",
        "ocr",
        "contextual",
        "bert"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0da80cca3e33942db80a5d5d68090b69fa68dd7ebf3ce5b7cac5c49a6ca7f747",
                "md5": "670f8424c85d351cfd702e5b2917e741",
                "sha256": "94681cfb363910a0703a12601c0c6c21ffaabe32c6f4edd887993d5966003a51"
            },
            "downloads": -1,
            "filename": "OCRfixr-1.5.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "670f8424c85d351cfd702e5b2917e741",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 437364,
            "upload_time": "2023-02-03T19:41:12",
            "upload_time_iso_8601": "2023-02-03T19:41:12.565995Z",
            "url": "https://files.pythonhosted.org/packages/0d/a8/0cca3e33942db80a5d5d68090b69fa68dd7ebf3ce5b7cac5c49a6ca7f747/OCRfixr-1.5.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5aee40f5fcb864530ebecb7393e59db95aea92ac187730ff478cf7b23a35f390",
                "md5": "cc06df89a3dc64689057818e394491b1",
                "sha256": "acb0a2ded5c837bc26be5ab7b20438cf0e188155ba8b167d25a39664875e1131"
            },
            "downloads": -1,
            "filename": "OCRfixr-1.5.1.tar.gz",
            "has_sig": false,
            "md5_digest": "cc06df89a3dc64689057818e394491b1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 438375,
            "upload_time": "2023-02-03T19:41:14",
            "upload_time_iso_8601": "2023-02-03T19:41:14.921702Z",
            "url": "https://files.pythonhosted.org/packages/5a/ee/40f5fcb864530ebecb7393e59db95aea92ac187730ff478cf7b23a35f390/OCRfixr-1.5.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-02-03 19:41:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "ja-mcm",
    "github_project": "ocrfixr",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pip",
            "specs": [
                [
                    "==",
                    "19.2.3"
                ]
            ]
        },
        {
            "name": "flake8",
            "specs": [
                [
                    "==",
                    "3.7.8"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "~=",
                    "1.19.2"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": []
        },
        {
            "name": "Tensorflow",
            "specs": [
                [
                    ">=",
                    "2.0"
                ]
            ]
        },
        {
            "name": "symspellpy",
            "specs": []
        },
        {
            "name": "importlib_resources",
            "specs": []
        },
        {
            "name": "metaphone",
            "specs": []
        },
        {
            "name": "tqdm",
            "specs": []
        }
    ],
    "lcname": "ocrfixr"
}
