natas

Name	natas
Version	1.2.1
home_page	https://github.com/mikahama/natas
Summary	Python library for processing historical English
upload_time	2024-08-10 15:33:35
maintainer	None
docs_url	None
author	Mika Hämäläinen
requires_python	None
license	Apache 2.0
keywords	historical english spelling normalization ocr error correction
requirements	No requirements were recorded.

# NATAS

[![Downloads](https://pepy.tech/badge/natas)](https://pepy.tech/project/natas)


This library provides methods for processing historical English corpora, especially for studying neologisms. Its first features cover normalization of historical spelling and OCR post-correction. The library is maintained by [Mika Hämäläinen](https://mikakalevi.com).


## Installation

Note: Using a virtual environment is highly recommended because the dependencies have strict version requirements. The library has been tested with Python 3.6.

    pip3 install natas
    python3 -m natas.download
    python3 -m spacy download en_core_web_md

## Historical normalization

For a list of non-modern spelling variants, the tool can produce an ordered list of the candidate normalizations. The candidates are ordered based on the prediction score of the NMT model.

    import natas
    natas.normalize_words(["seacreat", "wiþe"])
    >> [['secret', 'secrete'], ['with', 'withe', 'wide', 'white', 'way']]

Possible keyword arguments are n_best=10, dictionary=None, all_candidates=True, correct_spelling_cache=True, return_scores=False.

- *n_best* sets the number of candidates the NMT model will output.
- *dictionary* sets a custom dictionary used to filter the NMT output (see the section on checking whether a word is correctly spelled).
- *all_candidates*: if False, the method returns only the topmost normalization candidate, which makes it faster.
- *correct_spelling_cache* is used only when checking whether a candidate word is correctly spelled. Set it to False if you are testing with multiple *dictionaries*.
- *return_scores*: if True, the method returns the model's prediction scores together with the normalization candidates, for example [['secret', -1.0969021320343018], ['secrete', -4.121032238006592]].
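
The scored output can be reduced to a single normalization per word. The following is a minimal sketch, not part of natas: it assumes that with *return_scores=True* the method returns one list of `[candidate, score]` pairs per input word (as in the example above) and that a higher score means a better candidate. The helper name `pick_best` and the hard-coded sample data are illustrative only.

```python
def pick_best(words, scored_candidates):
    """Return the highest-scoring candidate for each word.

    Assumption: scored_candidates[i] is a list of [candidate, score]
    pairs for words[i]. If a word has no candidates, the original
    word is kept unchanged.
    """
    best = []
    for word, candidates in zip(words, scored_candidates):
        if not candidates:
            best.append(word)
            continue
        top, _score = max(candidates, key=lambda pair: pair[1])
        best.append(top)
    return best


# Hard-coded sample data in the shape shown above, so the sketch runs
# without the natas models being installed.
words = ["seacreat", "xyzzy"]
scored = [
    [["secret", -1.0969021320343018], ["secrete", -4.121032238006592]],
    [],
]
print(pick_best(words, scored))  # ['secret', 'xyzzy']
```

With the library installed, `scored` would instead come from `natas.normalize_words(words, return_scores=True)`, provided the output has the shape assumed here.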

## OCR post-correction

You can use our pretrained model for OCR post-correction as follows:

    import natas
    natas.ocr_correct_words(["paft", "friendlhip"])
    >> [['past', 'pall', 'part', 'part'], ['friendship']]

This returns a list of possible correction candidates, ordered by probability according to the NMT model. The same keyword arguments can be used as for historical text normalization.
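
When correcting running text, it can make sense to send only the words that fail the spelling check to the model. The sketch below is one possible way to do this and is not part of natas. The helper `correct_tokens` is hypothetical, and it takes the correction and spell-check functions as arguments so that it can be run here with small stand-ins. It lowercases tokens because the models operate on lowercased words.

```python
def correct_tokens(tokens, correct_words, is_correct):
    """Replace misspelled tokens with their top correction candidate.

    correct_words: callable taking a list of words and returning one
        list of candidates per word (best candidate first).
    is_correct: callable taking a word and returning True or False.
    """
    lowered = [token.lower() for token in tokens]
    # Look up each suspicious word only once.
    suspicious = sorted({t for t in lowered if not is_correct(t)})
    candidates = correct_words(suspicious) if suspicious else []
    replacements = {
        word: cands[0] for word, cands in zip(suspicious, candidates) if cands
    }
    # Words without any candidate are left as they are.
    return [replacements.get(t, t) for t in lowered]


# Stand-ins for the real functions, for illustration only.
known_words = {"the", "past", "friendship"}
fake_model = {"paft": ["past", "part"], "friendlhip": ["friendship"]}

corrected = correct_tokens(
    ["The", "paft", "friendlhip", "qqq"],
    correct_words=lambda ws: [fake_model.get(w, []) for w in ws],
    is_correct=lambda w: w in known_words,
)
print(corrected)  # ['the', 'past', 'friendship', 'qqq']
```

With the library installed, the stand-ins could be swapped for `natas.ocr_correct_words` and `natas.is_correctly_spelled`, assuming the default arguments return candidate lists as shown above.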

### Training your own OCR error correction model

You can extract the parallel data for the OCR model if you have access to a word embeddings model trained on your OCR data, a list of known correctly spelled words, and a vocabulary of the language.

    from natas import ocr_builder
    from natas.normalize import wiktionary
    from gensim.models import Word2Vec

    model = Word2Vec.load("/path/to/your_model.w2v")
    seed_words = set(["logic", "logical"]) # set of correctly spelled words you want to find matching OCR errors for
    dictionary = wiktionary # lemmas of the English Wiktionary; change this if you work with any other language
    lemmatize = True # uses Spacy with the English model; use natas.set_spacy(nlp) for other models and languages

    results = ocr_builder.extract_parallel(seed_words, model, dictionary=dictionary, lemmatize=lemmatize, use_freq=False)
    >> {"logic": {
        "fyle": 5, 
        "ityle": 5, 
        "lofophy": 5, 
        "logick": 1
    }, 
    "logical": {
        "lofophy": 5, 
        "matical": 3, 
        "phical": 3, 
        "praaical": 4, 
        "pracical": 4, 
        "pratical": 4
    }}

The result is a dictionary that maps each correctly spelled English word (from *seed_words*) to semantically similar words that are not correctly spelled (not in *dictionary*). Each of these words is paired with its [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) from the correctly spelled word. In our paper, we used 3 as the maximum edit distance.
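
To make the distances in the output above concrete, here is a small textbook implementation of the Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions). It is a sketch for illustration; it is not necessarily how natas computes the value internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using the classic dynamic program."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("logick", "logic"))      # 1
print(levenshtein("matical", "logical"))   # 3
print(levenshtein("pratical", "logical"))  # 4
```

These values match the example output: *logick* is one deletion away from *logic*, while *pratical* would be excluded by a maximum edit distance of 3.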

Use the dictionary to make parallel data files for OpenNMT on a character level. This means splitting the words into letters, such as *l o g i c k* -> *l o g i c*. See [their documentation on how to train the model](https://github.com/OpenNMT/OpenNMT-py).
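
The conversion to character-level parallel data could look roughly like the sketch below. It is an illustration under stated assumptions, not the exact script used for the paper: the helper names, the file names, and the default threshold of 3 (taken from the maximum edit distance mentioned above) are choices made for this example. The OCR error is the source side and the correctly spelled word is the target side.

```python
def to_char_level(results, max_distance=3):
    """Turn extract_parallel-style output into aligned source/target lines.

    results: {correct_word: {ocr_error: edit_distance, ...}, ...}
    Returns two lists of equal length whose items are words split into
    space-separated characters.
    """
    sources, targets = [], []
    for correct_word in sorted(results):
        for error, distance in sorted(results[correct_word].items()):
            if distance > max_distance:
                continue  # too far from the correct word to be a likely OCR error
            sources.append(" ".join(error))
            targets.append(" ".join(correct_word))
    return sources, targets


def write_lines(path, lines):
    """Write one training example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


# A shortened version of the example output above.
results = {
    "logic": {"fyle": 5, "logick": 1},
    "logical": {"matical": 3, "pratical": 4},
}
sources, targets = to_char_level(results)
print(sources)  # ['l o g i c k', 'm a t i c a l']
print(targets)  # ['l o g i c', 'l o g i c a l']

# Hypothetical file names; adjust them to your OpenNMT configuration.
# write_lines("src-train.txt", sources)
# write_lines("tgt-train.txt", targets)
```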

## Check if a word is correctly spelled

You can easily check whether a word is correctly spelled:

    import natas
    natas.is_correctly_spelled("cat")
    natas.is_correctly_spelled("ca7")
    >> True
    >> False

This compares the word against Wiktionary lemmas, both with and without Spacy lemmatization. The normalization method depends on this step. By default, *natas* uses Spacy's *en_core_web_md* model. To change the model, do the following:

    import natas, spacy
    nlp = spacy.load('en')
    natas.set_spacy(nlp)

If you want to replace the Wiktionary dictionary with another one, pass it as a keyword argument. Use a *set* instead of a *list* for faster look-up. Note that the models operate on lowercased words.

    import natas
    my_dictionary = set(["hat", "rat"])
    natas.is_correctly_spelled("cat", dictionary=my_dictionary)
    natas.normalize_words(["ratte"], dictionary=my_dictionary)


By default, caching is enabled. If you call the method with varying parameters (for example, different dictionaries), set *cache=False* so that earlier results are not reused.

    import natas
    natas.is_correctly_spelled("cat") #The word is looked up and the result cached
    natas.is_correctly_spelled("cat") #The result will be served from the cache
    natas.is_correctly_spelled("cat", cache=False) #The word will be looked up again
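
The reason for disabling the cache can be shown with a toy model. The code below is not the natas implementation. It assumes, for illustration only, a cache keyed by the word alone, which is enough to show how a result computed with one dictionary could be served for another.

```python
_toy_cache = {}


def toy_is_correct(word, dictionary, cache=True):
    """Toy spelling check with a cache keyed by the word only (an assumption)."""
    if cache and word in _toy_cache:
        return _toy_cache[word]
    result = word in dictionary
    if cache:
        _toy_cache[word] = result
    return result


first = toy_is_correct("cat", {"cat", "hat"})               # looked up and cached
stale = toy_is_correct("cat", {"hat", "rat"})               # served from the cache
fresh = toy_is_correct("cat", {"hat", "rat"}, cache=False)  # looked up again
print(first, stale, fresh)  # True True False
```

The second call returns a stale answer because the dictionary changed but the cached entry did not, which is the situation that *cache=False* avoids.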

# Cite

If you use the library, please cite one of the following publications depending on whether you used it for normalization or OCR correction.

## Normalization

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2019. [Revisiting NMT for Normalization of Early English Letters](https://www.aclweb.org/anthology/papers/W/W19/W19-2509/). In *Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature*.

    @inproceedings{hamalainen-etal-2019-revisiting,
    title = "Revisiting {NMT} for Normalization of Early {E}nglish Letters",
    author = {H{\"a}m{\"a}l{\"a}inen, Mika  and
      S{\"a}ily, Tanja  and
      Rueter, Jack  and
      Tiedemann, J{\"o}rg  and
      M{\"a}kel{\"a}, Eetu},
    booktitle = "Proceedings of the 3rd Joint {SIGHUM} Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature",
    month = jun,
    year = "2019",
    address = "Minneapolis, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-2509",
    doi = "10.18653/v1/W19-2509",
    pages = "71--75",
    abstract = "This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only less frequent deviant forms are left out without normalization. This paper discusses different methods for improving the normalization of these deviant forms by using different approaches. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model together with lemmatization improves results.",
    }

## OCR correction

Mika Hämäläinen and Simon Hengchen. 2019. [From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction](https://www.aclweb.org/anthology/R19-1051/). In *Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)*.

    @inproceedings{hamalainen-hengchen-2019-paft,
    title = "From the Paft to the Fiiture: a Fully Automatic {NMT} and Word Embeddings Method for {OCR} Post-Correction",
    author = {H{\"a}m{\"a}l{\"a}inen, Mika  and
      Hengchen, Simon},
    booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)",
    month = sep,
    year = "2019",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd.",
    url = "https://www.aclweb.org/anthology/R19-1051",
    doi = "10.26615/978-954-452-056-4_051",
    pages = "431--436",
    abstract = "A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.",
    }

