:Name: pyspellchecker
:Version: 0.8.1
:Summary: Pure Python spell checker based on work by Peter Norvig
:Author: Tyler Barrus
:Upload time: 2024-01-20 13:05:23
:Requires Python: >=3.7
:License: MIT
:Keywords: python, spelling, natural language processing, nlp, typo, checker
pyspellchecker
===============================================================================

.. image:: https://img.shields.io/badge/license-MIT-blue.svg
    :target: https://opensource.org/licenses/MIT/
    :alt: License
.. image:: https://img.shields.io/github/release/barrust/pyspellchecker.svg
    :target: https://github.com/barrust/pyspellchecker/releases
    :alt: GitHub release
.. image:: https://github.com/barrust/pyspellchecker/workflows/Python%20package/badge.svg
    :target: https://github.com/barrust/pyspellchecker/actions?query=workflow%3A%22Python+package%22
    :alt: Build Status
.. image:: https://codecov.io/gh/barrust/pyspellchecker/branch/master/graph/badge.svg?token=OdETiNgz9k
    :target: https://codecov.io/gh/barrust/pyspellchecker
    :alt: Test Coverage
.. image:: https://badge.fury.io/py/pyspellchecker.svg
    :target: https://badge.fury.io/py/pyspellchecker
    :alt: PyPi Package
.. image:: http://pepy.tech/badge/pyspellchecker
    :target: https://pepy.tech/project/pyspellchecker
    :alt: Downloads


Pure Python Spell Checking based on `Peter
Norvig's <https://norvig.com/spell-correct.html>`__ blog post on setting
up a simple spell checking algorithm.

It uses a `Levenshtein Distance <https://en.wikipedia.org/wiki/Levenshtein_distance>`__
algorithm to find permutations within an edit distance of 2 from the
original word. It then compares all permutations (insertions, deletions,
replacements, and transpositions) to known words in a word frequency
list. Those words that are found more often in the frequency list are
**more likely** the correct results.
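The candidate generation described above follows Norvig's construction. A minimal, self-contained sketch of the idea (not the library's internal implementation) looks like this:

.. code:: python

    import string

    def edits1(word):
        """All strings one edit away: deletions, transpositions, replacements, insertions."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(word):
        """All strings within an edit distance of two."""
        return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

For example, ``'the'`` is in ``edits1('teh')`` (a transposition), and ``'happening'`` is in ``edits2('hapenning')`` (an insertion plus a deletion). The library then keeps only the candidates that appear in its word frequency list.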

``pyspellchecker`` supports multiple languages including English, Spanish,
German, French, Portuguese, Arabic, and Basque. For information on how the dictionaries were
created and how they can be updated and improved, please see the
**Dictionary Creation and Updating** section of the readme!

``pyspellchecker`` supports **Python 3**

``pyspellchecker`` allows you to set the Levenshtein distance (up to two) used when
checking. For longer words, it is highly recommended to use a distance of 1 rather
than the default of 2. See the quickstart for how to change the distance parameter.
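To see why distance 2 becomes expensive for longer words, count the candidate strings generated at distance 1 (counts follow Norvig's construction, before de-duplication, assuming a 26-letter alphabet); distance 2 applies this expansion twice:

.. code:: python

    def edits1_count(n, alphabet=26):
        # deletions + transpositions + replacements + insertions
        return n + (n - 1) + alphabet * n + alphabet * (n + 1)

    print(edits1_count(5))   # 295 candidates for a 5-letter word
    print(edits1_count(15))  # 835 candidates for a 15-letter word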


Installation
-------------------------------------------------------------------------------

The easiest method to install is using pip:

.. code:: bash

    pip install pyspellchecker

To build from source:

.. code:: bash

    git clone https://github.com/barrust/pyspellchecker.git
    cd pyspellchecker
    python -m build

For *python 2.7* support, install `release 0.5.6 <https://github.com/barrust/pyspellchecker/releases/tag/v0.5.6>`__
but note that no future updates will support *python 2*.

.. code:: bash

    pip install pyspellchecker==0.5.6


Quickstart
-------------------------------------------------------------------------------

After installation, using ``pyspellchecker`` should be fairly
straightforward:

.. code:: python

    from spellchecker import SpellChecker

    spell = SpellChecker()

    # find those words that may be misspelled
    misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

    for word in misspelled:
        # Get the one `most likely` answer
        print(spell.correction(word))

        # Get a list of `likely` options
        print(spell.candidates(word))


If the Word Frequency list is not to your liking, you can add additional
text to generate a more appropriate list for your use case.

.. code:: python

    from spellchecker import SpellChecker

    spell = SpellChecker()  # loads default word frequency list
    spell.word_frequency.load_text_file('./my_free_text_doc.txt')

    # if I just want to make sure some words are not flagged as misspelled
    spell.word_frequency.load_words(['microsoft', 'apple', 'google'])
    spell.known(['microsoft', 'google'])  # will return both now!


If the words that you wish to check are long, it is recommended to reduce the
`distance` to 1. This can be accomplished either when initializing the spell
check class or after the fact.

.. code:: python

    from spellchecker import SpellChecker

    spell = SpellChecker(distance=1)  # set at initialization

    # do some work on longer words

    spell.distance = 2  # set the distance parameter back to the default


Non-English Dictionaries
-------------------------------------------------------------------------------

``pyspellchecker`` includes several dictionaries as part of the default
package. Each is simple to select when initializing the spell checker:

.. code:: python

    from spellchecker import SpellChecker

    english = SpellChecker()  # the default is English (language='en')
    spanish = SpellChecker(language='es')  # use the Spanish Dictionary
    russian = SpellChecker(language='ru')  # use the Russian Dictionary
    arabic = SpellChecker(language='ar')   # use the Arabic Dictionary


The currently supported dictionaries are:

* English       - 'en'
* Spanish       - 'es'
* French        - 'fr'
* Portuguese    - 'pt'
* German        - 'de'
* Italian       - 'it'
* Russian       - 'ru'
* Arabic        - 'ar'
* Basque        - 'eu'
* Latvian       - 'lv'
* Dutch         - 'nl'

Dictionary Creation and Updating
-------------------------------------------------------------------------------

The creation of the dictionaries is, unfortunately, not an exact science. I have provided a script that, given a text file of sentences (in this case from
`OpenSubtitles <http://opus.nlpl.eu/OpenSubtitles2018.php>`__), will generate a word frequency list based on the words found within the text. The script then attempts to **clean up** the word frequency list by, for example, removing words with invalid characters (usually from other languages), removing low-count terms (likely misspellings), and enforcing rules where available (e.g., no more than one accent per word in Spanish). It then removes words that appear on a known exclude list, and adds words that are known to be missing or that were removed for having too low a frequency.

The script can be found here: ``scripts/build_dictionary.py``. The original word frequency list parsed from OpenSubtitles can be found in the ``scripts/data/`` folder along with each language's *include* and *exclude* text files.

Any help in updating and maintaining the dictionaries would be greatly appreciated. To help, start a
`discussion <https://github.com/barrust/pyspellchecker/discussions>`__ on GitHub or submit a pull request updating the include and exclude files.


Additional Methods
-------------------------------------------------------------------------------

`On-line documentation <http://pyspellchecker.readthedocs.io/en/latest/>`__ is available; below is a quick summary of some of the available functions:


``correction(word)``: Returns the most probable result for the
misspelled word

``candidates(word)``: Returns a set of possible candidates for the
misspelled word

``known([words])``: Returns those words that are in the word frequency
list

``unknown([words])``: Returns those words that are not in the frequency
list

``word_probability(word)``: The frequency of the given word out of all
words in the frequency list
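The ranking idea behind ``correction`` and ``word_probability`` can be sketched with a toy frequency list (an illustration only, not the library's internal code; the real frequency list is far larger):

.. code:: python

    from collections import Counter

    # Hypothetical tiny corpus; word counts stand in for the shipped frequency list.
    counts = Counter("the cat sat on the mat the cat".split())
    total = sum(counts.values())

    def word_probability(word):
        return counts[word] / total

    def correction(word, candidates):
        # Pick the candidate that occurs most often in the frequency list.
        return max(candidates, key=word_probability)

    print(word_probability('the'))              # 3/8 = 0.375
    print(correction('teh', ['the', 'ten']))    # 'the' (more frequent than 'ten')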

The following are less likely to be needed by the user but are available:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``edit_distance_1(word)``: Returns a set of all strings at a Levenshtein
Distance of one based on the alphabet of the selected language

``edit_distance_2(word)``: Returns a set of all strings at a Levenshtein
Distance of two based on the alphabet of the selected language


Credits
-------------------------------------------------------------------------------

* `Peter Norvig <https://norvig.com/spell-correct.html>`__ blog post on setting up a simple spell checking algorithm
* P Lison and J Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

            
