neuspellmyntra

Name: neuspellmyntra
Version: 1.0.0
Home page: https://github.com/neuspell/neuspell
Summary: NeuSpell: A Neural Spelling Correction Toolkit
Upload time: 2024-09-19 14:24:01
Maintainer: None
Docs URL: None
Author: Sai Muralidhar Jayanthi, Danish Pruthi, and Graham Neubig
Requires Python: >3.5
License: MIT
Keywords: transformer networks, neuspell, neural spelling correction, embedding, pytorch, nlp, deep learning
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.

<h1 align="center">
<p>NeuSpell: A Neural Spelling Correction Toolkit</p>
</h1>

# Contents

- [Installation & Quick Start](#Installation)
- Toolkit
    - [Introduction](#Introduction)
    - [Download Checkpoints](#Download-Checkpoints)
    - [Download Datasets](#Datasets)
    - [Demo Setup](#Demo-Setup)
    - [Text Noising](#Synthetic-data-creation)
- [Finetuning on custom data and creating new models](#Finetuning-on-custom-data-and-creating-new-models)
- [Applications](#Potential-applications-for-practitioners)
- [Additional Requirements](#Additional-requirements)

# Updates

### Latest

- April 2021:
    - APIs for creating synthetic data are now available for the English language.
      See [Synthetic data creation](#Synthetic-data-creation).
    - `neuspell` is now available through **pip**. See [Installation through pip](#Installation-through-pip)
    - Added support for different transformer-based models such as DistilBERT, XLM-RoBERTa, etc.
      See [Finetuning on custom data and creating new models](#Finetuning-on-custom-data-and-creating-new-models)
      section for more details.

### Previous

- March, 2021:
    - Codebase reformatted; bug fixes and issues addressed.
- November, 2020:
    - Neuspell's ```BERT``` pretrained model is now available as part of huggingface models
      as ```murali1996/bert-base-cased-spell-correction```. We provide an example code snippet
      at [./scripts/huggingface](./scripts/huggingface/huggingface-snippet-for-neuspell.py) for curious practitioners.
- September, 2020:
    - This work was accepted at EMNLP 2020 (system demonstrations).

# Installation

```bash
git clone https://github.com/neuspell/neuspell; cd neuspell
pip install -e .
```

To install extra requirements,

```bash
pip install -r extras-requirements.txt
```

or individually as:

```bash
pip install -e .[elmo]
pip install -e .[spacy]
```

NOTE: For _zsh_, use ".[elmo]" and ".[spacy]" instead

Additionally, ```spacy``` models can be downloaded as:

```bash
python -m spacy download en_core_web_sm
```

Then, download pretrained models of `neuspell` following [Download Checkpoints](#Download-Checkpoints)

Here is a quick-start code snippet showing how to use a checker model from Python.
See [test_neuspell_correctors.py](./tests/test_neuspell_correctors.py) for more usage patterns.

```python
import neuspell
from neuspell import available_checkers, BertChecker

""" see available checkers """
print(f"available checkers: {neuspell.available_checkers()}")
# → available checkers: ['BertsclstmChecker', 'CnnlstmChecker', 'NestedlstmChecker', 'SclstmChecker', 'SclstmbertChecker', 'BertChecker', 'SclstmelmoChecker', 'ElmosclstmChecker']

""" select spell checkers & load """
checker = BertChecker()
checker.from_pretrained()

""" spell correction """
checker.correct("I luk foward to receving your reply")
# → "I look forward to receiving your reply"
checker.correct_strings(["I luk foward to receving your reply", ])
# → ["I look forward to receiving your reply"]
checker.correct_from_file(src="noisy_texts.txt")
# → "Found 450 mistakes in 322 lines, total_lines=350"

""" evaluation of models """
checker.evaluate(clean_file="bea60k.txt", corrupt_file="bea60k.noise.txt")
# → data size: 63044
# → total inference time for this data is: 998.13 secs
# → total token count: 1032061
# → confusion table: corr2corr:940937, corr2incorr:21060,
#                    incorr2corr:55889, incorr2incorr:14175
# → accuracy is 96.58%
# → word correction rate is 79.76%
```

Alternatively, one can also select and load a spell checker differently as follows:

```python
from neuspell import SclstmChecker

checker = SclstmChecker()
checker = checker.add_("elmo", at="input")  # "elmo" or "bert", "input" or "output"
checker.from_pretrained()
```

This feature of adding an ELMO or BERT component is currently supported only for selected models.
See [List of neural models in the toolkit](#List-of-neural-models-in-the-toolkit) for details.

If interested, follow [Additional Requirements](#Additional-requirements) for installing the non-neural spell
checkers ```Aspell``` and ```Jamspell```.

### Installation through pip

```bash
pip install neuspell
```

In v1.0, the `allennlp` library, which is required for models containing ELMO, is not installed automatically. Hence, to
use those checkers, do a source install as described in [Installation & Quick Start](#Installation).

# Toolkit

### Introduction

NeuSpell is an open-source toolkit for context-sensitive spelling correction in English. The toolkit comprises 10
spell checkers, with evaluations on naturally occurring misspellings from multiple (publicly available) sources. To
make neural models for spell checking context dependent, (i) we train neural models using spelling errors in context,
synthetically constructed by reverse engineering isolated misspellings; and (ii) we use richer representations of the
context. This toolkit enables NLP practitioners to use our proposed and existing spelling correction systems, both via a
simple unified command line and a web interface. Among many potential applications, we demonstrate the utility
of our spell checkers in combating adversarial misspellings.

##### Live demo available at <http://neuspell.github.io/>

<p align="center">
    <br>
    <img src="https://github.com/neuspell/neuspell/blob/master/images/ui.png?raw=true" width="400"/>
    <br>
</p>

##### List of neural models in the toolkit:

- [```CNN-LSTM```](https://drive.google.com/file/d/14XiDY4BJ144fVGE2cfWfwyjnMwBcwhNa/view?usp=sharing)
- [```SC-LSTM```](https://drive.google.com/file/d/1OvbkdBXawnefQF1d-tUrd9lxiAH1ULtr/view?usp=sharing)
- [```Nested-LSTM```](https://drive.google.com/file/d/19ZhWvBaZqrsP5cGqBJdFPtufdyBqQprI/view?usp=sharing)
- [```BERT```](https://huggingface.co/transformers/bertology.html)
- [```SC-LSTM plus ELMO (at input)```](https://drive.google.com/file/d/1mjLFuQ0vWOOpPqTVkFZ_MSHiuVUmgHSK/view?usp=sharing)
- [```SC-LSTM plus ELMO (at output)```](https://drive.google.com/file/d/1P8vX9ByOBQpN9oeho_iOJmFJByv1ifI5/view?usp=sharing)
- [```SC-LSTM plus BERT (at input)```](https://huggingface.co/transformers/bertology.html)
- [```SC-LSTM plus BERT (at output)```](https://huggingface.co/transformers/bertology.html)

<p align="center">
    <br>
    <img src="https://github.com/neuspell/neuspell/blob/master/images/pipeline.jpeg?raw=true" width="400"/>
    <br>
    This pipeline corresponds to the `SC-LSTM plus ELMO (at input)` model.
</p>

##### Performances

| Spell<br>Checker    | Word<br>Correction <br>Rate | Time per<br>sentence <br>(in milliseconds) |
|-------------------------------------|-----------------------|--------------------------------------|
| ```Aspell```                        | 48.7                  | 7.3*                                 |
| ```Jamspell```                      | 68.9                  | 2.6*                                 |
| ```CNN-LSTM```                      | 75.8                  | 4.2                                  |
| ```SC-LSTM```                       | 76.7                  | 2.8                                  |
| ```Nested-LSTM```                   | 77.3                  | 6.4                                  |
| ```BERT```                          | 79.1                  | 7.1                                  |
| ```SC-LSTM plus ELMO (at input)```  |  79.8                 | 15.8                                 |
| ```SC-LSTM plus ELMO (at output)``` | 78.5                  | 16.3                                 |
| ```SC-LSTM plus BERT (at input)```  | 77.0                  | 6.7                                  |
| ```SC-LSTM plus BERT (at output)``` | 76.0                  | 7.2                                  |

Performance of different correctors in the NeuSpell toolkit on the  ```BEA-60K```  dataset with real-world spelling
mistakes. ∗ indicates evaluation on a CPU (for others we use a GeForce RTX 2080 Ti GPU).

### Download Checkpoints

To download a specific checkpoint, pick a **Checkpoint name** from the table below and run the download snippet that
follows. Each checkpoint is associated with a neural spell checker as shown in the table.

| Spell Checker                       | Class               | Checkpoint name             | Disk space (approx.) |
|-------------------------------------|---------------------|-----------------------------|----------------------|
| ```CNN-LSTM```                      | `CnnlstmChecker`    | 'cnn-lstm-probwordnoise'    | 450 MB               |
| ```SC-LSTM```                       | `SclstmChecker`     | 'scrnn-probwordnoise'       | 450 MB               |
| ```Nested-LSTM```                   | `NestedlstmChecker` | 'lstm-lstm-probwordnoise'   | 455 MB               |
| ```BERT```                          | `BertChecker`       | 'subwordbert-probwordnoise' | 740 MB               |
| ```SC-LSTM plus ELMO (at input)```  | `ElmosclstmChecker` | 'elmoscrnn-probwordnoise'   | 840 MB               |
| ```SC-LSTM plus BERT (at input)```  | `BertsclstmChecker` | 'bertscrnn-probwordnoise'   | 900 MB               |
| ```SC-LSTM plus BERT (at output)``` | `SclstmbertChecker` | 'scrnnbert-probwordnoise'   | 1.19 GB              |
| ```SC-LSTM plus ELMO (at output)``` | `SclstmelmoChecker` | 'scrnnelmo-probwordnoise'   | 1.23 GB              |

```python
import neuspell

neuspell.seq_modeling.downloads.download_pretrained_model("subwordbert-probwordnoise")
```

Alternatively, download all Neuspell neural models by running the following (available in versions after v1.0):

```python
import neuspell

neuspell.seq_modeling.downloads.download_pretrained_model("_all_")
```
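
Once a checkpoint is downloaded, the corresponding checker class from the table above can load it. A minimal sketch, assuming (as in the quick-start snippet) that each class's `from_pretrained()` with no arguments resolves its default checkpoint:

```python
from neuspell import CnnlstmChecker

# CnnlstmChecker corresponds to the 'cnn-lstm-probwordnoise' checkpoint in the table above.
checker = CnnlstmChecker()
checker.from_pretrained()  # loads the downloaded default checkpoint for this class
print(checker.correct("I luk foward to receving your reply"))
```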

### Datasets

We curate several synthetic and natural datasets for training/evaluating neuspell models. For full details, check
our [paper](#Citation). Run the following to download all the datasets.

```bash
cd data/traintest
python download_datafiles.py 
```

See ```data/traintest/README.md``` for more details.

Train files are suffixed with ```.random```, ```.word```, ```.prob```, ```.probword``` for the different noising
strategies used to create them. For each strategy (see [Synthetic data creation](#Synthetic-data-creation)), we noise
∼20% of the tokens in the clean corpus. We use 1.6 million sentences from
the [```One billion word benchmark```](https://arxiv.org/abs/1312.3005) dataset as our clean corpus.
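
As a worked example, a downloaded train file pair can be loaded with the same helper used in the finetuning section below. The file names here are hypothetical illustrations of the suffix scheme described above; see `data/traintest/README.md` for the actual names.

```python
from neuspell.commons import DEFAULT_TRAINTEST_DATA_PATH
from neuspell.seq_modeling.helpers import load_data

# Hypothetical names: a clean training corpus and its ".probword"-noised counterpart.
data_dir = DEFAULT_TRAINTEST_DATA_PATH
train_data = load_data(data_dir, "train.clean.txt", "train.clean.txt.noise.probword")
print(f"loaded {len(train_data)} parallel sentence pairs")
```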

### Demo Setup

In order to set up a demo, follow these steps:

- Do [Installation](#Installation) and then install flask requirements as  ```pip install -e ".[flask]"```
- Download [checkpoints](#Download-Checkpoints) (__Note__: If you wish to use only one of the neural checkers, you need to
  manually disable the others in the imports of [./scripts/flask-server/app.py](./scripts/flask-server/app.py))
- Start a flask server in folder [./scripts/flask-server](./scripts/flask-server) by
  running `CUDA_VISIBLE_DEVICES=0 python app.py`
  (on GPU) or `python app.py` (on CPU)

### Synthetic data creation

##### English

This toolkit offers 3 kinds of noising strategies (identified from existing literature) to generate synthetic parallel
training data for training neural spell-correction models. The strategies include a simple lookup-based noisy spelling
replacement (`en-word-replacement-noise`), character-level noise induction such as swapping/deleting/adding/replacing
characters (`en-char-replacement-noise`), and a confusion-matrix-based probabilistic character replacement driven by
mistake patterns in a large corpus of spelling mistakes (`en-probchar-replacement-noise`). For full details about these
approaches, check out our [paper](#Citation).

The table below maps each noising strategy to its corresponding class. Since some noisers rely on pre-built data
files, we also list their approximate disk space.

| Folder                          | Class name                                | Disk space (approx.) |
|---------------------------------|-------------------------------------------|----------------------|
| `en-word-replacement-noise`     | `WordReplacementNoiser`                   | 2 MB                 |
| `en-char-replacement-noise`     | `CharacterReplacementNoiser`              | --                   |
| `en-probchar-replacement-noise` | `ProbabilisticCharacterReplacementNoiser` | 80 MB                |

Following is a snippet for using these noisers:

```python
from neuspell.noising import WordReplacementNoiser

example_texts = [
    "This is an example sentence to demonstrate noising in the neuspell repository.",
    "Here is another such amazing example !!"
]

word_repl_noiser = WordReplacementNoiser(language="english")
word_repl_noiser.load_resources()
noise_texts = word_repl_noiser.noise(example_texts)
print(noise_texts)
```
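
The other two noisers from the table follow the same `load_resources()` / `noise()` pattern; a minimal sketch, assuming their constructors mirror `WordReplacementNoiser`:

```python
from neuspell.noising import CharacterReplacementNoiser, ProbabilisticCharacterReplacementNoiser

example_texts = ["This is an example sentence to demonstrate noising in the neuspell repository."]

# Character-level swap/delete/add/replace noise
char_noiser = CharacterReplacementNoiser(language="english")
char_noiser.load_resources()
print(char_noiser.noise(example_texts))

# Confusion-matrix-driven probabilistic character replacement
prob_noiser = ProbabilisticCharacterReplacementNoiser(language="english")
prob_noiser.load_resources()
print(prob_noiser.noise(example_texts))
```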

##### Other languages

```
Coming Soon ...
```

# Finetuning on custom data and creating new models

### Finetuning on top of `neuspell` pretrained models

```python
from neuspell import BertChecker

checker = BertChecker()
checker.from_pretrained()
checker.finetune(clean_file="sample_clean.txt", corrupt_file="sample_corrupt.txt", data_dir="default")
```

This feature is only available for `BertChecker` and `ElmosclstmChecker`.
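
The same finetuning call works for `ElmosclstmChecker`; a sketch, assuming its interface mirrors the `BertChecker` example above and that the ELMO extras from [Installation](#Installation) are installed:

```python
from neuspell import ElmosclstmChecker

checker = ElmosclstmChecker()
checker.from_pretrained()
checker.finetune(clean_file="sample_clean.txt", corrupt_file="sample_corrupt.txt", data_dir="default")
```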

### Training other Transformers/BERT-based models

We now support initializing a huggingface model and finetuning it on your custom data. Here are code snippets
demonstrating that.

First, specify your files containing clean and corrupt texts in a line-separated format:

```python
from neuspell.commons import DEFAULT_TRAINTEST_DATA_PATH

data_dir = DEFAULT_TRAINTEST_DATA_PATH
clean_file = "sample_clean.txt"
corrupt_file = "sample_corrupt.txt"
```

```python
from neuspell.seq_modeling.helpers import load_data, train_validation_split
from neuspell.seq_modeling.helpers import get_tokens
from neuspell import BertChecker

# Step-0: Load your train and test files, create a validation split
train_data = load_data(data_dir, clean_file, corrupt_file)
train_data, valid_data = train_validation_split(train_data, 0.8, seed=11690)

# Step-1: Create vocab file. This serves as the target vocab file and we use the defined model's default huggingface
# tokenizer to tokenize inputs appropriately.
vocab = get_tokens([i[0] for i in train_data], keep_simple=True, min_max_freq=(1, float("inf")), topk=100000)

# Step-2: Initialize a model
checker = BertChecker(device="cuda")
checker.from_huggingface(bert_pretrained_name_or_path="distilbert-base-cased", vocab=vocab)

# Step-3: Finetune the model on your dataset
checker.finetune(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)
```

You can further evaluate your model on custom data as follows:

```python
from neuspell import BertChecker

checker = BertChecker()
checker.from_pretrained(
    bert_pretrained_name_or_path="distilbert-base-cased",
    ckpt_path=f"{data_dir}/new_models/distilbert-base-cased"  # "<folder where the model is saved>"
)
checker.evaluate(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)
```

### Multilingual Models

Following the usage above, one can now seamlessly utilize multilingual models such as `xlm-roberta-base`,
`bert-base-multilingual-cased` and `distilbert-base-multilingual-cased` on a non-English script.
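
For example, only the `bert_pretrained_name_or_path` argument changes; a sketch, assuming `vocab`, `clean_file`, `corrupt_file` and `data_dir` are prepared from your non-English corpus exactly as in the steps above:

```python
from neuspell import BertChecker

# vocab, clean_file, corrupt_file and data_dir are built from the non-English
# corpus following Step-0 and Step-1 above.
checker = BertChecker(device="cuda")
checker.from_huggingface(bert_pretrained_name_or_path="xlm-roberta-base", vocab=vocab)
checker.finetune(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)
```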

# Potential applications for practitioners

- Defenses against adversarial attacks in NLP
    - example implementation available in folder ```./applications/Adversarial-Misspellings-arxiv```.
      See [README.md](./applications/README.md).
- Improving OCR text correction systems
- Improving grammatical error correction systems
- Improving Intent/Domain classifiers in conversational AI
- Spell Checking in Collaboration and Productivity tools

# Additional requirements

Requirements for ```Aspell``` checker:

```bash
wget https://files.pythonhosted.org/packages/53/30/d995126fe8c4800f7a9b31aa0e7e5b2896f5f84db4b7513df746b2a286da/aspell-python-py3-1.15.tar.bz2
tar -C . -xvf aspell-python-py3-1.15.tar.bz2
cd aspell-python-py3-1.15
python setup.py install
```

Requirements for ```Jamspell``` checker:

```bash
sudo apt-get install -y swig3.0
wget -P ./ https://github.com/bakwc/JamSpell-models/raw/master/en.tar.gz
tar xf ./en.tar.gz --directory ./
```
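
After installation, here is a quick sanity check that the bindings work. This sketch uses the upstream `aspell-python` and `jamspell` Python APIs directly (not NeuSpell classes), and assumes the `jamspell` package is installed (e.g. via pip, which is why swig is needed) and that `en.tar.gz` extracted to an `en.bin` model file:

```python
import aspell
import jamspell

# aspell-python: open an English speller and ask for suggestions
speller = aspell.Speller("lang", "en")
print(speller.suggest("receving"))

# jamspell: load the downloaded English model and fix a noisy fragment
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel("en.bin")  # path to the model extracted from en.tar.gz
print(corrector.FixFragment("I luk foward to receving your reply"))
```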

# Citation

```
@inproceedings{jayanthi-etal-2020-neuspell,
    title = "{N}eu{S}pell: A Neural Spelling Correction Toolkit",
    author = "Jayanthi, Sai Muralidhar  and
      Pruthi, Danish  and
      Neubig, Graham",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = oct,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-demos.21",
    doi = "10.18653/v1/2020.emnlp-demos.21",
    pages = "158--164",
    abstract = "We introduce NeuSpell, an open-source toolkit for spelling correction in English. Our toolkit comprises ten different models, and benchmarks them on naturally occurring misspellings from multiple sources. We find that many systems do not adequately leverage the context around the misspelt token. To remedy this, (i) we train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings; and (ii) use richer representations of the context. By training on our synthetic examples, correction rates improve by 9{\%} (absolute) compared to the case when models are trained on randomly sampled character perturbations. Using richer contextual representations boosts the correction rate by another 3{\%}. Our toolkit enables practitioners to use our proposed and existing spelling correction systems, both via a simple unified command line, as well as a web interface. Among many potential applications, we demonstrate the utility of our spell-checkers in combating adversarial misspellings. The toolkit can be accessed at neuspell.github.io.",
}
```

[Link](https://www.aclweb.org/anthology/2020.emnlp-demos.21/) to the publication. For any questions or suggestions, please
contact the authors at jsaimurali001 [at] gmail [dot] com.

            
