sage-spelling

Name	sage-spelling JSON
Version	1.1.0 JSON
	download
home_page	https://github.com/orgs/ai-forever/sage
Summary	SAGE: Spell checking via Augmentation and Generative distribution Emulation
upload_time	2024-07-21 11:44:46
maintainer	None
docs_url	None
author	Nikita Martynov, Mark Baushenko, Alena Fenogenova and Alexandr Abramov
requires_python	<=3.12.3,>=3.8.0
license	MIT
keywords	sage spelling correction nlp deep learning transformers pytorch
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            <p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="images/sage-white.svg">
    <source media="(prefers-color-scheme: light)" srcset="images/sage-black.svg">
    <img alt="SAGE" src="images/sage-white.svg" style="max-width: 100%;">
  </picture>
</p>

<p align="center">
    <a href="https://opensource.org/licenses/MIT">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-yellow.svg">
    </a>
    <a href="https://github.com/ai-forever/sage/releases">
    <img alt="Release" src="https://badgen.net/badge/release/v1.1.0/">
    </a>
    <a href="https://arxiv.org/abs/2308.09435">
    <img alt="Paper" src="https://img.shields.io/badge/arXiv-2308.09435-red">
    </a>
    <a href='https://sage-documentation-main.readthedocs.io/en/latest/?badge=latest'>
    <img src='https://readthedocs.org/projects/sage-documentation-main/badge/?version=latest' alt='Documentation Status' />
    </a>
</p>

<h2 align="center">
    <p> Spelling correction, corruption and evaluation for multiple languages
</p>
</h2>

<div align="center">
  <h4>
    <a href="#installation">Install</a> |
    <a href="#spelling-correction">Models</a> |
    <a href="#evaluation">Evaluation</a> |
    <a href="#statistic-based-spelling-corruption-sbsc">SBSC</a> |
    <a href="#augmentex">Augmentex</a> |
    <a href="#citation">Papers</a>
  </h4>
</div>

SAGE (Spell checking via Augmentation and Generative distribution Emulation) is 
a complete solution that you need when working on a spelling problem:

- 💯 Spelling correction with State-of-the-art pre-trained 🤗Transformer models:
  - [sage-fredt5-large](https://huggingface.co/ai-forever/sage-fredt5-large)
  - [sage-fredt5-distilled-95m](https://huggingface.co/ai-forever/sage-fredt5-distilled-95m)
  - [sage-mt5-large](https://huggingface.co/ai-forever/sage-mt5-large)
  - [sage-m2m100-1.2B](https://huggingface.co/ai-forever/sage-m2m100-1.2B)
  - [T5-large](https://huggingface.co/ai-forever/T5-large-spell) 
  - [M2M100-1.2B](https://huggingface.co/ai-forever/RuM2M100-1.2B) [Earlier release]
  - [M2M100-418M](https://huggingface.co/ai-forever/RuM2M100-418M) [Earlier release]
  - [FredT5-large](https://huggingface.co/ai-forever/FRED-T5-large-spell) [Earlier release]

  You can test them out right here [![Try Model Generation In Colab!](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ai-forever/sage/blob/main/notebooks/text_correction_demo.ipynb)
- 🧩 Augment your data with spelling corruption algorithms, take a look at a quick demo [![Try Model Generation In Colab!](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ai-forever/sage/blob/main/notebooks/text_corruption_demo.ipynb)
- 📊 Evaluate performance of spelling correction tools.

## News
🔥 **[2024-01-18]**: Our paper *"A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages"* is accepted for EACL 2024 conference!

💥 **[2024-04-11]**: SAGE v1.1.0 is finally out: a comprehensive note about the details of release can be found [here](https://habr.com/ru/companies/sberdevices/articles/806897/). 

## Table of contents

- [Installation](#installation)
  - [Regular install](#regular-install)
  - [Editable install](#editable-install)
- [Quick demo](#quick-demo)
- [Spelling corruption](#spelling-corruption)
  - [Statistic-based Spelling Corruption (SBSC)](#statistic-based-spelling-corruption-sbsc)
  - [Augmentex](#augmentex)
- [Spelling correction](#spelling-correction)
  - [RUSpellRU evaluation](#ruspellru-evaluation)
  - [MultidomainGold evaluation](#multidomaingold-evaluation)
  - [MedSpellchecker evaluation](#medspellchecker-evaluation)
  - [GitHubTypoCorpusRu evaluation](#githubtypocorpusru-evaluation)
- [Evaluation](#evaluation)
- [Citation](#citation)

## Installation
### Regular install
```commandline
git clone https://github.com/ai-forever/sage.git
cd sage
pip install .
```

To install extra requirements that you are going to need when working with ERRANT-based metric run

```commandline
python -m spacy download ru_core_news_lg
pip install -e .[errant]
```

### Editable install
```commandline
git clone https://github.com/ai-forever/sage.git
cd sage
pip install -e .
```
and proceed with extra requirements install as above. 

## Quick demo
Lets spoil some text:
```python
import sage
from sage.spelling_corruption import SBSCConfig, SBSCCorruptor
from sage.utils import DatasetsAvailable

text = "Заметьте, не я это предложил!"

# Instantiate SBSC corruptor from a dataset with errors in medical anamnesis
config = SBSCConfig(
    reference_dataset_name_or_path=DatasetsAvailable.MedSpellchecker.name,
    reference_dataset_split="test"
)
corruptor = SBSCCorruptor.from_config(config)

corruptor.corrupt(text, seed=1)
# 'Заиетьте, не я эт о пред ложил!'
```
... now with Augmentex:

```python
import sage
from sage.spelling_corruption import WordAugConfig, WordAugCorruptor

text = "Заметьте, не я это предложил!"

# Instantiate WordAugCorruptor corruptor with a custom set of parameters
config = WordAugConfig(
    min_aug=1,
    max_aug=5,
    unit_prob=0.4,
)
corruptor = WordAugCorruptor.from_config(config)

corruptor.corrupt(text, seed=1)
# 'это не предложил! Заметьте, я'
```

... or for the English language:

```python
import os
from sage.spelling_corruption import SBSCConfig, SBSCCorruptor

text = "Screw you guys, I am going home. (c)"

# Instantiate SBSC corruptor from a JFLEG dataset
config = SBSCConfig(
    lang="en",
    reference_dataset_name_or_path=os.path.join("data", "example_data", "jfleg"),
)
corruptor = SBSCCorruptor.from_config(config)

corruptor.corrupt(text, seed=1)
# 'Screw you kuys, I am going home. (c)'
```

Now we can use our models to restore the initial text back:
```python
from sage.spelling_correction import AvailableCorrectors
from sage.spelling_correction import RuM2M100ModelForSpellingCorrection, T5ModelForSpellingCorruption

text_ru = "Замтьте не я это предложил"
text_en = "Screw you kuys, I am going home. (c)"

corrector_fred = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_large.value)
corrector_m2m = RuM2M100ModelForSpellingCorrection.from_pretrained(AvailableCorrectors.m2m100_1B.value)
corrector_en = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.ent5_large.value)

print(corrector_fred.correct(text_ru))
# ['Заметьте, не я это предложил.']

print(corrector_m2m.correct(text_ru))
# ['Заметьте не я это предложил']

print(corrector_en.correct(text_en, prefix="grammar: "))
# ['Screw you guys, I am going home. (c)']
```

Evaluate performance of the models on open benchmarks for spelling correction:
```python
import os
import torch
from sage.utils import DatasetsAvailable
from sage.spelling_correction import AvailableCorrectors
from sage.spelling_correction import T5ModelForSpellingCorruption

corrector_fred_95m = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_distilled_95m.value)
corrector_mt5 = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_mt5_large.value)

corrector_fred_95m.model.to(torch.device("cuda:0"))
corrector_mt5.model.to(torch.device("cuda:0"))

metrics = corrector_fred_95m.evaluate("RUSpellRU", metrics=["errant", "ruspelleval"], batch_size=32)
print(metrics)
# {'CASE_Precision': 94.41, 'CASE_Recall': 92.55, 'CASE_F1': 93.47, 'SPELL_Precision': 77.52, 'SPELL_Recall': 64.09, 'SPELL_F1': 70.17, 'PUNCT_Precision': 86.77, 'PUNCT_Recall': 80.59, 'PUNCT_F1': 83.56, 'YO_Precision': 46.21, 'YO_Recall': 73.83, 'YO_F1': 56.84, 'Precision': 83.48, 'Recall': 74.75, 'F1': 78.87}

metrics = corrector_mt5.evaluate("/content/sage/data/example_data/jfleg", metrics=["ruspelleval"], batch_size=16)
print(metrics)
# {'Precision': 75.94, 'Recall': 88.15, 'F1': 81.59}

```

_NOTE_: if you are launching code snippet in Colab you'd probably end up with MEMORY ERROR, so manage evaluation 
procedures so that you meet available device's restrictions. As a feasible workaround you can execute 
```python
del corrector_fred_95m.model
```
to free some space. 

## Spelling Corruption
We implemented two methods for spelling corruption. **S**tatistic-**b**ased **S**pelling **C**orruption (**SBSC**) aims 
to mimic human behaviour when making an error. While [Augmentex](#augmentex) relies on rule-based heuristics and common
errors and mistypings especially those committed while typing text on a keyboard. 

🚀 Both methods proved their effectiveness for spelling correction systems and celebrated substantial **performance gains**
fully reported in our [Paper](https://aclanthology.org/2024.findings-eacl.10/).

### Statistic-based Spelling Corruption (SBSC)
This method is thoroughly described in our another [Paper](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf) 
and in this 🗣️[Talk](https://youtu.be/yFfkV0Qjuu0?si=XmKfocCSLnKihxS_). 

Briefly, SBSC follows two simple steps:
- 🧠 Analyze errors, their type and positions in a source text;
- ✏️ Reproduce errors from the source text in a new sentence;

🧠 To analyze errors in a source sentence we need its corresponding correction in order to build 
[Levenshtein matrix](https://en.wikipedia.org/wiki/Levenshtein_distance), traverse it back starting from the 
bottom right entry and determine the exact position and type of an error. We then aggregate all obtained statistics and 
normalize it to valid discrete distributions. 

✏️ "Reproduce" step is even less complicated: we just sample number of errors per sentence, their types and relative
positions from corresponding distributions and apply them to a correct sentence.

As stated, you need a parallel dataset to "fit" SBSC. We provide a set of four datasets with natural errors covering
exhaustive range of domains:

- **RUSpellRU**: texts collected from [LiveJournal](https://www.livejournal.com/media), with manually corrected typos and errors;
- **MultidomainGold**: examples from 7 text sources, including the open web, news, social media, reviews, subtitles, policy documents and literary works;
- **MedSpellChecker**: texts with errors from medical anamnesis;
- **GitHubTypoCorpusRu**: spelling errors and typos in commits from GitHub;

You can use them as simple as
```python
import sage
from sage.spelling_corruption import SBSCConfig, SBSCCorruptor
from sage.utils import DatasetsAvailable

# Instantiate SBSC corruptor from a dataset with errors in medical anamnesis
config = SBSCConfig(
    reference_dataset_name_or_path=DatasetsAvailable.MedSpellchecker.name,
    reference_dataset_split="test"
)
corruptor = SBSCCorruptor.from_config(config)
```

... or you can initialize your SBSC from locally stored dataset:
```python
import os
from sage.spelling_corruption import SBSCConfig, SBSCCorruptor

# Instantiate SBSC corruptor from a JFLEG dataset
config = SBSCConfig(
    lang="en",
    reference_dataset_name_or_path=os.path.join("data", "example_data", "jfleg"),
)
corruptor = SBSCCorruptor.from_config(config)
```

✅ To check how good SBSC actually approximates original errors, you can plot side-by-side graphs of original and 
synthetically generated distributions:

<p align="center">
    <br>
    <img src="images/ruspellru_side_by_side.jpg" width="400" style="float:center; padding-right:60px"/> 
    <img src="images/bea60k_side_by_side.jpg" width="400" style="float:center; padding-left:60px"/>
    <br>
<p>

To access these graphs you can simply
```python
from sage.utils import load_available_dataset_from_hf, draw_and_save_errors_distributions_comparison_charts
from sage.spelling_corruption.sbsc.labeler import process_mistypings
from sage.spelling_corruption import SBSCCorruptor

sources, corrections = load_available_dataset_from_hf("RUSpellRU", for_labeler=True, split="train")
ruspellru_stats, ruspellru_confusion_matrix, ruspellru_typos_cnt = process_mistypings(sources, corrections)

corruptor = SBSCCorruptor.from_default_config()
spoiled_sentences = corruptor.batch_corrupt(corrections)

sbsc_stats, sbsc_confusion_matrix, sbsc_typos_cnt = process_mistypings(spoiled_sentences, corrections)

draw_and_save_errors_distributions_comparison_charts(
    actual_typos_cnt = sbsc_typos_cnt,
    reference_typos_cnt=ruspellru_typos_cnt,
    actual_stats=sbsc_stats,
    reference_stats=ruspellru_stats,
    path_to_save="ruspellru_sbsc.jpg"
)
```

### Augmentex
Augmentex introduces rule-based and common statistic (empowered by [KartaSlov](https://kartaslov.ru) project) 
approach to insert errors in text. It is fully described again in the [Paper](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf)
and in this 🗣️[Talk](https://youtu.be/yFfkV0Qjuu0?si=XmKfocCSLnKihxS_).

🖇️ Augmentex allows you to operate on two levels of granularity when it comes to text corruption and offers you sets of 
specific methods suited for particular level:
- **Word level**:
  - _replace_ - replace a random word with its incorrect counterpart;
  - _delete_ - delete random word;
  - _swap_ - swap two random words;
  - _stopword_ - add random words from stop-list;
  - _reverse_ - change a case of the first letter of a random word;
- **Character level**:
  - _shift_ - randomly swaps upper / lower case in a string;
  - _orfo_ - substitute correct characters with their common incorrect counterparts;
  - _typo_ - substitute correct characters as if they are mistyped on a keyboard;
  - _delete_ - delete random character;
  - _multiply_ - multiply random character;
  - _swap_ - swap two adjacent characters;
  - _insert_ - insert random character;

To access Augmentex you only need these few manipulations:
```python
from sage.spelling_corruption import CharAugConfig, CharAugCorruptor

config = CharAugConfig(
    unit_prob=0.3, # proportion of characters that is going to undergo edits
    min_aug=1, # minimum number of edits
    max_aug=5, # maximum number of edits 
    mult_num=3 # `multiply` edit
)
corruptor = CharAugCorruptor.from_config(config)
```

... or like this:

```python
from sage.spelling_corruption import WordAugConfig, WordAugCorruptor

config = WordAugConfig(
    unit_prob=0.4, # proportion of characters that is going to undergo edits
    min_aug=1, # minimum number of edits
    max_aug=5, # maximum number of edits 
)
corruptor = WordAugCorruptor.from_config(config)
```

Augmentex has been created by our fellow team, the project has its own [repo](https://github.com/ai-forever/augmentex), do not forget to take a look! 

## Spelling Correction
Our methodology for obtaining model with optimal performance on spellchecking task is thoroughly described in our
[Paper](https://aclanthology.org/2024.findings-eacl.10/). And the algorithm is simple and generally consists of two steps:

- Pre-train model on extensive parallel corpus with synthetically generated errors;
- Fine-tune on combinations of available datasets for spelling correction with "human-made" errors;

We use [Augmentex](#augmentex) and [SBSC](#statistic-based-spelling-corruption-sbsc) for both generating large synthetic corpora and augmenting datasets with natural errors. 
The family of pre-trained correctors now amounts for 8 models.

We've 6 🤗Transformer models for Russian 🇷🇺:
  - [sage-fredt5-large](https://huggingface.co/ai-forever/sage-fredt5-large)
  - [sage-fredt5-distilled-95m](https://huggingface.co/ai-forever/sage-fredt5-distilled-95m)
  - [sage-m2m100-1.2B](https://huggingface.co/ai-forever/sage-m2m100-1.2B)
  - [M2M100-1.2B](https://huggingface.co/ai-forever/RuM2M100-1.2B) [Earlier release]
  - [M2M100-418M](https://huggingface.co/ai-forever/RuM2M100-418M) [Earlier release]
  - [FredT5-large](https://huggingface.co/ai-forever/FRED-T5-large-spell) [Earlier release]

And two models for English 🇬🇧:
  - [T5-large](https://huggingface.co/ai-forever/T5-large-spell)
  - [sage-mt5-large](https://huggingface.co/ai-forever/sage-mt5-large)

Models for the Russian language have been pre-trained on combination of Russian Wikipedia and videos transcriptions with 
artificial errors generated by [SBSC](#statistic-based-spelling-corruption-sbsc) on statistics gathered from train split of [RUSpellRU](https://huggingface.co/datasets/ai-forever/spellcheck_benchmark). 
Correctors for English trained on mixture of English Wikipedia articles and news posts with synthetic errors inserted by [SBSC](#statistic-based-spelling-corruption-sbsc) fitted on statistics from 5k subsample
of [BEA60k](https://github.com/neuspell/neuspell/tree/master).

📚 We also validate our solutions on available datasets with "human-made" errors:

- **RUSpellRU**: texts collected from [LiveJournal](https://www.livejournal.com/media), with manually corrected typos and errors;
- **MultidomainGold**: examples from 7 text sources, including the open web, news, social media, reviews, subtitles, policy documents and literary works;
- **MedSpellChecker**: texts with errors from medical anamnesis;
- **GitHubTypoCorpusRu**: spelling errors and typos in commits from GitHub;
- **BEA60K**: English spelling errors collected from several domains;
- **JFLEG**: 1601 sentences in English, which contain about 2 thousand spelling errors;

📈 Here we report evaluation of some setups:
- Zero-shot evaluation of pre-trained checkpoints;
- Additional fine-tuning (ft.) on the target dataset;

Full list of setups and corresponding performances are in the [Paper](https://aclanthology.org/2024.findings-eacl.10/).

**RUSpellRU**, **MultidomainGold**, **MedSpellChecker** and **GitHubTypoCorpusRu** come from [spellcheck_punctuation_benchmark](https://huggingface.co/datasets/ai-forever/spellcheck_punctuation_benchmark).
The benchmark accounts for both punctuation and spelling errors. For the simplicity and better representativeness we report results only for those models
([sage-fredt5-large](https://huggingface.co/ai-forever/sage-fredt5-large), [sage-fredt5-distilled-95m](https://huggingface.co/ai-forever/sage-fredt5-distilled-95m)) that deal with both types of errors (the Russian language).
The detailed metrics for other checkpoints can be found either in the [Paper](https://aclanthology.org/2024.findings-eacl.10/), [post](ссылка на новый хабр) or corresponding model card. 

_NOTE:_ **MedSpellChecker** and **GitHubTypoCorpusRu** do not have train split, so their performance on 
**Pre-train + fine-tune** setup is reported as a result of fine-tuning on combination of **RUSpellRU** and **MultidomainGold**
datasets.

#### RUSpellRU Evaluation

| Model                      | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |
|----------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| sage-ai-service            | 90.3 | 86.3 | 88.2 | 90.3 | 86.6 | 88.4 | 95.2 | 95.9 | 95.6 |
| sage-fredt5-large          | 57.3 | 68.0 | 62.2 | 86.7 | 46.1 | 60.2 | 92.1 | 67.8 | 78.1 |
| sage-fredt5-large (ft.)    | 88.4 | 80.9 | 84.5 | 88.2 | 85.3 | 86.8 | 95.5 | 94.0 | 94.7 |
| sage-fredt5-distilled-95m (ft.)  | 83.5 | 74.8 | 78.9 | 86.8 | 80.6 | 83.6 | 94.4 | 92.5 | 93.5 |
| gpt-3.5-turbo              | 33.6 | 58.5 | 42.7 | 85.9 | 64.6 | 73.7 | 84.9 | 73.9 | 79.0 |
| gpt-4                      | 54.9 | 76.7 | 64.0 | 84.0 | 82.3 | 83.2 | 91.5 | 90.2 | 90.9 |


#### MultidomainGold Evaluation

| Model                           | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |
|---------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| sage-ai-service                 | 81.6 | 77.7 | 79.6 | 70.2 | 67.5 | 68.8 | 80.5 | 80.5 | 80.5 |
| sage-fredt5-large               | 43.4 | 49.7 | 46.3 | 21.8 | 21.3 | 21.6 | 58.8 | 23.9 | 34.0 |
| sage-fredt5-large (ft.)         | 80.3 | 75.1 | 77.6 | 69.0 | 66.5 | 67.7 | 78.6 | 80.0 | 79.3 |
| sage-fredt5-distilled-95m (ft.) | 77.2 | 69.9 | 73.4 | 66.8 | 63.4 | 65.0 | 76.8 | 79.1 | 77.9 |
| gpt-3.5-turbo                   | 18.8 | 48.1 | 27.1 | 42.0 | 31.8 | 36.2 | 47.1 | 51.3 | 49.1 |
| gpt-4                           | 25.4 | 68.0 | 37.0 | 57.8 | 54.3 | 56.0 | 54.0 | 67.5 | 60.0 |



#### MedSpellchecker Evaluation

| Model                           | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |
|---------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| sage-ai-service                 | 71.3 | 73.5 | 72.4 | 75.1 | 69.2 | 72.0 | 80.9 | 72.8 | 76.6|
| sage-fredt5-large               | 35.2 | 54.5 | 42.8 | 19.2 | 13.2 | 15.7 | 48.7 | 36.8 | 41.9 |
| sage-fredt5-large (ft.)         | 72.5 | 72.2 | 72.3 | 74.6 | 66.4 | 70.3 | 79.3 | 85.1 | 82.1 |
| sage-fredt5-distilled-95m (ft.) | 65.1 | 64.8 | 64.9 | 78.6 | 63.1 | 70.0 | 63.5 | 74.7 | 68.7 |
| gpt-3.5-turbo                   | 14.7 | 45.9 | 22.3 | 69.9 | 52.3 | 59.8 | 26.4 | 41.8 | 32.3 |
| gpt-4                           | 37.8 | 72.3 | 49.6 | 81.4 | 64.3 | 71.9 | 73.0 | 62.1 | 67.1 |


#### GitHubTypoCorpusRu Evaluation

| Model                           | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |
|---------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| sage-ai-service                 | 70.8 | 56.3 | 62.7 | 48.9 | 35.8 | 41.4 | 32.9 | 45.3 | 38.1|
| sage-fredt5-large               | 46.0 | 46.6 | 46.3 | 22.7 | 18.3 | 20.2 | 12.0 | 13.2 | 12.6 |
| sage-fredt5-large (ft.)         | 67.5 | 53.2 | 59.5 | 48.5  | 38.0 | 42.6 | 37.3 | 50.0 | 42.7 |
| sage-fredt5-distilled-95m (ft.) | 57.8 | 48.5 | 52.7 | 45.2 | 39.5 | 42.1 | 29.9 | 46.2 | 36.3 |
| gpt-3.5-turbo                   | 23.7 | 38.7 | 29.4 | 37.6 | 23.3 | 28.7 | 19.6 | 35.9 | 25.3 |
| gpt-4                           | 27.0 | 52.8 | 35.7 | 45.9 | 32.6 | 38.2 | 25.7 | 36.8 | 30.2 |


#### BEA60K Evaluation

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| sage-mt5-large | 64.7 | 83.8 | 73.0 | 
| T5-large-spell | 66.5 | 83.1 | 73.9 |
| gpt-3.5-turbo |  66.9 | 84.1 | 74.5 | 
| gpt-4 | 68.6 | 85.2 | 76.0 | 
| [Bert](https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |
| [SC-LSTM](https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |


#### JFLEG Evaluation

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| sage-mt5-large | 74.9 | 88.4 | 81.1 |
| T5-large-spell | 83.4 | 84.3 | 83.8 |
| gpt-3.5-turbo |  77.8 | 88.6 | 82.9 | 
| gpt-4 | 77.9 | 88.3 | 82.8 |
| [Bert](https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |
| [SC-LSTM](https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |

**RUSpellRU**, **MultidomainGold**, **MedSpellChecker** and **GitHubTypoCorpusRu** are available as HuggingFace datasets [here](https://huggingface.co/datasets/ai-forever/spellcheck_punctuation_benchmark) and through the API of our library: 
```python
from sage.utils import load_available_dataset_from_hf, DatasetsAvailable

print([dataset.name for dataset in DatasetsAvailable])
# ['MultidomainGold', 'RUSpellRU', 'MedSpellchecker', 'GitHubTypoCorpusRu', 'MultidomainGold_orth', 'RUSpellRU_orth', 'MedSpellchecker_orth', 'GitHubTypoCorpusRu_orth']

gold_dataset = load_available_dataset_from_hf(DatasetsAvailable.MultidomainGold.name, for_labeler=False)
print(len(gold_dataset))
# 7675

sources, corrections = load_available_dataset_from_hf(DatasetsAvailable.RUSpellRU.name, for_labeler=True, split="train")
print(len(sources), len(corrections))
# 2000 2000
```

## Evaluation
We also provide functionality to evaluate the performance of spelling correction systems and rank them. 

🎯 Currently two options are available:
- [ruspelleval](https://www.dialog-21.ru/media/3427/sorokinaaetal.pdf);
- [ERRANT](https://github.com/chrisjbryant/errant)-based metric adapted for the Russian language;

Both algorithms output Precision, Recall and F1 scores that can be interpreted like the following:
- **Precision**: one minus share of unnecessary amendments; 
- **Recall**: proportion of expected corrections;
- **F1**: famous geometric mean of aforementioned two;

You can obtain these metrics simply by
```python
from sage.evaluation import Scorer
from sage.utils import DatasetsAvailable, load_available_dataset_from_hf

sources, corrections = load_available_dataset_from_hf(DatasetsAvailable.RUSpellRU.name, for_labeler=True, split="test")

scorer = Scorer()
metrics = scorer.score(sources, corrections, corrections, metrics=["ruspelleval", "errant"])
print(metrics)
# {'Precision': 100.0, 'Recall': 100.0, 'F1': 100.0, 'CASE_Precision': 100.0, 'CASE_Recall': 100.0, 'CASE_F1': 100.0, 'SPELL_Precision': 100.0, 'SPELL_Recall': 100.0, 'SPELL_F1': 100.0, 'PUNCT_Precision': 100.0, 'PUNCT_Recall': 100.0, 'PUNCT_F1': 100.0, 'YO_Precision': 100.0, 'YO_Recall': 100.0, 'YO_F1': 100.0}
```

... or by directly assessing the model:
```python
import os
import torch
from sage.utils import DatasetsAvailable
from sage.spelling_correction import AvailableCorrectors
from sage.spelling_correction import T5ModelForSpellingCorruption

corrector_fred_95m = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_distilled_95m.value)
corrector_mt5 = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_mt5_large.value)

corrector_fred_95m.model.to(torch.device("cuda:0"))
corrector_mt5.model.to(torch.device("cuda:0"))

metrics = corrector_fred_95m.evaluate("RUSpellRU", metrics=["errant", "ruspelleval"], batch_size=32)
print(metrics)
# {'CASE_Precision': 94.41, 'CASE_Recall': 92.55, 'CASE_F1': 93.47, 'SPELL_Precision': 77.52, 'SPELL_Recall': 64.09, 'SPELL_F1': 70.17, 'PUNCT_Precision': 86.77, 'PUNCT_Recall': 80.59, 'PUNCT_F1': 83.56, 'YO_Precision': 46.21, 'YO_Recall': 73.83, 'YO_F1': 56.84, 'Precision': 83.48, 'Recall': 74.75, 'F1': 78.87}

metrics = corrector_mt5.evaluate("/content/sage/data/example_data/jfleg", metrics=["ruspelleval"], batch_size=16)
print(metrics)
# {'Precision': 75.94, 'Recall': 88.15, 'F1': 81.59}
```

The metrics output by ERRANT based algorithm are indicated by the corresponding prefix, which refers to the specific type of errors:
- *CASE*: erroneously used case;
- *SPELL*: spelling and grammar errors;
- *PUNCT*: punctuation errors;
- *YO*: unnecessary replacement of "YO" (ё) letter;

📌 Credit for evaluation script of ruspelleval metric goes to Aleksei Sorokin and his notable [work](https://www.dialog-21.ru/media/3427/sorokinaaetal.pdf) 
in proceedings of [SpellRueval](https://www.dialog-21.ru/evaluation/2016/spelling_correction/). 

## Citation
If you want to know more about our work take a look at these publications:

💥 Our EACL 2024 [Paper](https://aclanthology.org/2024.findings-eacl.10/) provides a thorough description of the methodology used to obtain SOTA 
models for spelling corrections as well the comprehensive reports of all experiments that have been carried out. 

💫 While our Dialogue-2023 [Paper](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf) focuses on exploiting 
resources for the task of spelling correction and procedures on obtaining high-quality parallel corpuses. 

```
@inproceedings{martynov-etal-2024-methodology,
    title = "A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages",
    author = "Martynov, Nikita  and
      Baushenko, Mark  and
      Kozlova, Anastasia  and
      Kolomeytseva, Katerina  and
      Abramov, Aleksandr  and
      Fenogenova, Alena",
    editor = "Graham, Yvette  and
      Purver, Matthew",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2024",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-eacl.10",
    pages = "138--155",
    abstract = "Large language models excel in text generation and generalization, however they face challenges in text editing tasks, especially in correcting spelling errors and mistyping.In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian languages and potentially can be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistyping in texts and studying how those errors can be emulated in correct sentences to enrich generative models{'} pre-train procedure effectively. We investigate the effects of emulations in various text domains and examine two spelling corruption techniques: 1) first one mimics human behavior when making a mistake through leveraging statistics of errors from a particular dataset, and 2) second adds the most common spelling errors, keyboard miss clicks, and some heuristics within the texts.We conducted experiments employing various corruption strategies, models{'} architectures, and sizes in the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).",
}

@inproceedings{martynov2023augmentation,
  title={Augmentation methods for spelling corruptions},
  author={Martynov, Nikita and Baushenko, Mark and Abramov, Alexander and Fenogenova, Alena},
  booktitle={Proceedings of the International Conference “Dialogue},
  volume={2023},
  year={2023}
}
```

📌 Feel free to ask any questions regarding our work at corresponding point of contact:

_nikita.martynov.98@list.ru_

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/orgs/ai-forever/sage",
    "name": "sage-spelling",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<=3.12.3,>=3.8.0",
    "maintainer_email": null,
    "keywords": "sage spelling correction nlp deep learning transformers pytorch",
    "author": "Nikita Martynov, Mark Baushenko, Alena Fenogenova and Alexandr Abramov",
    "author_email": "nikita.martynov.98@list.ru",
    "download_url": "https://files.pythonhosted.org/packages/a2/4a/c8e846dba75f5641e8b10a97fcfcaa49ad030a067d72283b37a0161480b0/sage_spelling-1.1.0.tar.gz",
    "platform": null,
    "description": "<p align=\"center\">\r\n  <picture>\r\n    <source media=\"(prefers-color-scheme: dark)\" srcset=\"images/sage-white.svg\">\r\n    <source media=\"(prefers-color-scheme: light)\" srcset=\"images/sage-black.svg\">\r\n    <img alt=\"SAGE\" src=\"images/sage-white.svg\" style=\"max-width: 100%;\">\r\n  </picture>\r\n</p>\r\n\r\n<p align=\"center\">\r\n    <a href=\"https://opensource.org/licenses/MIT\">\r\n    <img alt=\"License\" src=\"https://img.shields.io/badge/License-MIT-yellow.svg\">\r\n    </a>\r\n    <a href=\"https://github.com/ai-forever/sage/releases\">\r\n    <img alt=\"Release\" src=\"https://badgen.net/badge/release/v1.1.0/\">\r\n    </a>\r\n    <a href=\"https://arxiv.org/abs/2308.09435\">\r\n    <img alt=\"Paper\" src=\"https://img.shields.io/badge/arXiv-2308.09435-red\">\r\n    </a>\r\n    <a href='https://sage-documentation-main.readthedocs.io/en/latest/?badge=latest'>\r\n    <img src='https://readthedocs.org/projects/sage-documentation-main/badge/?version=latest' alt='Documentation Status' />\r\n    </a>\r\n</p>\r\n\r\n<h2 align=\"center\">\r\n    <p> Spelling correction, corruption and evaluation for multiple languages\r\n</p>\r\n</h2>\r\n\r\n<div align=\"center\">\r\n  <h4>\r\n    <a href=\"#installation\">Install</a> |\r\n    <a href=\"#spelling-correction\">Models</a> |\r\n    <a href=\"#evaluation\">Evaluation</a> |\r\n    <a href=\"#statistic-based-spelling-corruption-sbsc\">SBSC</a> |\r\n    <a href=\"#augmentex\">Augmentex</a> |\r\n    <a href=\"#citation\">Papers</a>\r\n  </h4>\r\n</div>\r\n\r\nSAGE (Spell checking via Augmentation and Generative distribution Emulation) is \r\na complete solution that you need when working on a spelling problem:\r\n\r\n- \ud83d\udcaf Spelling correction with State-of-the-art pre-trained \ud83e\udd17Transformer models:\r\n  - [sage-fredt5-large](https://huggingface.co/ai-forever/sage-fredt5-large)\r\n  - [sage-fredt5-distilled-95m](https://huggingface.co/ai-forever/sage-fredt5-distilled-95m)\r\n  - [sage-mt5-large](https://huggingface.co/ai-forever/sage-mt5-large)\r\n  - [sage-m2m100-1.2B](https://huggingface.co/ai-forever/sage-m2m100-1.2B)\r\n  - [T5-large](https://huggingface.co/ai-forever/T5-large-spell) \r\n  - [M2M100-1.2B](https://huggingface.co/ai-forever/RuM2M100-1.2B) [Earlier release]\r\n  - [M2M100-418M](https://huggingface.co/ai-forever/RuM2M100-418M) [Earlier release]\r\n  - [FredT5-large](https://huggingface.co/ai-forever/FRED-T5-large-spell) [Earlier release]\r\n\r\n  You can test them out right here [![Try Model Generation In Colab!](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ai-forever/sage/blob/main/notebooks/text_correction_demo.ipynb)\r\n- \ud83e\udde9 Augment your data with spelling corruption algorithms, take a look at a quick demo [![Try Model Generation In Colab!](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ai-forever/sage/blob/main/notebooks/text_corruption_demo.ipynb)\r\n- \ud83d\udcca Evaluate performance of spelling correction tools.\r\n\r\n## News\r\n\ud83d\udd25 **[2024-01-18]**: Our paper *\"A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages\"* is accepted for EACL 2024 conference!\r\n\r\n\ud83d\udca5 **[2024-04-11]**: SAGE v1.1.0 is finally out: a comprehensive note about the details of release can be found [here](https://habr.com/ru/companies/sberdevices/articles/806897/). \r\n\r\n## Table of contents\r\n\r\n- [Installation](#installation)\r\n  - [Regular install](#regular-install)\r\n  - [Editable install](#editable-install)\r\n- [Quick demo](#quick-demo)\r\n- [Spelling corruption](#spelling-corruption)\r\n  - [Statistic-based Spelling Corruption (SBSC)](#statistic-based-spelling-corruption-sbsc)\r\n  - [Augmentex](#augmentex)\r\n- [Spelling correction](#spelling-correction)\r\n  - [RUSpellRU evaluation](#ruspellru-evaluation)\r\n  - [MultidomainGold evaluation](#multidomaingold-evaluation)\r\n  - [MedSpellchecker evaluation](#medspellchecker-evaluation)\r\n  - [GitHubTypoCorpusRu evaluation](#githubtypocorpusru-evaluation)\r\n- [Evaluation](#evaluation)\r\n- [Citation](#citation)\r\n\r\n## Installation\r\n### Regular install\r\n```commandline\r\ngit clone https://github.com/ai-forever/sage.git\r\ncd sage\r\npip install .\r\n```\r\n\r\nTo install extra requirements that you are going to need when working with ERRANT-based metric run\r\n\r\n```commandline\r\npython -m spacy download ru_core_news_lg\r\npip install -e .[errant]\r\n```\r\n\r\n### Editable install\r\n```commandline\r\ngit clone https://github.com/ai-forever/sage.git\r\ncd sage\r\npip install -e .\r\n```\r\nand proceed with extra requirements install as above. \r\n\r\n## Quick demo\r\nLets spoil some text:\r\n```python\r\nimport sage\r\nfrom sage.spelling_corruption import SBSCConfig, SBSCCorruptor\r\nfrom sage.utils import DatasetsAvailable\r\n\r\ntext = \"\u0417\u0430\u043c\u0435\u0442\u044c\u0442\u0435, \u043d\u0435 \u044f \u044d\u0442\u043e \u043f\u0440\u0435\u0434\u043b\u043e\u0436\u0438\u043b!\"\r\n\r\n# Instantiate SBSC corruptor from a dataset with errors in medical anamnesis\r\nconfig = SBSCConfig(\r\n    reference_dataset_name_or_path=DatasetsAvailable.MedSpellchecker.name,\r\n    reference_dataset_split=\"test\"\r\n)\r\ncorruptor = SBSCCorruptor.from_config(config)\r\n\r\ncorruptor.corrupt(text, seed=1)\r\n# '\u0417\u0430\u0438\u0435\u0442\u044c\u0442\u0435, \u043d\u0435 \u044f \u044d\u0442 \u043e \u043f\u0440\u0435\u0434 \u043b\u043e\u0436\u0438\u043b!'\r\n```\r\n... now with Augmentex:\r\n\r\n```python\r\nimport sage\r\nfrom sage.spelling_corruption import WordAugConfig, WordAugCorruptor\r\n\r\ntext = \"\u0417\u0430\u043c\u0435\u0442\u044c\u0442\u0435, \u043d\u0435 \u044f \u044d\u0442\u043e \u043f\u0440\u0435\u0434\u043b\u043e\u0436\u0438\u043b!\"\r\n\r\n# Instantiate WordAugCorruptor corruptor with a custom set of parameters\r\nconfig = WordAugConfig(\r\n    min_aug=1,\r\n    max_aug=5,\r\n    unit_prob=0.4,\r\n)\r\ncorruptor = WordAugCorruptor.from_config(config)\r\n\r\ncorruptor.corrupt(text, seed=1)\r\n# '\u044d\u0442\u043e \u043d\u0435 \u043f\u0440\u0435\u0434\u043b\u043e\u0436\u0438\u043b! \u0417\u0430\u043c\u0435\u0442\u044c\u0442\u0435, \u044f'\r\n```\r\n\r\n... or for the English language:\r\n\r\n```python\r\nimport os\r\nfrom sage.spelling_corruption import SBSCConfig, SBSCCorruptor\r\n\r\ntext = \"Screw you guys, I am going home. (c)\"\r\n\r\n# Instantiate SBSC corruptor from a JFLEG dataset\r\nconfig = SBSCConfig(\r\n    lang=\"en\",\r\n    reference_dataset_name_or_path=os.path.join(\"data\", \"example_data\", \"jfleg\"),\r\n)\r\ncorruptor = SBSCCorruptor.from_config(config)\r\n\r\ncorruptor.corrupt(text, seed=1)\r\n# 'Screw you kuys, I am going home. (c)'\r\n```\r\n\r\nNow we can use our models to restore the initial text back:\r\n```python\r\nfrom sage.spelling_correction import AvailableCorrectors\r\nfrom sage.spelling_correction import RuM2M100ModelForSpellingCorrection, T5ModelForSpellingCorruption\r\n\r\ntext_ru = \"\u0417\u0430\u043c\u0442\u044c\u0442\u0435 \u043d\u0435 \u044f \u044d\u0442\u043e \u043f\u0440\u0435\u0434\u043b\u043e\u0436\u0438\u043b\"\r\ntext_en = \"Screw you kuys, I am going home. (c)\"\r\n\r\ncorrector_fred = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_large.value)\r\ncorrector_m2m = RuM2M100ModelForSpellingCorrection.from_pretrained(AvailableCorrectors.m2m100_1B.value)\r\ncorrector_en = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.ent5_large.value)\r\n\r\nprint(corrector_fred.correct(text_ru))\r\n# ['\u0417\u0430\u043c\u0435\u0442\u044c\u0442\u0435, \u043d\u0435 \u044f \u044d\u0442\u043e \u043f\u0440\u0435\u0434\u043b\u043e\u0436\u0438\u043b.']\r\n\r\nprint(corrector_m2m.correct(text_ru))\r\n# ['\u0417\u0430\u043c\u0435\u0442\u044c\u0442\u0435 \u043d\u0435 \u044f \u044d\u0442\u043e \u043f\u0440\u0435\u0434\u043b\u043e\u0436\u0438\u043b']\r\n\r\nprint(corrector_en.correct(text_en, prefix=\"grammar: \"))\r\n# ['Screw you guys, I am going home. (c)']\r\n```\r\n\r\nEvaluate performance of the models on open benchmarks for spelling correction:\r\n```python\r\nimport os\r\nimport torch\r\nfrom sage.utils import DatasetsAvailable\r\nfrom sage.spelling_correction import AvailableCorrectors\r\nfrom sage.spelling_correction import T5ModelForSpellingCorruption\r\n\r\ncorrector_fred_95m = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_distilled_95m.value)\r\ncorrector_mt5 = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_mt5_large.value)\r\n\r\ncorrector_fred_95m.model.to(torch.device(\"cuda:0\"))\r\ncorrector_mt5.model.to(torch.device(\"cuda:0\"))\r\n\r\nmetrics = corrector_fred_95m.evaluate(\"RUSpellRU\", metrics=[\"errant\", \"ruspelleval\"], batch_size=32)\r\nprint(metrics)\r\n# {'CASE_Precision': 94.41, 'CASE_Recall': 92.55, 'CASE_F1': 93.47, 'SPELL_Precision': 77.52, 'SPELL_Recall': 64.09, 'SPELL_F1': 70.17, 'PUNCT_Precision': 86.77, 'PUNCT_Recall': 80.59, 'PUNCT_F1': 83.56, 'YO_Precision': 46.21, 'YO_Recall': 73.83, 'YO_F1': 56.84, 'Precision': 83.48, 'Recall': 74.75, 'F1': 78.87}\r\n\r\nmetrics = corrector_mt5.evaluate(\"/content/sage/data/example_data/jfleg\", metrics=[\"ruspelleval\"], batch_size=16)\r\nprint(metrics)\r\n# {'Precision': 75.94, 'Recall': 88.15, 'F1': 81.59}\r\n\r\n```\r\n\r\n_NOTE_: if you are launching code snippet in Colab you'd probably end up with MEMORY ERROR, so manage evaluation \r\nprocedures so that you meet available device's restrictions. As a feasible workaround you can execute \r\n```python\r\ndel corrector_fred_95m.model\r\n```\r\nto free some space. \r\n\r\n## Spelling Corruption\r\nWe implemented two methods for spelling corruption. **S**tatistic-**b**ased **S**pelling **C**orruption (**SBSC**) aims \r\nto mimic human behaviour when making an error. While [Augmentex](#augmentex) relies on rule-based heuristics and common\r\nerrors and mistypings especially those committed while typing text on a keyboard. \r\n\r\n\ud83d\ude80 Both methods proved their effectiveness for spelling correction systems and celebrated substantial **performance gains**\r\nfully reported in our [Paper](https://aclanthology.org/2024.findings-eacl.10/).\r\n\r\n### Statistic-based Spelling Corruption (SBSC)\r\nThis method is thoroughly described in our another [Paper](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf) \r\nand in this \ud83d\udde3\ufe0f[Talk](https://youtu.be/yFfkV0Qjuu0?si=XmKfocCSLnKihxS_). \r\n\r\nBriefly, SBSC follows two simple steps:\r\n- \ud83e\udde0 Analyze errors, their type and positions in a source text;\r\n- \u270f\ufe0f Reproduce errors from the source text in a new sentence;\r\n\r\n\ud83e\udde0 To analyze errors in a source sentence we need its corresponding correction in order to build \r\n[Levenshtein matrix](https://en.wikipedia.org/wiki/Levenshtein_distance), traverse it back starting from the \r\nbottom right entry and determine the exact position and type of an error. We then aggregate all obtained statistics and \r\nnormalize it to valid discrete distributions. \r\n\r\n\u270f\ufe0f \"Reproduce\" step is even less complicated: we just sample number of errors per sentence, their types and relative\r\npositions from corresponding distributions and apply them to a correct sentence.\r\n\r\nAs stated, you need a parallel dataset to \"fit\" SBSC. We provide a set of four datasets with natural errors covering\r\nexhaustive range of domains:\r\n\r\n- **RUSpellRU**: texts collected from [LiveJournal](https://www.livejournal.com/media), with manually corrected typos and errors;\r\n- **MultidomainGold**: examples from 7 text sources, including the open web, news, social media, reviews, subtitles, policy documents and literary works;\r\n- **MedSpellChecker**: texts with errors from medical anamnesis;\r\n- **GitHubTypoCorpusRu**: spelling errors and typos in commits from GitHub;\r\n\r\nYou can use them as simple as\r\n```python\r\nimport sage\r\nfrom sage.spelling_corruption import SBSCConfig, SBSCCorruptor\r\nfrom sage.utils import DatasetsAvailable\r\n\r\n# Instantiate SBSC corruptor from a dataset with errors in medical anamnesis\r\nconfig = SBSCConfig(\r\n    reference_dataset_name_or_path=DatasetsAvailable.MedSpellchecker.name,\r\n    reference_dataset_split=\"test\"\r\n)\r\ncorruptor = SBSCCorruptor.from_config(config)\r\n```\r\n\r\n... or you can initialize your SBSC from locally stored dataset:\r\n```python\r\nimport os\r\nfrom sage.spelling_corruption import SBSCConfig, SBSCCorruptor\r\n\r\n# Instantiate SBSC corruptor from a JFLEG dataset\r\nconfig = SBSCConfig(\r\n    lang=\"en\",\r\n    reference_dataset_name_or_path=os.path.join(\"data\", \"example_data\", \"jfleg\"),\r\n)\r\ncorruptor = SBSCCorruptor.from_config(config)\r\n```\r\n\r\n\u2705 To check how good SBSC actually approximates original errors, you can plot side-by-side graphs of original and \r\nsynthetically generated distributions:\r\n\r\n<p align=\"center\">\r\n    <br>\r\n    <img src=\"images/ruspellru_side_by_side.jpg\" width=\"400\" style=\"float:center; padding-right:60px\"/> \r\n    <img src=\"images/bea60k_side_by_side.jpg\" width=\"400\" style=\"float:center; padding-left:60px\"/>\r\n    <br>\r\n<p>\r\n\r\nTo access these graphs you can simply\r\n```python\r\nfrom sage.utils import load_available_dataset_from_hf, draw_and_save_errors_distributions_comparison_charts\r\nfrom sage.spelling_corruption.sbsc.labeler import process_mistypings\r\nfrom sage.spelling_corruption import SBSCCorruptor\r\n\r\nsources, corrections = load_available_dataset_from_hf(\"RUSpellRU\", for_labeler=True, split=\"train\")\r\nruspellru_stats, ruspellru_confusion_matrix, ruspellru_typos_cnt = process_mistypings(sources, corrections)\r\n\r\ncorruptor = SBSCCorruptor.from_default_config()\r\nspoiled_sentences = corruptor.batch_corrupt(corrections)\r\n\r\nsbsc_stats, sbsc_confusion_matrix, sbsc_typos_cnt = process_mistypings(spoiled_sentences, corrections)\r\n\r\ndraw_and_save_errors_distributions_comparison_charts(\r\n    actual_typos_cnt = sbsc_typos_cnt,\r\n    reference_typos_cnt=ruspellru_typos_cnt,\r\n    actual_stats=sbsc_stats,\r\n    reference_stats=ruspellru_stats,\r\n    path_to_save=\"ruspellru_sbsc.jpg\"\r\n)\r\n```\r\n\r\n### Augmentex\r\nAugmentex introduces rule-based and common statistic (empowered by [KartaSlov](https://kartaslov.ru) project) \r\napproach to insert errors in text. It is fully described again in the [Paper](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf)\r\nand in this \ud83d\udde3\ufe0f[Talk](https://youtu.be/yFfkV0Qjuu0?si=XmKfocCSLnKihxS_).\r\n\r\n\ud83d\udd87\ufe0f Augmentex allows you to operate on two levels of granularity when it comes to text corruption and offers you sets of \r\nspecific methods suited for particular level:\r\n- **Word level**:\r\n  - _replace_ - replace a random word with its incorrect counterpart;\r\n  - _delete_ - delete random word;\r\n  - _swap_ - swap two random words;\r\n  - _stopword_ - add random words from stop-list;\r\n  - _reverse_ - change a case of the first letter of a random word;\r\n- **Character level**:\r\n  - _shift_ - randomly swaps upper / lower case in a string;\r\n  - _orfo_ - substitute correct characters with their common incorrect counterparts;\r\n  - _typo_ - substitute correct characters as if they are mistyped on a keyboard;\r\n  - _delete_ - delete random character;\r\n  - _multiply_ - multiply random character;\r\n  - _swap_ - swap two adjacent characters;\r\n  - _insert_ - insert random character;\r\n\r\nTo access Augmentex you only need these few manipulations:\r\n```python\r\nfrom sage.spelling_corruption import CharAugConfig, CharAugCorruptor\r\n\r\nconfig = CharAugConfig(\r\n    unit_prob=0.3, # proportion of characters that is going to undergo edits\r\n    min_aug=1, # minimum number of edits\r\n    max_aug=5, # maximum number of edits \r\n    mult_num=3 # `multiply` edit\r\n)\r\ncorruptor = CharAugCorruptor.from_config(config)\r\n```\r\n\r\n... or like this:\r\n\r\n```python\r\nfrom sage.spelling_corruption import WordAugConfig, WordAugCorruptor\r\n\r\nconfig = WordAugConfig(\r\n    unit_prob=0.4, # proportion of characters that is going to undergo edits\r\n    min_aug=1, # minimum number of edits\r\n    max_aug=5, # maximum number of edits \r\n)\r\ncorruptor = WordAugCorruptor.from_config(config)\r\n```\r\n\r\nAugmentex has been created by our fellow team, the project has its own [repo](https://github.com/ai-forever/augmentex), do not forget to take a look! \r\n\r\n## Spelling Correction\r\nOur methodology for obtaining model with optimal performance on spellchecking task is thoroughly described in our\r\n[Paper](https://aclanthology.org/2024.findings-eacl.10/). And the algorithm is simple and generally consists of two steps:\r\n\r\n- Pre-train model on extensive parallel corpus with synthetically generated errors;\r\n- Fine-tune on combinations of available datasets for spelling correction with \"human-made\" errors;\r\n\r\nWe use [Augmentex](#augmentex) and [SBSC](#statistic-based-spelling-corruption-sbsc) for both generating large synthetic corpora and augmenting datasets with natural errors. \r\nThe family of pre-trained correctors now amounts for 8 models.\r\n\r\nWe've 6 \ud83e\udd17Transformer models for Russian \ud83c\uddf7\ud83c\uddfa:\r\n  - [sage-fredt5-large](https://huggingface.co/ai-forever/sage-fredt5-large)\r\n  - [sage-fredt5-distilled-95m](https://huggingface.co/ai-forever/sage-fredt5-distilled-95m)\r\n  - [sage-m2m100-1.2B](https://huggingface.co/ai-forever/sage-m2m100-1.2B)\r\n  - [M2M100-1.2B](https://huggingface.co/ai-forever/RuM2M100-1.2B) [Earlier release]\r\n  - [M2M100-418M](https://huggingface.co/ai-forever/RuM2M100-418M) [Earlier release]\r\n  - [FredT5-large](https://huggingface.co/ai-forever/FRED-T5-large-spell) [Earlier release]\r\n\r\nAnd two models for English \ud83c\uddec\ud83c\udde7:\r\n  - [T5-large](https://huggingface.co/ai-forever/T5-large-spell)\r\n  - [sage-mt5-large](https://huggingface.co/ai-forever/sage-mt5-large)\r\n\r\nModels for the Russian language have been pre-trained on combination of Russian Wikipedia and videos transcriptions with \r\nartificial errors generated by [SBSC](#statistic-based-spelling-corruption-sbsc) on statistics gathered from train split of [RUSpellRU](https://huggingface.co/datasets/ai-forever/spellcheck_benchmark). \r\nCorrectors for English trained on mixture of English Wikipedia articles and news posts with synthetic errors inserted by [SBSC](#statistic-based-spelling-corruption-sbsc) fitted on statistics from 5k subsample\r\nof [BEA60k](https://github.com/neuspell/neuspell/tree/master).\r\n\r\n\ud83d\udcda We also validate our solutions on available datasets with \"human-made\" errors:\r\n\r\n- **RUSpellRU**: texts collected from [LiveJournal](https://www.livejournal.com/media), with manually corrected typos and errors;\r\n- **MultidomainGold**: examples from 7 text sources, including the open web, news, social media, reviews, subtitles, policy documents and literary works;\r\n- **MedSpellChecker**: texts with errors from medical anamnesis;\r\n- **GitHubTypoCorpusRu**: spelling errors and typos in commits from GitHub;\r\n- **BEA60K**: English spelling errors collected from several domains;\r\n- **JFLEG**: 1601 sentences in English, which contain about 2 thousand spelling errors;\r\n\r\n\ud83d\udcc8 Here we report evaluation of some setups:\r\n- Zero-shot evaluation of pre-trained checkpoints;\r\n- Additional fine-tuning (ft.) on the target dataset;\r\n\r\nFull list of setups and corresponding performances are in the [Paper](https://aclanthology.org/2024.findings-eacl.10/).\r\n\r\n**RUSpellRU**, **MultidomainGold**, **MedSpellChecker** and **GitHubTypoCorpusRu** come from [spellcheck_punctuation_benchmark](https://huggingface.co/datasets/ai-forever/spellcheck_punctuation_benchmark).\r\nThe benchmark accounts for both punctuation and spelling errors. For the simplicity and better representativeness we report results only for those models\r\n([sage-fredt5-large](https://huggingface.co/ai-forever/sage-fredt5-large), [sage-fredt5-distilled-95m](https://huggingface.co/ai-forever/sage-fredt5-distilled-95m)) that deal with both types of errors (the Russian language).\r\nThe detailed metrics for other checkpoints can be found either in the [Paper](https://aclanthology.org/2024.findings-eacl.10/), [post](\u0441\u0441\u044b\u043b\u043a\u0430 \u043d\u0430 \u043d\u043e\u0432\u044b\u0439 \u0445\u0430\u0431\u0440) or corresponding model card. \r\n\r\n_NOTE:_ **MedSpellChecker** and **GitHubTypoCorpusRu** do not have train split, so their performance on \r\n**Pre-train + fine-tune** setup is reported as a result of fine-tuning on combination of **RUSpellRU** and **MultidomainGold**\r\ndatasets.\r\n\r\n#### RUSpellRU Evaluation\r\n\r\n| Model                      | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |\r\n|----------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | \r\n| sage-ai-service            | 90.3 | 86.3 | 88.2 | 90.3 | 86.6 | 88.4 | 95.2 | 95.9 | 95.6 |\r\n| sage-fredt5-large          | 57.3 | 68.0 | 62.2 | 86.7 | 46.1 | 60.2 | 92.1 | 67.8 | 78.1 |\r\n| sage-fredt5-large (ft.)    | 88.4 | 80.9 | 84.5 | 88.2 | 85.3 | 86.8 | 95.5 | 94.0 | 94.7 |\r\n| sage-fredt5-distilled-95m (ft.)  | 83.5 | 74.8 | 78.9 | 86.8 | 80.6 | 83.6 | 94.4 | 92.5 | 93.5 |\r\n| gpt-3.5-turbo              | 33.6 | 58.5 | 42.7 | 85.9 | 64.6 | 73.7 | 84.9 | 73.9 | 79.0 |\r\n| gpt-4                      | 54.9 | 76.7 | 64.0 | 84.0 | 82.3 | 83.2 | 91.5 | 90.2 | 90.9 |\r\n\r\n\r\n#### MultidomainGold Evaluation\r\n\r\n| Model                           | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |\r\n|---------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | \r\n| sage-ai-service                 | 81.6 | 77.7 | 79.6 | 70.2 | 67.5 | 68.8 | 80.5 | 80.5 | 80.5 |\r\n| sage-fredt5-large               | 43.4 | 49.7 | 46.3 | 21.8 | 21.3 | 21.6 | 58.8 | 23.9 | 34.0 |\r\n| sage-fredt5-large (ft.)         | 80.3 | 75.1 | 77.6 | 69.0 | 66.5 | 67.7 | 78.6 | 80.0 | 79.3 |\r\n| sage-fredt5-distilled-95m (ft.) | 77.2 | 69.9 | 73.4 | 66.8 | 63.4 | 65.0 | 76.8 | 79.1 | 77.9 |\r\n| gpt-3.5-turbo                   | 18.8 | 48.1 | 27.1 | 42.0 | 31.8 | 36.2 | 47.1 | 51.3 | 49.1 |\r\n| gpt-4                           | 25.4 | 68.0 | 37.0 | 57.8 | 54.3 | 56.0 | 54.0 | 67.5 | 60.0 |\r\n\r\n\r\n\r\n#### MedSpellchecker Evaluation\r\n\r\n| Model                           | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |\r\n|---------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | \r\n| sage-ai-service                 | 71.3 | 73.5 | 72.4 | 75.1 | 69.2 | 72.0 | 80.9 | 72.8 | 76.6|\r\n| sage-fredt5-large               | 35.2 | 54.5 | 42.8 | 19.2 | 13.2 | 15.7 | 48.7 | 36.8 | 41.9 |\r\n| sage-fredt5-large (ft.)         | 72.5 | 72.2 | 72.3 | 74.6 | 66.4 | 70.3 | 79.3 | 85.1 | 82.1 |\r\n| sage-fredt5-distilled-95m (ft.) | 65.1 | 64.8 | 64.9 | 78.6 | 63.1 | 70.0 | 63.5 | 74.7 | 68.7 |\r\n| gpt-3.5-turbo                   | 14.7 | 45.9 | 22.3 | 69.9 | 52.3 | 59.8 | 26.4 | 41.8 | 32.3 |\r\n| gpt-4                           | 37.8 | 72.3 | 49.6 | 81.4 | 64.3 | 71.9 | 73.0 | 62.1 | 67.1 |\r\n\r\n\r\n#### GitHubTypoCorpusRu Evaluation\r\n\r\n| Model                           | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |\r\n|---------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | \r\n| sage-ai-service                 | 70.8 | 56.3 | 62.7 | 48.9 | 35.8 | 41.4 | 32.9 | 45.3 | 38.1|\r\n| sage-fredt5-large               | 46.0 | 46.6 | 46.3 | 22.7 | 18.3 | 20.2 | 12.0 | 13.2 | 12.6 |\r\n| sage-fredt5-large (ft.)         | 67.5 | 53.2 | 59.5 | 48.5  | 38.0 | 42.6 | 37.3 | 50.0 | 42.7 |\r\n| sage-fredt5-distilled-95m (ft.) | 57.8 | 48.5 | 52.7 | 45.2 | 39.5 | 42.1 | 29.9 | 46.2 | 36.3 |\r\n| gpt-3.5-turbo                   | 23.7 | 38.7 | 29.4 | 37.6 | 23.3 | 28.7 | 19.6 | 35.9 | 25.3 |\r\n| gpt-4                           | 27.0 | 52.8 | 35.7 | 45.9 | 32.6 | 38.2 | 25.7 | 36.8 | 30.2 |\r\n\r\n\r\n#### BEA60K Evaluation\r\n\r\n| Model | Precision | Recall | F1 |\r\n| --- | --- | --- | --- |\r\n| sage-mt5-large | 64.7 | 83.8 | 73.0 | \r\n| T5-large-spell | 66.5 | 83.1 | 73.9 |\r\n| gpt-3.5-turbo |  66.9 | 84.1 | 74.5 | \r\n| gpt-4 | 68.6 | 85.2 | 76.0 | \r\n| [Bert](https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |\r\n| [SC-LSTM](https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |\r\n\r\n\r\n#### JFLEG Evaluation\r\n\r\n| Model | Precision | Recall | F1 |\r\n| --- | --- | --- | --- |\r\n| sage-mt5-large | 74.9 | 88.4 | 81.1 |\r\n| T5-large-spell | 83.4 | 84.3 | 83.8 |\r\n| gpt-3.5-turbo |  77.8 | 88.6 | 82.9 | \r\n| gpt-4 | 77.9 | 88.3 | 82.8 |\r\n| [Bert](https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |\r\n| [SC-LSTM](https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |\r\n\r\n**RUSpellRU**, **MultidomainGold**, **MedSpellChecker** and **GitHubTypoCorpusRu** are available as HuggingFace datasets [here](https://huggingface.co/datasets/ai-forever/spellcheck_punctuation_benchmark) and through the API of our library: \r\n```python\r\nfrom sage.utils import load_available_dataset_from_hf, DatasetsAvailable\r\n\r\nprint([dataset.name for dataset in DatasetsAvailable])\r\n# ['MultidomainGold', 'RUSpellRU', 'MedSpellchecker', 'GitHubTypoCorpusRu', 'MultidomainGold_orth', 'RUSpellRU_orth', 'MedSpellchecker_orth', 'GitHubTypoCorpusRu_orth']\r\n\r\ngold_dataset = load_available_dataset_from_hf(DatasetsAvailable.MultidomainGold.name, for_labeler=False)\r\nprint(len(gold_dataset))\r\n# 7675\r\n\r\nsources, corrections = load_available_dataset_from_hf(DatasetsAvailable.RUSpellRU.name, for_labeler=True, split=\"train\")\r\nprint(len(sources), len(corrections))\r\n# 2000 2000\r\n```\r\n\r\n## Evaluation\r\nWe also provide functionality to evaluate the performance of spelling correction systems and rank them. \r\n\r\n\ud83c\udfaf Currently two options are available:\r\n- [ruspelleval](https://www.dialog-21.ru/media/3427/sorokinaaetal.pdf);\r\n- [ERRANT](https://github.com/chrisjbryant/errant)-based metric adapted for the Russian language;\r\n\r\nBoth algorithms output Precision, Recall and F1 scores that can be interpreted like the following:\r\n- **Precision**: one minus share of unnecessary amendments; \r\n- **Recall**: proportion of expected corrections;\r\n- **F1**: famous geometric mean of aforementioned two;\r\n\r\nYou can obtain these metrics simply by\r\n```python\r\nfrom sage.evaluation import Scorer\r\nfrom sage.utils import DatasetsAvailable, load_available_dataset_from_hf\r\n\r\nsources, corrections = load_available_dataset_from_hf(DatasetsAvailable.RUSpellRU.name, for_labeler=True, split=\"test\")\r\n\r\nscorer = Scorer()\r\nmetrics = scorer.score(sources, corrections, corrections, metrics=[\"ruspelleval\", \"errant\"])\r\nprint(metrics)\r\n# {'Precision': 100.0, 'Recall': 100.0, 'F1': 100.0, 'CASE_Precision': 100.0, 'CASE_Recall': 100.0, 'CASE_F1': 100.0, 'SPELL_Precision': 100.0, 'SPELL_Recall': 100.0, 'SPELL_F1': 100.0, 'PUNCT_Precision': 100.0, 'PUNCT_Recall': 100.0, 'PUNCT_F1': 100.0, 'YO_Precision': 100.0, 'YO_Recall': 100.0, 'YO_F1': 100.0}\r\n```\r\n\r\n... or by directly assessing the model:\r\n```python\r\nimport os\r\nimport torch\r\nfrom sage.utils import DatasetsAvailable\r\nfrom sage.spelling_correction import AvailableCorrectors\r\nfrom sage.spelling_correction import T5ModelForSpellingCorruption\r\n\r\ncorrector_fred_95m = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_distilled_95m.value)\r\ncorrector_mt5 = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_mt5_large.value)\r\n\r\ncorrector_fred_95m.model.to(torch.device(\"cuda:0\"))\r\ncorrector_mt5.model.to(torch.device(\"cuda:0\"))\r\n\r\nmetrics = corrector_fred_95m.evaluate(\"RUSpellRU\", metrics=[\"errant\", \"ruspelleval\"], batch_size=32)\r\nprint(metrics)\r\n# {'CASE_Precision': 94.41, 'CASE_Recall': 92.55, 'CASE_F1': 93.47, 'SPELL_Precision': 77.52, 'SPELL_Recall': 64.09, 'SPELL_F1': 70.17, 'PUNCT_Precision': 86.77, 'PUNCT_Recall': 80.59, 'PUNCT_F1': 83.56, 'YO_Precision': 46.21, 'YO_Recall': 73.83, 'YO_F1': 56.84, 'Precision': 83.48, 'Recall': 74.75, 'F1': 78.87}\r\n\r\nmetrics = corrector_mt5.evaluate(\"/content/sage/data/example_data/jfleg\", metrics=[\"ruspelleval\"], batch_size=16)\r\nprint(metrics)\r\n# {'Precision': 75.94, 'Recall': 88.15, 'F1': 81.59}\r\n```\r\n\r\nThe metrics output by ERRANT based algorithm are indicated by the corresponding prefix, which refers to the specific type of errors:\r\n- *CASE*: erroneously used case;\r\n- *SPELL*: spelling and grammar errors;\r\n- *PUNCT*: punctuation errors;\r\n- *YO*: unnecessary replacement of \"YO\" (\u0451) letter;\r\n\r\n\ud83d\udccc Credit for evaluation script of ruspelleval metric goes to Aleksei Sorokin and his notable [work](https://www.dialog-21.ru/media/3427/sorokinaaetal.pdf) \r\nin proceedings of [SpellRueval](https://www.dialog-21.ru/evaluation/2016/spelling_correction/). \r\n\r\n## Citation\r\nIf you want to know more about our work take a look at these publications:\r\n\r\n\ud83d\udca5 Our EACL 2024 [Paper](https://aclanthology.org/2024.findings-eacl.10/) provides a thorough description of the methodology used to obtain SOTA \r\nmodels for spelling corrections as well the comprehensive reports of all experiments that have been carried out. \r\n\r\n\ud83d\udcab While our Dialogue-2023 [Paper](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf) focuses on exploiting \r\nresources for the task of spelling correction and procedures on obtaining high-quality parallel corpuses. \r\n\r\n```\r\n@inproceedings{martynov-etal-2024-methodology,\r\n    title = \"A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages\",\r\n    author = \"Martynov, Nikita  and\r\n      Baushenko, Mark  and\r\n      Kozlova, Anastasia  and\r\n      Kolomeytseva, Katerina  and\r\n      Abramov, Aleksandr  and\r\n      Fenogenova, Alena\",\r\n    editor = \"Graham, Yvette  and\r\n      Purver, Matthew\",\r\n    booktitle = \"Findings of the Association for Computational Linguistics: EACL 2024\",\r\n    month = mar,\r\n    year = \"2024\",\r\n    address = \"St. Julian{'}s, Malta\",\r\n    publisher = \"Association for Computational Linguistics\",\r\n    url = \"https://aclanthology.org/2024.findings-eacl.10\",\r\n    pages = \"138--155\",\r\n    abstract = \"Large language models excel in text generation and generalization, however they face challenges in text editing tasks, especially in correcting spelling errors and mistyping.In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian languages and potentially can be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistyping in texts and studying how those errors can be emulated in correct sentences to enrich generative models{'} pre-train procedure effectively. We investigate the effects of emulations in various text domains and examine two spelling corruption techniques: 1) first one mimics human behavior when making a mistake through leveraging statistics of errors from a particular dataset, and 2) second adds the most common spelling errors, keyboard miss clicks, and some heuristics within the texts.We conducted experiments employing various corruption strategies, models{'} architectures, and sizes in the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).\",\r\n}\r\n\r\n@inproceedings{martynov2023augmentation,\r\n  title={Augmentation methods for spelling corruptions},\r\n  author={Martynov, Nikita and Baushenko, Mark and Abramov, Alexander and Fenogenova, Alena},\r\n  booktitle={Proceedings of the International Conference \u201cDialogue},\r\n  volume={2023},\r\n  year={2023}\r\n}\r\n```\r\n\r\n\ud83d\udccc Feel free to ask any questions regarding our work at corresponding point of contact:\r\n\r\n_nikita.martynov.98@list.ru_\r\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "SAGE: Spell checking via Augmentation and  Generative distribution Emulation",
    "version": "1.1.0",
    "project_urls": {
        "Homepage": "https://github.com/orgs/ai-forever/sage"
    },
    "split_keywords": [
        "sage",
        "spelling",
        "correction",
        "nlp",
        "deep",
        "learning",
        "transformers",
        "pytorch"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "83d78c7e4971d6342d022b9dfa168ce61b2cdc8f65d267a1dd6361629104c76e",
                "md5": "47536a1ce92565aed79d88e28a9eb21b",
                "sha256": "61d3cc79bb8a73d96dc31f5cb3357939d207acef5dd9962ce1ecbe657c1c56cc"
            },
            "downloads": -1,
            "filename": "sage_spelling-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "47536a1ce92565aed79d88e28a9eb21b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<=3.12.3,>=3.8.0",
            "size": 47686,
            "upload_time": "2024-07-21T11:44:44",
            "upload_time_iso_8601": "2024-07-21T11:44:44.344272Z",
            "url": "https://files.pythonhosted.org/packages/83/d7/8c7e4971d6342d022b9dfa168ce61b2cdc8f65d267a1dd6361629104c76e/sage_spelling-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a24ac8e846dba75f5641e8b10a97fcfcaa49ad030a067d72283b37a0161480b0",
                "md5": "d1185518540e924d7616f1e6631dcb32",
                "sha256": "d3c30a09d09964f556ad65c5e27a8f205c8c46600c58c8862dfec48f1f4f6f9e"
            },
            "downloads": -1,
            "filename": "sage_spelling-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "d1185518540e924d7616f1e6631dcb32",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<=3.12.3,>=3.8.0",
            "size": 62807,
            "upload_time": "2024-07-21T11:44:46",
            "upload_time_iso_8601": "2024-07-21T11:44:46.327401Z",
            "url": "https://files.pythonhosted.org/packages/a2/4a/c8e846dba75f5641e8b10a97fcfcaa49ad030a067d72283b37a0161480b0/sage_spelling-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-21 11:44:46",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "orgs",
    "github_project": "ai-forever",
    "github_not_found": true,
    "lcname": "sage-spelling"
}

Nikita Martynov, Mark Baushenko, Alena Fenogenova and Alexandr Abramov