pycorrector

Name	pycorrector
Version	1.1.3
home_page	https://github.com/shibing624/pycorrector
Summary	Chinese Text Error Corrector
upload_time	2025-07-09 02:40:33
maintainer	None
docs_url	None
author	XuMing
requires_python	>=3.6
license	Apache 2.0
keywords	pycorrector correction chinese error correction nlp
requirements	jieba pypinyin transformers datasets numpy pandas six loguru pyahocorasick

[**🇨🇳中文**](https://github.com/shibing624/pycorrector/blob/master/README.md) | [**🌐English**](https://github.com/shibing624/pycorrector/blob/master/README_EN.md) | [**📖Docs**](https://github.com/shibing624/pycorrector/wiki) | [**🤖Models**](https://huggingface.co/shibing624)

<div align="center">
  <a href="https://github.com/shibing624/pycorrector">
    <img src="https://github.com/shibing624/pycorrector/blob/master/docs/pycorrector.png" alt="Logo" height="156">
  </a>
</div>

-----------------

# pycorrector: useful python text correction toolkit
[![PyPI version](https://badge.fury.io/py/pycorrector.svg)](https://badge.fury.io/py/pycorrector)
[![Downloads](https://static.pepy.tech/badge/pycorrector)](https://pepy.tech/project/pycorrector)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.8%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/issues)
[![Wechat Group](https://img.shields.io/badge/wechat-group-green.svg?logo=wechat)](#Contact)


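**pycorrector**: A Chinese text error correction toolkit. It corrects phonetically similar, visually similar, and grammatical errors in Chinese text, and is developed with Python 3.8.

**pycorrector** implements text correction with a range of models, including Kenlm, ConvSeq2Seq, BERT, MacBERT, ELECTRA, ERNIE, and GPT, and evaluates how well each one performs.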

**Guide**

- [Features](#Features)
- [Evaluation](#Evaluation)
- [Usage](#usage)
- [Dataset](#Dataset)
- [Contact](#Contact)
- [References](#references)

## Introduction

Common error types in the Chinese text correction task:

<img src="https://github.com/shibing624/pycorrector/blob/master/docs/git_image/error_type.png" width="600" />

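Not every error type shows up in every application. Proofreading for pinyin input methods and speech recognition focuses on phonetically similar errors; proofreading for Wubi input methods and OCR focuses on visually similar errors; search-engine query correction has to handle all error types.

This project focuses on phonetically similar, visually similar, grammatical, and proper-noun errors.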

## News
[2025/07/08] v1.1.2: Added support for the Qwen3-based Chinese text correction model [twnlp/ChineseErrorCorrector3-4B](https://huggingface.co/twnlp/ChineseErrorCorrector3-4B), which corrects extra characters, missing characters, wrong characters, word-order errors, and grammatical errors. See [Release-v1.1.2](https://github.com/shibing624/pycorrector/releases/tag/1.1.2)

[2024/10/14] v1.1.0: Added Qwen2.5-based Chinese text correction models that correct extra characters, missing characters, wrong characters, word-order errors, and grammatical errors. Released the [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) and [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) models together with their LoRA models. See [Release-v1.1.0](https://github.com/shibing624/pycorrector/releases/tag/1.1.0)

[2023/11/07] v1.0.0: Added GPT models such as ChatGLM3/LLaMA2 for Chinese text correction, and released the ChatGLM3-6B-based spelling and grammar correction model [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora); rewrote the implementations of DeepContext, ConvSeq2Seq, T5, and other models. See [Release-v1.0.0](https://github.com/shibing624/pycorrector/releases/tag/1.0.0)


## Features

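* [Kenlm model](https://github.com/shibing624/pycorrector/tree/master/examples/kenlm): This project trains a Chinese n-gram language model with the Kenlm statistical language modeling toolkit; combined with rule-based methods and a confusion set, it corrects Chinese spelling errors. Fast and easy to extend, with moderate accuracy.
* [DeepContext model](https://github.com/shibing624/pycorrector/tree/master/examples/deepcontext): A PyTorch implementation of the DeepContext model for text correction. The architecture follows Stanford University's NLC model, which took first place in a 2014 English error-correction competition. Moderate accuracy.
* [Seq2Seq model](https://github.com/shibing624/pycorrector/tree/master/examples/seq2seq): A PyTorch implementation of the ConvSeq2Seq model for Chinese text correction. The model placed third in the NLPCC-2018 Chinese grammatical error correction competition as a single model; it trains in parallel and converges quickly. Moderate accuracy.
* [T5 model](https://github.com/shibing624/pycorrector/tree/master/examples/t5): A PyTorch implementation of a T5 model for Chinese text correction, fine-tuned from the Langboat/mengzi-t5-base pretrained model on Chinese correction datasets. It leaves considerable room for further adaptation. Good accuracy.
* [ERNIE_CSC model](https://github.com/shibing624/pycorrector/tree/master/examples/ernie_csc): A PaddlePaddle implementation of the ERNIE_CSC model for Chinese text correction, fine-tuned from ERNIE-1.0 with a network structure adapted to Chinese spelling correction. Good accuracy.
* [MacBERT model](https://github.com/shibing624/pycorrector/tree/master/examples/macbert) (recommended): A PyTorch implementation of the MacBERT4CSC model for Chinese text correction. The model adds error detection and correction networks and is adapted to Chinese spelling correction. Good accuracy.
* [MuCGECBart model](https://modelscope.cn/models/iic/nlp_bart_text-error-correction_chinese/summary): A ModelScope-based implementation of the Seq2Seq MuCGECBart model for text correction, which performs fairly well on Chinese text correction.
* [NaSGECBart model](https://github.com/HillZhang1999/NaSGEC): A model by the same authors as MuCGECBart that does not require modelscope. It is fine-tuned from Bart on NaSGEC, a correction dataset of text written by native Chinese speakers. Good accuracy.
* [GPT model](https://github.com/shibing624/pycorrector/tree/master/examples/gpt): A PyTorch implementation of ChatGLM/LLaMA models for Chinese text correction, fine-tuned on Chinese CSC and grammatical error correction datasets. Very good accuracy.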



- Further reading: [Chinese text correction in practice and how it works](https://github.com/shibing624/pycorrector/blob/master/docs/correction_solution.md)

## Demo

- Official demo: https://www.mulanai.com/product/corrector/

- Colab online demo: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zvSyCdiLK_rglfXcIgc539K_Z7bIMpu0?usp=sharing)

- HuggingFace demo: https://huggingface.co/spaces/shibing624/pycorrector

![](https://github.com/shibing624/pycorrector/blob/master/docs/hf.png)

Run [examples/macbert/gradio_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/gradio_demo.py) to launch the demo locally:
```shell
python examples/macbert/gradio_demo.py
```

## Evaluation

Evaluation script: [examples/evaluate_models/evaluate_models.py](https://github.com/shibing624/pycorrector/blob/master/examples/evaluate_models/evaluate_models.py)

- Test sets: SIGHAN-2015 ([sighan2015_test.tsv](https://github.com/shibing624/pycorrector/blob/master/pycorrector/data/sighan2015_test.tsv)), EC-LAW ([ec_law_test.tsv](https://github.com/shibing624/pycorrector/blob/master/examples/data/ec_law_test.tsv)), and MCSC ([mcsc_test.tsv](https://github.com/shibing624/pycorrector/blob/master/examples/data/mcsc_test.tsv))
- Metric: correction precision and recall, computed at strict sentence level. A prediction counts as correct only if the model's corrected sentence is exactly identical to the reference sentence; otherwise it counts as wrong
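
The strict sentence-level metric can be sketched as follows. This is a minimal illustration, not the project's evaluation script: the function name `sentence_level_scores` and the way true/false positives are counted (sentences with an error are positives, error-free sentences are negatives) are assumptions made for the example.

```python
def sentence_level_scores(sources, targets, predictions):
    """Sentence-level precision/recall/F1 under an assumed counting scheme.

    A sentence is a positive when source != target (it contains an error).
    A prediction is correct only if it equals the target exactly.
    """
    tp = fp = fn = tn = 0
    for src, tgt, pred in zip(sources, targets, predictions):
        if src == tgt:            # sentence had no error
            if pred == tgt:
                tn += 1           # correctly left unchanged
            else:
                fp += 1           # changed a correct sentence
        else:                     # sentence had an error
            if pred == tgt:
                tp += 1           # fully corrected
            else:
                fn += 1           # missed or only partly corrected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


sources = ["今天新情很好", "我也很高心。", "天气不错"]
targets = ["今天心情很好", "我也很高兴。", "天气不错"]
predictions = ["今天心情很好", "我也很高心。", "天气不措"]
print(sentence_level_scores(sources, targets, predictions))
```

Because a partly corrected sentence counts as wrong, this metric is harsher than character-level scoring.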

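### Results
- Reported metric: F1
- CSC (Chinese Spelling Correction): spelling-correction models, which handle length-preserving errors such as phonetically similar, visually similar, and grammatical substitutions
- CTC (Chinese Text Correction): text-correction models, which handle the same length-preserving spelling and grammatical errors and also length-changing errors such as extra or missing characters
- GPU: Tesla V100 with 32 GB of memory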

| Model Name       | Model Link                                                                                                              | Base Model                     | Avg        | SIGHAN-2015 | EC-LAW | MCSC   | Device  | QPS     |
|:-----------------|:------------------------------------------------------------------------------------------------------------------------|:-------------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|
| Kenlm-CSC        | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm)                                     | kenlm                          | 0.3409     | 0.3147      | 0.3763 | 0.3317 | CPU     | 9       |
| Mengzi-T5-CSC    | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction)     | mengzi-t5-base                 | 0.3984     | 0.7758      | 0.3156 | 0.1039 | GPU     | 214     |
| ERNIE-CSC        | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353     | 0.8383      | 0.3357 | 0.1318 | GPU     | 114     |
| MacBERT-CSC      | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese)                       | hfl/chinese-macbert-base       | 0.3993     | 0.8314      | 0.1610 | 0.2055 | GPU     | **224** |
| ChatGLM3-6B-CSC  | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora)               | THUDM/chatglm3-6b              | 0.4538     | 0.6572      | 0.4369 | 0.2672 | GPU     | 3       |
| Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b)               | Qwen/Qwen2.5-1.5B-Instruct     | 0.6802     | 0.3032      | 0.7846 | 0.9529 | GPU     | 6       |
| Qwen2.5-7B-CTC   | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b)                   | Qwen/Qwen2.5-7B-Instruct       | **0.8225** | 0.4917      | 0.9798 | 0.9959 | GPU     | 3       |
| Qwen3-4B-CTC     | [twnlp/ChineseErrorCorrector3-4B](https://huggingface.co/twnlp/ChineseErrorCorrector3-4B)                   | Qwen/Qwen3-4B                  | **0.7792** | 0.5270      | 0.8115 | 0.9990 | GPU     | 5       |


## Install

```shell
pip install -U pycorrector
```

Or install from source:

```shell
git clone https://github.com/shibing624/pycorrector.git
cd pycorrector
pip install -r requirements.txt
pip install --no-deps .
```


Either method works. If you would rather not install the dependencies, you can pull the Docker image instead.

* Using Docker

```shell
docker run -it -v ~/.pycorrector:/root/.pycorrector shibing624/pycorrector:0.0.2
```

## Usage
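This project was started partly to compare and survey Chinese text correction methods, in the hope of prompting better ideas.

The project applies kenlm, macbert, seq2seq, ernie_csc, T5, deepcontext, and GPT (Qwen/ChatGLM) models to text correction. Each model can predict directly with an already trained correction model, or be trained and used for prediction on your own data.


### Kenlm model (statistical)
#### Chinese spelling correction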

example: [examples/kenlm/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/demo.py)


```python
from pycorrector import Corrector
m = Corrector()
print(m.correct_batch(['少先队员因该为老人让坐', '你找到你最喜欢的工作，我也很高心。']))
```

output:
```shell
[{'source': '少先队员因该为老人让坐', 'target': '少先队员应该为老人让座', 'errors': [('因该', '应该', 4), ('坐', '座', 10)]},
{'source': '你找到你最喜欢的工作，我也很高心。', 'target': '你找到你最喜欢的工作，我也很高兴。', 'errors': [('心', '兴', 15)]}]
```

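- The `Corrector()` class implements correction with the kenlm statistical model. By default it loads the kenlm language model file from `~/.pycorrector/datasets/zh_giga.no_cna_cmn.prune01244.klm`; if the file is missing, it is downloaded automatically. You can also download the [model file (2.8G)](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) manually and place it at that path
- Return value: `correct` returns a `dict` of the form `{'source': 'original sentence', 'target': 'corrected sentence', 'errors': [('wrong word', 'correct word', error position), ...]}`; `correct_batch` returns a `list` of such `dict`s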
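
To make the return format concrete, here is a small sketch that marks each reported error inline. It does not call pycorrector; it only consumes a result `dict` shaped like the sample output above. The helper name `mark_errors` is hypothetical, and it assumes that the error position is a 0-based character offset into `source`, which is consistent with the sample.

```python
def mark_errors(result):
    """Return the source sentence with each error shown as [wrong->right]."""
    text = result["source"]
    # Apply edits from right to left so earlier offsets stay valid.
    for wrong, right, pos in sorted(result["errors"], key=lambda e: e[2], reverse=True):
        text = text[:pos] + f"[{wrong}->{right}]" + text[pos + len(wrong):]
    return text


result = {
    "source": "少先队员因该为老人让坐",
    "target": "少先队员应该为老人让座",
    "errors": [("因该", "应该", 4), ("坐", "座", 10)],
}
print(mark_errors(result))  # 少先队员[因该->应该]为老人让[坐->座]
```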

#### Error detection

example: [examples/kenlm/detect_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/detect_demo.py)

```python
from pycorrector import Corrector
m = Corrector()
idx_errors = m.detect('少先队员因该为老人让坐')
print(idx_errors)
```

output:

```
[['因该', 4, 6, 'word'], ['坐', 10, 11, 'char']]
```

- Return value: a `list` of `[error_word, begin_pos, end_pos, error_type]` entries; `pos` indices start at 0.
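
A short sketch of how the detection output might be consumed: it checks each span against the sentence and groups the errors by type. The helper name `group_detected_errors` is hypothetical, and the sketch assumes `end_pos` is exclusive, which matches the sample output (`sentence[4:6]` is `因该`).

```python
from collections import defaultdict


def group_detected_errors(sentence, idx_errors):
    """Group detected errors by error_type after validating each span."""
    grouped = defaultdict(list)
    for word, begin, end, err_type in idx_errors:
        # The half-open slice [begin, end) should reproduce the error word.
        if sentence[begin:end] != word:
            raise ValueError(f"span {begin}:{end} does not match {word!r}")
        grouped[err_type].append((word, begin, end))
    return dict(grouped)


sentence = "少先队员因该为老人让坐"
idx_errors = [["因该", 4, 6, "word"], ["坐", 10, 11, "char"]]
print(group_detected_errors(sentence, idx_errors))
```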

#### Idiom and proper-noun correction

example: [examples/kenlm/use_custom_proper.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/use_custom_proper.py)

```python
from pycorrector import Corrector
m = Corrector(proper_name_path='./my_custom_proper.txt')
x = ['报应接中迩来', '这块名表带带相传',]
for i in x:
    print(i, ' -> ', m.correct(i))
```

output:

```
报应接中迩来  ->  {'source': '报应接踵而来', 'target': '报应接踵而来', 'errors': [('接中迩来', '接踵而来', 2)]}
这块名表带带相传  ->  {'source': '这块名表代代相传', 'target': '这块名表代代相传', 'errors': [('带带相传', '代代相传', 4)]}
```


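#### Custom confusion set

Loading a custom confusion set lets you correct known errors. It serves two purposes: 1) (higher precision) whitelisting terms that would otherwise be wrongly corrected; 2) (higher recall) adding corrections that would otherwise be missed.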

example: [examples/kenlm/use_custom_confusion.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/use_custom_confusion.py)

```python
from pycorrector import Corrector

error_sentences = [
    '买iphonex，要多少钱',
    '共同实际控制人萧华、霍荣铨、张旗康',
]
m = Corrector()
print(m.correct_batch(error_sentences))
print('*' * 42)
m = Corrector(custom_confusion_path_or_dict='./my_custom_confusion.txt')
print(m.correct_batch(error_sentences))
```

output:

```
('买iphonex，要多少钱', [])   # missed: "iphonex" should be corrected to "iphoneX"
('共同实际控制人萧华、霍荣铨、张启康', [('张旗康', '张启康', 14)]) # false positive: "张旗康" is correct and should not be changed
******************************************
('买iphonex，要多少钱', [('iphonex', 'iphoneX', 1)])
('共同实际控制人萧华、霍荣铨、张旗康', [])
```

- The file `./my_custom_confusion.txt` has the following format, with the wrong form and the correct form separated by a space:

```
iPhone差 iPhoneX
张旗康 张旗康
```
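
Before handing such a file to `Corrector`, it can be useful to check that every line parses. The sketch below reads the two-column format into a `dict`; the function name `parse_confusion_lines` is hypothetical, and skipping blank lines and `#` comments is an assumption of this example, not documented behavior of pycorrector.

```python
def parse_confusion_lines(lines):
    """Parse 'wrong right' pairs (whitespace-separated) into a dict."""
    mapping = {}
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumed convention: ignore blanks and comments
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(parts)}")
        wrong, right = parts
        mapping[wrong] = right
    return mapping


sample = ["iPhone差 iPhoneX", "张旗康 张旗康"]
print(parse_confusion_lines(sample))
```

Mapping a term to itself, as in the second line, is how the whitelist case is expressed.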

The custom confusion set is implemented by the `ConfusionCorrector` class. Besides being combined with `Corrector` as shown above, it can be combined with `MacBertCorrector` or used on its own. Example code: [examples/macbert/model_correction_pipeline_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/model_correction_pipeline_demo.py)

#### Custom language model

The default kenlm language model, `zh_giga.no_cna_cmn.prune01244.klm`, is a 2.8G file, so `pycorrector` may struggle on machines with little memory.

You can load a kenlm language model you trained yourself, or use the model trained on the 2014 People's Daily corpus, which is small (140M) but slightly less accurate. Download: [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | [people2014corpus_chars.klm (password: o5e9)](https://pan.baidu.com/s/1I2GElyHy_MAdek3YaziFYw).

example: [examples/kenlm/load_custom_language_model.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/load_custom_language_model.py)

```python
from pycorrector import Corrector
model = Corrector(language_model_path='people2014corpus_chars.klm')
print(model.correct('少先队员因该为老人让坐'))
```

#### English spelling correction

Word-level spelling correction for English is supported.

example: [examples/kenlm/en_correct_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/en_correct_demo.py)

```python
from pycorrector import EnSpellCorrector
m = EnSpellCorrector()
sent = "what happending? how to speling it, can you gorrect it?"
print(m.correct(sent))
```

output:

```
{'source': 'what happending? how to speling it, can you gorrect it?', 'target': 'what happening? how to spelling it, can you correct it?', 'errors': [('happending', 'happening', 5), ('speling', 'spelling', 24), ('gorrect', 'correct', 44)]}
```

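#### Simplified/Traditional Chinese conversion

Conversion from Traditional to Simplified Chinese and from Simplified to Traditional Chinese is supported.

example: [examples/kenlm/traditional_simplified_chinese_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/traditional_simplified_chinese_demo.py)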

```python
import pycorrector

traditional_sentence = '憂郁的臺灣烏龜'
simplified_sentence = pycorrector.traditional2simplified(traditional_sentence)
print(traditional_sentence, '=>', simplified_sentence)

simplified_sentence = '忧郁的台湾乌龟'
traditional_sentence = pycorrector.simplified2traditional(simplified_sentence)
print(simplified_sentence, '=>', traditional_sentence)
```

output:

```
憂郁的臺灣烏龜 => 忧郁的台湾乌龟
忧郁的台湾乌龟 => 憂郁的臺灣烏龜
```

#### Command-line mode

Batch text correction with the kenlm method is supported from the command line:

```
python -m pycorrector -h
usage: __main__.py [-h] -o OUTPUT [-n] [-d] input

@description:

positional arguments:
  input                 the input file path, file encode need utf-8.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        the output file path.
  -n, --no_char         disable char detect mode.
  -d, --detail          print detail info
```

Example:

```
python -m pycorrector input.txt -o out.txt -n -d
```

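- Input file: `input.txt`; output file: `out.txt`; character-level correction disabled (`-n`); detailed correction info printed (`-d`); fields in the results are separated by `\t`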


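### MacBert4CSC model

A Chinese spelling correction model built by modifying the MacBERT network structure. The model is released on HuggingFace Models: https://huggingface.co/shibing624/macbert4csc-base-chinese

Network structure:
- The model is a Chinese text correction model that modifies the MacBERT network structure; any BERT-style model can serve as the backbone
- The vanilla BERT model is extended with an extra fully connected layer for error detection ([detection](https://github.com/shibing624/pycorrector/blob/c0f31222b7849c452cc1ec207c71e9954bd6ca08/pycorrector/macbert/macbert4csc.py#L18)). During training, MacBERT4CSC takes a weighted sum of the detection-layer loss and the correction-layer loss as the final loss; at prediction time only the BERT MLM correction weights are needed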
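
The weighted training loss can be sketched in plain Python. This is a minimal illustration of the idea, not the project's PyTorch implementation: the function names, the toy probabilities, and the detection weight `det_weight=0.3` are all assumptions chosen for the example.

```python
import math


def detection_loss(det_probs, det_labels):
    """Mean binary cross-entropy: is each character wrong (1) or right (0)?"""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(det_probs, det_labels)]
    return sum(losses) / len(losses)


def correction_loss(gold_probs):
    """Mean negative log-likelihood of the correct character at each position."""
    return sum(-math.log(p) for p in gold_probs) / len(gold_probs)


def combined_loss(det_probs, det_labels, gold_probs, det_weight=0.3):
    """Weighted sum of the detection and correction losses (weight is assumed)."""
    return (det_weight * detection_loss(det_probs, det_labels)
            + (1 - det_weight) * correction_loss(gold_probs))


det_probs = [0.1, 0.9, 0.2]    # predicted probability that each character is wrong
det_labels = [0, 1, 0]         # 1 marks a character that really is wrong
gold_probs = [0.8, 0.6, 0.9]   # MLM probability assigned to the correct character
print(round(combined_loss(det_probs, det_labels, gold_probs), 4))
```

The detection term only shapes training; as noted above, prediction relies on the correction head alone.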

![macbert_network](https://github.com/shibing624/pycorrector/blob/master/docs/git_image/macbert_network.jpg)

For a detailed tutorial, see [examples/macbert/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/README.md)


#### Quick prediction with pycorrector
example: [examples/macbert/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/demo.py)

```python
from pycorrector import MacBertCorrector
m = MacBertCorrector("shibing624/macbert4csc-base-chinese")
print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作，我也很高心。']))
```

output:

```bash
{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]}
{'source': '你找到你最喜欢的工作，我也很高心。', 'target': '你找到你最喜欢的工作，我也很高兴。', 'errors': [('心', '兴', 15)]}
```

#### Quick prediction with transformers
See [examples/macbert/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/README.md)

### T5 model

A T5-based Chinese spelling correction model. For a detailed training tutorial, see [examples/t5/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/t5/README.md)

#### Quick prediction with pycorrector
example: [examples/t5/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/t5/demo.py)
```python
from pycorrector import T5Corrector
m = T5Corrector()
print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作，我也很高心。']))
```

output:

```
[{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]},
{'source': '你找到你最喜欢的工作，我也很高心。', 'target': '你找到你最喜欢的工作，我也很高兴。', 'errors': [('心', '兴', 15)]}]
```

### GPT model
Correction models fine-tuned from ChatGLM3, Qwen2.5, Qwen3, and similar models. For the training method, see [examples/gpt/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/README.md)

#### Quick prediction with pycorrector

example: [examples/gpt/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/demo.py)
```python
from pycorrector.gpt.gpt_corrector import GptCorrector
m = GptCorrector()
print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作，我也很高心。']))
```

output:
```shell
[{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]},
{'source': '你找到你最喜欢的工作，我也很高心。', 'target': '你找到你最喜欢的工作，我也很高兴。', 'errors': [('心', '兴', 15)]}]
```

### ErnieCSC model

An ERNIE-based Chinese spelling correction model, released in [PaddleNLP](https://bj.bcebos.com/paddlenlp/taskflow/text_correction/csc-ernie-1.0/csc-ernie-1.0.pdparams).

Network structure:

<img src="https://user-images.githubusercontent.com/10826371/131974040-fc84ec04-566f-4310-9839-862bfb27172e.png" width="500" />

For a detailed tutorial, see [examples/ernie_csc/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/ernie_csc/README.md)



#### Quick prediction with pycorrector
example: [examples/ernie_csc/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/ernie_csc/demo.py)
```python
from pycorrector import ErnieCscCorrector

if __name__ == '__main__':
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
    ]
    m = ErnieCscCorrector()
    batch_res = m.correct_batch(error_sentences)
    for i in batch_res:
        print(i)
        print()
```

output:

```
{'source': '真麻烦你了。希望你们好好的跳无', 'target': '真麻烦你了。希望你们好好的跳舞', 'errors': [{'position': 14, 'correction': {'无': '舞'}}]}
{'source': '少先队员因该为老人让坐', 'target': '少先队员应该为老人让座', 'errors': [{'position': 4, 'correction': {'因': '应'}}, {'position': 10, 'correction': {'坐': '座'}}]}
```




### Bart model

The Bart4CSC model, trained on the SIGHAN+Wang271K Chinese correction dataset, is released on HuggingFace Models: https://huggingface.co/shibing624/bart4csc-base-chinese

```python
from transformers import BertTokenizerFast
from textgen import BartSeq2SeqModel

tokenizer = BertTokenizerFast.from_pretrained('shibing624/bart4csc-base-chinese')
model = BartSeq2SeqModel(
    encoder_type='bart',
    encoder_decoder_type='bart',
    encoder_decoder_name='shibing624/bart4csc-base-chinese',
    tokenizer=tokenizer,
    args={"max_length": 128, "eval_batch_size": 128})
sentences = ["少先队员因该为老人让坐"]
print(model.predict(sentences))
```

output:
```shell
['少先队员应该为老人让座']
```

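To train a Bart model, see https://github.com/shibing624/textgen/blob/main/examples/seq2seq/training_bartseq2seq_zh_demo.py


### MuCGECBart model

On first run, the model is downloaded automatically into a subdirectory of `~/.cache/modelscope/hub/`.
Note that this model was tested with python=3.8.19; other versions of the dependencies may cause problems.

#### Install dependencies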
```shell
pip install pycorrector modelscope==1.16.0 fairseq==0.12.2
```

#### Usage example
```python
from pycorrector.mucgec_bart.mucgec_bart_corrector import MuCGECBartCorrector


if __name__ == "__main__":
    m = MuCGECBartCorrector()
    result = m.correct_batch(['这洋的话，下一年的福气来到自己身上。', 
                               '在拥挤时间，为了让人们尊守交通规律，派至少两个警察或者交通管理者。', 
                               '随着中国经济突飞猛近，建造工业与日俱增', 
                               "北京是中国的都。", 
                               "他说：”我最爱的运动是打蓝球“", 
                               "我每天大约喝5次水左右。", 
                               "今天，我非常开开心。"])
    print(result)
```

output:
```shell
[{'source': '这洋的话，下一年的福气来到自己身上。', 'target': '这样的话，下一年的福气就会来到自己身上。', 'errors': [('洋', '样', 1), ('', '就会', 11)]},
{'source': '在拥挤时间，为了让人们尊守交通规律，派至少两个警察或者交通管理者。', 'target': '在拥挤时间，为了让人们遵守交通规则，应该派至少两个警察或者交通管理者。', 'errors': [('尊', '遵', 11), ('律', '则', 16), ('', '应该', 18)]},
{'source': '随着中国经济突飞猛近，建造工业与日俱增', 'target': '随着中国经济突飞猛进，建造工业与日俱增', 'errors': [('近', '进', 9)]},
{'source': '北京是中国的都。', 'target': '北京是中国的首都。', 'errors': [('', '首', 6)]},
{'source': '他说：”我最爱的运动是打蓝球“', 'target': '他说：“我最爱的运动是打篮球”', 'errors': [('”', '“', 3), ('蓝', '篮', 12), ('“', '”', 14)]},
{'source': '我每天大约喝5次水左右。', 'target': '我每天大约喝5杯水左右。', 'errors': [('次', '杯', 7)]},
{'source': '今天，我非常开开心。', 'target': '今天，我非常开心。', 'errors': [('开', '', 7)]}]
```



## Dataset

| Dataset | Corpus | Download link | Archive size |
|:--------|:-------|:-------------:|:------------:|
| **`SIGHAN+Wang271K Chinese correction dataset`** | SIGHAN+Wang271K (270K samples) | [Baidu Netdisk (password: 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ) <br/> [shibing624/CSC](https://huggingface.co/datasets/shibing624/CSC) | 106M |
| **`Original SIGHAN dataset`** | SIGHAN13 14 15 | [Official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html) | 339K |
| **`Original Wang271K dataset`** | Wang271K | [Automatic-Corpus-Generation, provided by dimmywang](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml) | 93M |
| **`People's Daily 2014 corpus`** | People's Daily 2014 | [Feishu (password: cHcu)](https://l6pmn3b1eo.feishu.cn/file/boxcnKpildqIseq1D4IrLwlir7c?from=from_qr_code) | 383M |
| **`NLPCC 2018 GEC official dataset`** | NLPCC2018-GEC | [Official trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) | 114M |
| **`NLPCC 2018+HSK processed corpus`** | nlpcc2018+hsk+CGED | [Baidu Netdisk (password: m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) <br/> [Feishu (password: gl9y)](https://l6pmn3b1eo.feishu.cn/file/boxcnudJgRs5GEMhZwe77YGTQfc?from=from_qr_code) | 215M |
| **`NLPCC 2018+HSK raw corpus`** | HSK+Lang8 | [Baidu Netdisk (password: n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g) <br/> [Feishu (password: Q9LH)](https://l6pmn3b1eo.feishu.cn/file/boxcntebW3NI6OAaqzDUXlZHoDb?from=from_qr_code) | 81M |
| **`Chinese correction competition data collection`** | Chinese Text Correction (CTC) | [Chinese correction collected dataset (Tianchi)](https://tianchi.aliyun.com/dataset/138195) | - |
| **`NLPCC 2023 Chinese grammatical error correction dataset`** | NLPCC 2023 Sharedtask1 | [Task 1: Chinese Grammatical Error Correction (Training Set)](http://tcci.ccf.org.cn/conference/2023/taskdata.php) | 125M |
| **`Baidu intelligent text proofreading competition dataset`** | Real-world Chinese correction data | [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction) | 10M |
| **`2-million-sample Chinese correction dataset`** | Chinese grammar and spelling correction data | [twnlp/ChinseseErrorCorrectData](https://huggingface.co/datasets/twnlp/ChinseseErrorCorrectData) | 2M |



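Notes:

- The SIGHAN+Wang271K Chinese correction dataset (270K samples) is produced by converting the original SIGHAN 13, 14, and 15 datasets and the Wang271K dataset into JSON with error-character positions; the SIGHAN portion is test.json.
  It can be used directly to train the macbert4csc model and reproduce the precision and recall reported in the paper; see [pycorrector/macbert/README.md](pycorrector/macbert/README.md).
- NLPCC 2018 GEC official dataset [NLPCC2018-GEC](http://tcci.ccf.org.cn/conference/2018/taskdata.php):
  the training set [trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) is 114.5MB after decompression; the data is raw text with no word segmentation.
- Original parallel corpus from the Hanyu Shuiping Kaoshi (HSK) and lang8, HSK+Lang8: [Baidu Netdisk (password: n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g). This dataset is already word-segmented and can be used for data augmentation.
- The NLPCC 2018 + HSK + CGED16, 17, 18 data was preprocessed (split into characters, converted from Traditional to Simplified Chinese, and shuffled) to produce the processed correction corpus (nlpcc2018+hsk):
  [Baidu Netdisk (password: m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) [1.3 million sentence pairs, 215MB]

Data format of the SIGHAN+Wang271K Chinese correction dataset: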
```json
[
    {
        "id": "B2-4029-3",
        "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
        "wrong_ids": [
            5,
            31
        ],
        "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。"
    }
]
```

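Field descriptions:
- id: unique identifier, carries no meaning
- original_text: the original text containing errors
- wrong_ids: positions of the wrong characters, starting from 0
- correct_text: the corrected text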
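
The relationship between the fields can be checked with a short sketch: for length-aligned pairs, `wrong_ids` should be exactly the positions where `original_text` and `correct_text` differ. The helper name `diff_positions` is hypothetical; the record is the sample shown above.

```python
def diff_positions(original_text, correct_text):
    """Return 0-based positions where two equal-length texts differ."""
    if len(original_text) != len(correct_text):
        raise ValueError("texts must have the same length for CSC-style records")
    return [i for i, (a, b) in enumerate(zip(original_text, correct_text)) if a != b]


record = {
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [5, 31],
    "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。",
}
positions = diff_positions(record["original_text"], record["correct_text"])
print(positions == record["wrong_ids"])  # True
```

Running this check over a whole file is a cheap way to catch annotation mistakes before training.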

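#### Using your own dataset

You can train a correction model on your own dataset: annotate it, save it in the same JSON format as the training samples, then load the data and train.

1. If you already have many domain-specific error samples, you mainly need to annotate the error positions (wrong_ids) and the corrected sentences (correct_text)
2. If you have no ready-made error samples, you can write a script that generates them (original_text) by replacing the characters at chosen positions (wrong_ids) of a correct sentence with wrong characters, based on features such as phonetic or visual similarity. A third-party homophone generation script is available:
[homophone replacement](https://github.com/dongrixinyu/JioNLP/wiki/%E6%95%B0%E6%8D%AE%E5%A2%9E%E5%BC%BA-%E8%AF%B4%E6%98%8E%E6%96%87%E6%A1%A3#%E5%90%8C%E9%9F%B3%E8%AF%8D%E6%9B%BF%E6%8D%A2)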
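
A script of the kind described in step 2 might look like the sketch below. The confusion map `CONFUSION` is a tiny hypothetical example (a real one would come from a phonetic or visual similarity resource), and `make_error_sample` is an illustrative name, not part of pycorrector.

```python
import random

# Hypothetical confusion map: correct character -> plausible wrong characters.
CONFUSION = {"应": ["因", "英"], "座": ["坐", "做"], "兴": ["心", "新"]}


def make_error_sample(sample_id, correct_text, confusion, rng, max_errors=2):
    """Corrupt up to max_errors characters and return a training record."""
    candidates = [i for i, ch in enumerate(correct_text) if ch in confusion]
    chosen = sorted(rng.sample(candidates, min(max_errors, len(candidates))))
    chars = list(correct_text)
    for i in chosen:
        chars[i] = rng.choice(confusion[chars[i]])  # swap in a confusable character
    return {
        "id": sample_id,
        "original_text": "".join(chars),
        "wrong_ids": chosen,
        "correct_text": correct_text,
    }


rng = random.Random(0)  # fixed seed so the generated data is reproducible
print(make_error_sample("demo-1", "少先队员应该为老人让座", CONFUSION, rng))
```

The output follows the same JSON field layout as the training samples, so generated and annotated data can be mixed.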


### Language Model

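[What is a language model? (wiki)](https://github.com/shibing624/pycorrector/wiki/%E7%BB%9F%E8%AE%A1%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B%E5%8E%9F%E7%90%86)

The language model is crucial to the correction step. The current default is a Chinese language model trained on giga-scale Chinese text, [zh_giga.no_cna_cmn.prune01244.klm (2.8G)](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm).
A lightweight language model trained on the People's Daily 2014 corpus is also provided: [people2014corpus_chars.klm (password: o5e9)](https://pan.baidu.com/s/1I2GElyHy_MAdek3YaziFYw).

You can train a general-purpose language model on corpora such as Chinese Wikipedia (converted from Traditional to Simplified Chinese; `pycorrector.utils.text_utils` provides this), or train a more specialized model on domain corpora. A better-suited language model noticeably improves correction quality.

1. For how to use the kenlm language model training tool, see the blog post: http://blog.csdn.net/mingzai624/article/details/79560063
2. 16GB of unsupervised and parallel Chinese-English corpora: [Linly-AI/Chinese-pretraining-dataset](https://huggingface.co/datasets/Linly-AI/Chinese-pretraining-dataset)
3. 524MB Chinese Wikipedia corpus: [wikipedia-cn-20230720-filtered](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)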
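
To give an intuition for why the language model matters, here is a toy sketch of an n-gram model preferring the more fluent of two candidate sentences. It is a character-bigram model with add-one smoothing trained on three sentences, written only for illustration; it is not kenlm and not how pycorrector scores candidates internally.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sent in corpus:
        chars = ["<s>"] + list(sent) + ["</s>"]
        vocab.update(sent)
        unigrams.update(chars[:-1])
        bigrams.update(zip(chars[:-1], chars[1:]))
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    unigrams, bigrams, vocab_size = model
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(chars[:-1], chars[1:])
    )


model = train_bigram(["今天心情很好", "我的心情不错", "今天天气很好"])
candidates = ["今天新情很好", "今天心情很好"]
best = max(candidates, key=lambda s: log_prob(s, model))
print(best)  # 今天心情很好
```

A model trained on text closer to your domain assigns sharper scores to in-domain phrases, which is why a better-suited corpus helps.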



## Contact

- Github Issue (recommended): [![GitHub issues](https://img.shields.io/github/issues/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/issues)
- Github discussions: feel free to post in the discussion area [![GitHub discussions](https://img.shields.io/github/discussions/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/discussions) (it does not disturb the developers) to discuss correction techniques and problems openly
- Email: xuming: xuming624@qq.com
- WeChat: add *WeChat ID: xuming624* to join the Python-NLP discussion group, with the note *name-company-NLP*


<img src="https://github.com/shibing624/pycorrector/blob/master/docs/git_image/wechat.jpeg" width="200" />

<img src="https://github.com/shibing624/pycorrector/blob/master/docs/git_image/wechat_group.jpg" width="200" />

## Citation

If you use pycorrector in your research, please cite it in the following format:

APA:
```latex
Xu, M. Pycorrector: Text error correction tool (Version 0.4.2) [Computer software]. https://github.com/shibing624/pycorrector
```

BibTeX:
```latex
@misc{Xu_Pycorrector_Text_error,
  title={Pycorrector: Text error correction tool},
  author={Ming Xu},
  year={2023},
  howpublished={\url{https://github.com/shibing624/pycorrector}},
}
```



## License

pycorrector is licensed under the **Apache License 2.0** and is free for commercial use. Please include a link to pycorrector and its license in your product documentation.

## Contribute

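The project code is still rough. If you improve it, you are welcome to contribute your changes back. Before submitting, please note the following two points:

- Add corresponding unit tests under `tests`
- Run all unit tests with `python -m pytest` and make sure they all pass

Then you can submit a PR.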

## References

* [基于文法模型的中文纠错系统](https://blog.csdn.net/mingzai624/article/details/82390382)
* [Norvig’s spelling corrector](http://norvig.com/spell-correct.html)
* [Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape[Yu, 2013]](http://www.aclweb.org/anthology/W/W14/W14-6835.pdf)
* [Chinese Spelling Checker Based on Statistical Machine Translation[Chiu, 2013]](http://www.aclweb.org/anthology/O/O13/O13-1005.pdf)
* [Chinese Word Spelling Correction Based on Rule Induction[yeh, 2014]](http://aclweb.org/anthology/W14-6822)
* [Neural Language Correction with Character-Based Attention[Ziang Xie, 2016]](https://arxiv.org/pdf/1603.09727.pdf)
* [Chinese Spelling Check System Based on Tri-gram Model[Qiang Huang, 2014]](http://www.anthology.aclweb.org/W/W14/W14-6827.pdf)
* [Neural Abstractive Text Summarization with Sequence-to-Sequence Models[Tian Shi, 2018]](https://arxiv.org/abs/1812.02303)
* [基于深度学习的中文文本自动校对研究与实现[杨宗霖, 2019]](https://github.com/shibing624/pycorrector/blob/master/docs/基于深度学习的中文文本自动校对研究与实现.pdf)
* [A Sequence to Sequence Learning for Chinese Grammatical Error Correction[Hongkai Ren, 2018]](https://link.springer.com/chapter/10.1007/978-3-319-99501-4_36)
* [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB)
* [Revisiting Pre-trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922)
* Ruiqing Zhang, Chao Pang et al. "Correcting Chinese Spelling Errors with Phonetic Pre-training", ACL, 2021
* Dingmin Wang et al. "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check", EMNLP, 2018
* [MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction](https://aclanthology.org/2022.naacl-main.227) (Zhang et al., NAACL 2022)

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/shibing624/pycorrector",
    "name": "pycorrector",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "pycorrector, correction, Chinese error correction, NLP",
    "author": "XuMing",
    "author_email": "xuming624@qq.com",
    "download_url": "https://files.pythonhosted.org/packages/3d/b3/63187b2a3fdebc37eaa1a01f0bf9ff81ddd290cbfd6f0b82785126e908df/pycorrector-1.1.3.tar.gz",
    "platform": "Windows",
    "description": "[**\ud83c\udde8\ud83c\uddf3\u4e2d\u6587**](https://github.com/shibing624/pycorrector/blob/master/README.md) | [**\ud83c\udf10English**](https://github.com/shibing624/pycorrector/blob/master/README_EN.md) | [**\ud83d\udcd6\u6587\u6863/Docs**](https://github.com/shibing624/pycorrector/wiki) | [**\ud83e\udd16\u6a21\u578b/Models**](https://huggingface.co/shibing624) \n\n<div align=\"center\">\n  <a href=\"https://github.com/shibing624/pycorrector\">\n    <img src=\"https://github.com/shibing624/pycorrector/blob/master/docs/pycorrector.png\" alt=\"Logo\" height=\"156\">\n  </a>\n</div>\n\n-----------------\n\n# pycorrector: useful python text correction toolkit\n[![PyPI version](https://badge.fury.io/py/pycorrector.svg)](https://badge.fury.io/py/pycorrector)\n[![Downloads](https://static.pepy.tech/badge/pycorrector)](https://pepy.tech/project/pycorrector)\n[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/graphs/contributors)\n[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)\n[![python_vesion](https://img.shields.io/badge/Python-3.8%2B-green.svg)](requirements.txt)\n[![GitHub issues](https://img.shields.io/github/issues/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/issues)\n[![Wechat Group](https://img.shields.io/badge/wechat-group-green.svg?logo=wechat)](#Contact)\n\n\n**pycorrector**: \u4e2d\u6587\u6587\u672c\u7ea0\u9519\u5de5\u5177\u3002\u652f\u6301\u4e2d\u6587\u97f3\u4f3c\u3001\u5f62\u4f3c\u3001\u8bed\u6cd5\u9519\u8bef\u7ea0\u6b63\uff0cpython3.8\u5f00\u53d1\u3002\n\n**pycorrector**\u5b9e\u73b0\u4e86Kenlm\u3001ConvSeq2Seq\u3001BERT\u3001MacBERT\u3001ELECTRA\u3001ERNIE\u3001GPT\u7b49\u591a\u79cd\u6a21\u578b\u7684\u6587\u672c\u7ea0\u9519\uff0c\u8bc4\u4f30\u5404\u6a21\u578b\u7684\u6548\u679c\u3002\n\n**Guide**\n\n- [Features](#Features)\n- [Evaluation](#Evaluation)\n- [Usage](#usage)\n- 
[Dataset](#Dataset)\n- [Contact](#Contact)\n- [References](#references)\n\n## Introduction\n\n\u4e2d\u6587\u6587\u672c\u7ea0\u9519\u4efb\u52a1\uff0c\u5e38\u89c1\u9519\u8bef\u7c7b\u578b\uff1a\n\n<img src=\"https://github.com/shibing624/pycorrector/blob/master/docs/git_image/error_type.png\" width=\"600\" />\n\n\u5f53\u7136\uff0c\u9488\u5bf9\u4e0d\u540c\u4e1a\u52a1\u573a\u666f\uff0c\u8fd9\u4e9b\u95ee\u9898\u5e76\u4e0d\u4e00\u5b9a\u5168\u90e8\u5b58\u5728\uff0c\u6bd4\u5982\u62fc\u97f3\u8f93\u5165\u6cd5\u3001\u8bed\u97f3\u8bc6\u522b\u6821\u5bf9\u5173\u6ce8\u97f3\u4f3c\u9519\u8bef\uff1b\u4e94\u7b14\u8f93\u5165\u6cd5\u3001OCR\u6821\u5bf9\u5173\u6ce8\u5f62\u4f3c\u9519\u8bef\uff0c\n\u641c\u7d22\u5f15\u64cequery\u7ea0\u9519\u5173\u6ce8\u6240\u6709\u9519\u8bef\u7c7b\u578b\u3002\n\n\u672c\u9879\u76ee\u91cd\u70b9\u89e3\u51b3\u5176\u4e2d\u7684\"\u97f3\u4f3c\u3001\u5f62\u5b57\u3001\u8bed\u6cd5\u3001\u4e13\u540d\u9519\u8bef\"\u7b49\u7c7b\u578b\u3002\n\n## News\n[2025/07/08] v1.1.2\u7248\u672c\uff1a\u652f\u6301\u4e86\u57fa\u4e8eQwen3\u7684\u4e2d\u6587\u6587\u672c\u7ea0\u9519\u6a21\u578b[twnlp/ChineseErrorCorrector3-4B](https://huggingface.co/twnlp/ChineseErrorCorrector3-4B)\uff0c\u652f\u6301\u591a\u5b57\u3001\u5c11\u5b57\u3001\u9519\u5b57\u3001\u8bcd\u5e8f\u3001\u8bed\u6cd5\u7b49\u9519\u8bef\u7ea0\u6b63\u3002\u8be6\u89c1[Release-v1.1.2](https://github.com/shibing624/pycorrector/releases/tag/1.1.2)\n\n[2024/10/14] 
v1.1.0: added Qwen2.5-based Chinese text correction models that correct extra-character, missing-character, wrong-character, word-order, and grammatical errors; released the [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) and [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) models along with their corresponding LoRA models. See [Release-v1.1.0](https://github.com/shibing624/pycorrector/releases/tag/1.1.0)\n\n[2023/11/07] v1.0.0: added GPT models such as ChatGLM3 and LLaMA2 for Chinese text correction; released the ChatGLM3-6B-based spelling and grammar correction model [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora); rewrote the implementations of DeepContext, ConvSeq2Seq, T5, and other models. See [Release-v1.0.0](https://github.com/shibing624/pycorrector/releases/tag/1.0.0)\n\n\n## Features\n\n* [Kenlm model](https://github.com/shibing624/pycorrector/tree/master/examples/kenlm): a Chinese n-gram language model trained with the Kenlm statistical language modeling toolkit; combined with rule-based methods and a confusion set, it corrects Chinese spelling errors. Fast and easy to extend; average results\n* [DeepContext model](https://github.com/shibing624/pycorrector/tree/master/examples/deepcontext): a PyTorch implementation of the DeepContext model for text correction; its architecture follows the NLC model from Stanford 
University, which took first place in a 2014 English error correction competition; average results\n* [Seq2Seq model](https://github.com/shibing624/pycorrector/tree/master/examples/seq2seq): a PyTorch implementation of the ConvSeq2Seq model for Chinese text correction. As a single model it placed third in the NLPCC-2018 Chinese grammatical error correction task; it trains in parallel and converges quickly; average results\n* [T5 model](https://github.com/shibing624/pycorrector/tree/master/examples/t5): a PyTorch implementation of the T5 model for Chinese text correction, fine-tuned from the pretrained Langboat/mengzi-t5-base model on Chinese correction datasets. It has considerable room for further adaptation; good results\n* [ERNIE_CSC model](https://github.com/shibing624/pycorrector/tree/master/examples/ernie_csc): a PaddlePaddle implementation of the ERNIE_CSC model for Chinese text correction, fine-tuned from ERNIE-1.0 with an architecture adapted to the Chinese spelling correction task; good results\n* [MacBERT model](https://github.com/shibing624/pycorrector/tree/master/examples/macbert) (recommended): a PyTorch implementation of the MacBERT4CSC model for Chinese text correction; it adds error detection and correction networks and is adapted to the Chinese spelling correction task; good results\n* 
[MuCGECBart model](https://modelscope.cn/models/iic/nlp_bart_text-error-correction_chinese/summary): a ModelScope-based implementation of MuCGECBart, a Seq2Seq model for text correction; it gives fairly good results on Chinese text correction\n* [NaSGECBart model](https://github.com/HillZhang1999/NaSGEC): a model by the same authors as MuCGECBart that does not depend on modelscope; fine-tuned from Bart on NaSGEC, a correction dataset of text written by native Chinese speakers; good results\n* [GPT model](https://github.com/shibing624/pycorrector/tree/master/examples/gpt): a PyTorch implementation of ChatGLM/LLaMA models for Chinese text correction, fine-tuned on Chinese CSC and grammatical error correction datasets and adapted to the Chinese text correction task; very good results\n\n\n\n- Further reading: [Chinese text correction in practice and principle](https://github.com/shibing624/pycorrector/blob/master/docs/correction_solution.md)\n\n## Demo\n\n- Official demo: https://www.mulanai.com/product/corrector/\n\n- Colab online demo: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zvSyCdiLK_rglfXcIgc539K_Z7bIMpu0?usp=sharing)\n\n- HuggingFace demo: https://huggingface.co/spaces/shibing624/pycorrector\n\n![](https://github.com/shibing624/pycorrector/blob/master/docs/hf.png)\n\nRun the example [examples/macbert/gradio_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/gradio_demo.py) to see the demo:\n```shell\npython examples/macbert/gradio_demo.py\n```\n\n## 
Evaluation\n\nEvaluation script: [examples/evaluate_models/evaluate_models.py](https://github.com/shibing624/pycorrector/blob/master/examples/evaluate_models/evaluate_models.py)\n\n- Test sets: SIGHAN-2015 ([sighan2015_test.tsv](https://github.com/shibing624/pycorrector/blob/master/pycorrector/data/sighan2015_test.tsv)), EC-LAW ([ec_law_test.tsv](https://github.com/shibing624/pycorrector/blob/master/examples/data/ec_law_test.tsv)), MCSC ([mcsc_test.tsv](https://github.com/shibing624/pycorrector/blob/master/examples/data/mcsc_test.tsv))\n- Evaluation criterion: correction precision and recall, computed at strict sentence level: a prediction counts as correct only if the corrected sentence is exactly identical to the reference sentence, and as wrong otherwise\n\n### Results\n- Metric: F1\n- CSC (Chinese Spelling Correction): spelling correction models, which handle length-preserving errors such as phonetically similar, visually similar, and grammatical errors\n- CTC (Chinese Text Correction): text correction models, which handle length-preserving spelling and grammatical errors as well as length-changing errors such as extra or missing characters\n- GPU: Tesla V100 with 32 GB of memory\n\n| Model Name       | Model Link                                                                                                              | Base Model                     | Avg        | SIGHAN-2015 | EC-LAW | MCSC   | GPU | QPS     
|\n|:-----------------|:------------------------------------------------------------------------------------------------------------------------|:-------------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|\n| Kenlm-CSC        | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm)                                     | kenlm                          | 0.3409     | 0.3147      | 0.3763 | 0.3317 | CPU     | 9       |\n| Mengzi-T5-CSC    | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction)     | mengzi-t5-base                 | 0.3984     | 0.7758      | 0.3156 | 0.1039 | GPU     | 214     |\n| ERNIE-CSC        | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353     | 0.8383      | 0.3357 | 0.1318 | GPU     | 114     |\n| MacBERT-CSC      | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese)                       | hfl/chinese-macbert-base       | 0.3993     | 0.8314      | 0.1610 | 0.2055 | GPU     | **224** |\n| ChatGLM3-6B-CSC  | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora)               | THUDM/chatglm3-6b              | 0.4538     | 0.6572      | 0.4369 | 0.2672 | GPU     | 3       |\n| Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b)               | Qwen/Qwen2.5-1.5B-Instruct     | 0.6802     | 0.3032      | 0.7846 | 0.9529 | GPU     | 6       |\n| Qwen2.5-7B-CTC   | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b)                   | Qwen/Qwen2.5-7B-Instruct       | **0.8225** | 0.4917      | 0.9798 | 0.9959 | GPU     | 3       |\n| Qwen3-4B-CTC     | 
[twnlp/ChineseErrorCorrector3-4B](https://huggingface.co/twnlp/ChineseErrorCorrector3-4B)                   | Qwen/Qwen3-4B                  | **0.7792** | 0.5270      | 0.8115 | 0.9990 | GPU     | 5       |\n\n\n## Install\n\n```shell\npip install -U pycorrector\n```\n\nor\n\n```shell\ngit clone https://github.com/shibing624/pycorrector.git\ncd pycorrector\npip install -r requirements.txt\npip install --no-deps .\n```\n\n\nEither of the two methods above completes the installation. If you would rather not install the dependencies, you can pull the docker image instead.\n\n* Using docker\n\n```shell\ndocker run -it -v ~/.pycorrector:/root/.pycorrector shibing624/pycorrector:0.0.2\n```\n\n## Usage\nOne of the original goals of this project is to compare and survey Chinese text correction methods, as a starting point for further work.\n\nThe project applies kenlm, macbert, seq2seq, ernie_csc, T5, deepcontext, GPT (Qwen/ChatGLM), and other models to the text correction task. Each model can predict directly from an already trained correction model, or be trained on your own data and then used for prediction.\n\n\n### kenlm model (statistical model)\n#### Chinese spelling correction\n\nexample: [examples/kenlm/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/demo.py)\n\n\n```python\nfrom pycorrector import Corrector\nm = Corrector()\nprint(m.correct_batch(['\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750', '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5fc3\u3002']))\n```\n\noutput:\n```shell\n[{'source': '\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750', 
'target': '\u5c11\u5148\u961f\u5458\u5e94\u8be5\u4e3a\u8001\u4eba\u8ba9\u5ea7', 'errors': [('\u56e0\u8be5', '\u5e94\u8be5', 4), ('\u5750', '\u5ea7', 10)]},\n{'source': '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5fc3\u3002', 'target': '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5174\u3002', 'errors': [('\u5fc3', '\u5174', 15)]}]\n```\n\n- The `Corrector()` class implements correction with the kenlm statistical model. By default it loads the kenlm language model file from `~/.pycorrector/datasets/zh_giga.no_cna_cmn.prune01244.klm`; if the file is not found, the program downloads it automatically. You can also download the [model file (2.8G)](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) manually and place it at that path\n- Return value: the `correct` method returns a `dict` of the form {'source': 'original sentence', 'target': 'corrected sentence', 'errors': [('wrong word', 'correct word', 'error position'), ...]}; the `correct_batch` method returns a `list` of such `dict`s\n\n#### Error detection\n\nexample: [examples/kenlm/detect_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/detect_demo.py)\n\n```python\nfrom pycorrector import Corrector\nm = Corrector()\nidx_errors = m.detect('\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750')\nprint(idx_errors)\n```\n\noutput:\n\n```\n[['\u56e0\u8be5', 4, 6, 'word'], ['\u5750', 10, 11, 'char']]\n```\n\n- Return value: a `list` of `[error_word, begin_pos, end_pos, error_type]` entries; `pos` indices start at 0.\n\n#### Idiom and proper-name correction\n\nexample: 
[examples/kenlm/use_custom_proper.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/use_custom_proper.py)\n\n```python\nfrom pycorrector import Corrector\nm = Corrector(proper_name_path='./my_custom_proper.txt')\nx = ['\u62a5\u5e94\u63a5\u4e2d\u8fe9\u6765', '\u8fd9\u5757\u540d\u8868\u5e26\u5e26\u76f8\u4f20',]\nfor i in x:\n    print(i, ' -> ', m.correct(i))\n```\n\noutput:\n\n```\n\u62a5\u5e94\u63a5\u4e2d\u8fe9\u6765  ->  {'source': '\u62a5\u5e94\u63a5\u8e35\u800c\u6765', 'target': '\u62a5\u5e94\u63a5\u8e35\u800c\u6765', 'errors': [('\u63a5\u4e2d\u8fe9\u6765', '\u63a5\u8e35\u800c\u6765', 2)]}\n\u8fd9\u5757\u540d\u8868\u5e26\u5e26\u76f8\u4f20  ->  {'source': '\u8fd9\u5757\u540d\u8868\u4ee3\u4ee3\u76f8\u4f20', 'target': '\u8fd9\u5757\u540d\u8868\u4ee3\u4ee3\u76f8\u4f20', 'errors': [('\u5e26\u5e26\u76f8\u4f20', '\u4ee3\u4ee3\u76f8\u4f20', 4)]}\n```\n\n\n#### Custom confusion set\n\nLoading a custom confusion set lets users correct known errors. It serves two purposes: 1) improving precision by whitelisting false positives; 2) improving recall by adding corrections the model misses.\n\nexample: [examples/kenlm/use_custom_confusion.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/use_custom_confusion.py)\n\n```python\nfrom pycorrector import Corrector\n\nerror_sentences = [\n    '\u4e70iphonex\uff0c\u8981\u591a\u5c11\u94b1',\n    '\u5171\u540c\u5b9e\u9645\u63a7\u5236\u4eba\u8427\u534e\u3001\u970d\u8363\u94e8\u3001\u5f20\u65d7\u5eb7',\n]\nm = Corrector()\nprint(m.correct_batch(error_sentences))\nprint('*' * 42)\nm = Corrector(custom_confusion_path_or_dict='./my_custom_confusion.txt')\nprint(m.correct_batch(error_sentences))\n```\n\noutput:\n\n```\n('\u4e70iphonex\uff0c\u8981\u591a\u5c11\u94b1', [])   # 
\"iphonex\"\u6f0f\u53ec\uff0c\u5e94\u8be5\u662f\"iphoneX\"\n('\u5171\u540c\u5b9e\u9645\u63a7\u5236\u4eba\u8427\u534e\u3001\u970d\u8363\u94e8\u3001\u5f20\u542f\u5eb7', [('\u5f20\u65d7\u5eb7', '\u5f20\u542f\u5eb7', 14)]) # \"\u5f20\u542f\u5eb7\"\u8bef\u6740\uff0c\u5e94\u8be5\u4e0d\u7528\u7ea0\n*****************************************************\n('\u4e70iphonex\uff0c\u8981\u591a\u5c11\u94b1', [('iphonex', 'iphoneX', 1)])\n('\u5171\u540c\u5b9e\u9645\u63a7\u5236\u4eba\u8427\u534e\u3001\u970d\u8363\u94e8\u3001\u5f20\u65d7\u5eb7', [])\n```\n\n- \u5176\u4e2d`./my_custom_confusion.txt`\u7684\u5185\u5bb9\u683c\u5f0f\u5982\u4e0b\uff0c\u4ee5\u7a7a\u683c\u95f4\u9694\uff1a\n\n```\niPhone\u5dee iPhoneX\n\u5f20\u65d7\u5eb7 \u5f20\u65d7\u5eb7\n```\n\n\u81ea\u5b9a\u4e49\u6df7\u6dc6\u96c6`ConfusionCorrector`\u7c7b\uff0c\u9664\u4e86\u4e0a\u9762\u6f14\u793a\u7684\u548c`Corrector`\u7c7b\u4e00\u8d77\u4f7f\u7528\uff0c\u8fd8\u53ef\u4ee5\u548c`MacBertCorrector`\u4e00\u8d77\u4f7f\u7528\uff0c\u4e5f\u53ef\u4ee5\u72ec\u7acb\u4f7f\u7528\u3002\u793a\u4f8b\u4ee3\u7801 [examples/macbert/model_correction_pipeline_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/model_correction_pipeline_demo.py)\n\n#### \u81ea\u5b9a\u4e49\u8bed\u8a00\u6a21\u578b\n\n\u9ed8\u8ba4\u63d0\u4f9b\u4e0b\u8f7d\u5e76\u4f7f\u7528\u7684kenlm\u8bed\u8a00\u6a21\u578b`zh_giga.no_cna_cmn.prune01244.klm`\u6587\u4ef6\u662f2.8G\uff0c\u5185\u5b58\u5c0f\u7684\u7535\u8111\u4f7f\u7528`pycorrector`\u7a0b\u5e8f\u53ef\u80fd\u4f1a\u5403\u529b\u4e9b\u3002\n\n\u652f\u6301\u7528\u6237\u52a0\u8f7d\u81ea\u5df1\u8bad\u7ec3\u7684kenlm\u8bed\u8a00\u6a21\u578b\uff0c\u6216\u4f7f\u75282014\u7248\u4eba\u6c11\u65e5\u62a5\u6570\u636e\u8bad\u7ec3\u7684\u6a21\u578b\uff0c\u6a21\u578b\u5c0f\uff08140M\uff09\uff0c\u51c6\u786e\u7387\u7a0d\u4f4e\uff0c\u6a21\u578b\u4e0b\u8f7d\u5730\u5740\uff1a[shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | 
[people2014corpus_chars.klm (password o5e9)](https://pan.baidu.com/s/1I2GElyHy_MAdek3YaziFYw).\n\nexample: [examples/kenlm/load_custom_language_model.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/load_custom_language_model.py)\n\n```python\nfrom pycorrector import Corrector\nmodel = Corrector(language_model_path='people2014corpus_chars.klm')\nprint(model.correct('\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750'))\n```\n\n#### English spelling correction\n\nCorrects spelling errors at the English word level.\n\nexample: [examples/kenlm/en_correct_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/en_correct_demo.py)\n\n```python\nfrom pycorrector import EnSpellCorrector\nm = EnSpellCorrector()\nsent = \"what happending? how to speling it, can you gorrect it?\"\nprint(m.correct(sent))\n```\n\noutput:\n\n```\n{'source': 'what happending? how to speling it, can you gorrect it?', 'target': 'what happening? 
how to spelling it, can you correct it?', 'errors': [('happending', 'happening', 5), ('speling', 'spelling', 24), ('gorrect', 'correct', 44)]}\n```\n\n#### Simplified and Traditional Chinese conversion\n\nConverts Traditional Chinese to Simplified Chinese and Simplified Chinese to Traditional Chinese.\n\nexample: [examples/kenlm/traditional_simplified_chinese_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/kenlm/traditional_simplified_chinese_demo.py)\n\n```python\nimport pycorrector\n\ntraditional_sentence = '\u6182\u90c1\u7684\u81fa\u7063\u70cf\u9f9c'\nsimplified_sentence = pycorrector.traditional2simplified(traditional_sentence)\nprint(traditional_sentence, '=>', simplified_sentence)\n\nsimplified_sentence = '\u5fe7\u90c1\u7684\u53f0\u6e7e\u4e4c\u9f9f'\ntraditional_sentence = pycorrector.simplified2traditional(simplified_sentence)\nprint(simplified_sentence, '=>', traditional_sentence)\n```\n\noutput:\n\n```\n\u6182\u90c1\u7684\u81fa\u7063\u70cf\u9f9c => \u5fe7\u90c1\u7684\u53f0\u6e7e\u4e4c\u9f9f\n\u5fe7\u90c1\u7684\u53f0\u6e7e\u4e4c\u9f9f => \u6182\u90c1\u7684\u81fa\u7063\u70cf\u9f9c\n```\n\n#### Command-line mode\n\nSupports batch text correction with the kenlm method.\n\n```\npython -m pycorrector -h\nusage: __main__.py [-h] -o OUTPUT [-n] [-d] input\n\n@description:\n\npositional arguments:\n  input                 the input file path, file encode need utf-8.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -o OUTPUT, --output OUTPUT\n                        the output file path.\n  -n, --no_char         disable char detect mode.\n  -d, --detail          print detail info\n```\n\nExample:\n\n```\npython -m pycorrector input.txt -o out.txt -n -d\n```\n\n- Input file: `input.txt`; output file: `out.txt 
`; character-level correction is disabled; detailed correction information is printed; correction results are separated by `\\t`\n\n\n### MacBert4CSC model\n\nA Chinese spelling correction model built on MacBERT with a modified network structure. The model is open-sourced on HuggingFace Models: https://huggingface.co/shibing624/macbert4csc-base-chinese\n\nNetwork structure:\n- This is a Chinese text correction model that modifies the MacBERT network structure; any BERT-style model can serve as the backbone\n- A fully connected layer is added on top of the native BERT model for error detection, i.e. [detection](https://github.com/shibing624/pycorrector/blob/c0f31222b7849c452cc1ec207c71e9954bd6ca08/pycorrector/macbert/macbert4csc.py#L18). During training, MacBERT4CSC takes a weighted sum of the detection-layer loss and the correction-layer loss as the final loss; at prediction time only the correction weights of the BERT MLM are used\n\n![macbert_network](https://github.com/shibing624/pycorrector/blob/master/docs/git_image/macbert_network.jpg)\n\nFor a detailed tutorial, see [examples/macbert/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/README.md)\n\n\n#### Quick prediction with pycorrector\nexample: [examples/macbert/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/demo.py)\n\n```python\nfrom pycorrector import MacBertCorrector\nm = MacBertCorrector(\"shibing624/macbert4csc-base-chinese\")\nprint(m.correct_batch(['\u4eca\u5929\u65b0\u60c5\u5f88\u597d', '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5fc3\u3002']))\n```\n\noutput:\n\n```bash\n{'source': 
'\u4eca\u5929\u65b0\u60c5\u5f88\u597d', 'target': '\u4eca\u5929\u5fc3\u60c5\u5f88\u597d', 'errors': [('\u65b0', '\u5fc3', 2)]}\n{'source': '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5fc3\u3002', 'target': '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5174\u3002', 'errors': [('\u5fc3', '\u5174', 15)]}\n```\n\n#### Quick prediction with transformers\nSee [examples/macbert/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/README.md)\n\n### T5 model\n\nA T5-based Chinese spelling correction model. For a detailed training tutorial, see [examples/t5/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/t5/README.md)\n\n#### Quick prediction with pycorrector\nexample: [examples/t5/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/t5/demo.py)\n```python\nfrom pycorrector import T5Corrector\nm = T5Corrector()\nprint(m.correct_batch(['\u4eca\u5929\u65b0\u60c5\u5f88\u597d', '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5fc3\u3002']))\n```\n\noutput:\n\n```\n[{'source': '\u4eca\u5929\u65b0\u60c5\u5f88\u597d', 'target': '\u4eca\u5929\u5fc3\u60c5\u5f88\u597d', 'errors': [('\u65b0', '\u5fc3', 2)]},\n{'source': '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5fc3\u3002', 'target': '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5174\u3002', 'errors': [('\u5fc3', '\u5174', 15)]}]\n```\n\n### GPT model\nCorrection models fine-tuned from ChatGLM3, Qwen2.5, Qwen3, and other models. For the training method, see [examples/gpt/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/README.md)\n\n#### Quick prediction with pycorrector\n\nexample: 
[examples/gpt/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/demo.py)\n```python\nfrom pycorrector.gpt.gpt_corrector import GptCorrector\nm = GptCorrector()\nprint(m.correct_batch(['\u4eca\u5929\u65b0\u60c5\u5f88\u597d', '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5fc3\u3002']))\n```\n\noutput:\n```shell\n[{'source': '\u4eca\u5929\u65b0\u60c5\u5f88\u597d', 'target': '\u4eca\u5929\u5fc3\u60c5\u5f88\u597d', 'errors': [('\u65b0', '\u5fc3', 2)]},\n{'source': '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5fc3\u3002', 'target': '\u4f60\u627e\u5230\u4f60\u6700\u559c\u6b22\u7684\u5de5\u4f5c\uff0c\u6211\u4e5f\u5f88\u9ad8\u5174\u3002', 'errors': [('\u5fc3', '\u5174', 15)]}]\n```\n\n### ErnieCSC model\n\nAn ERNIE-based Chinese spelling correction model, open-sourced in [PaddleNLP](https://bj.bcebos.com/paddlenlp/taskflow/text_correction/csc-ernie-1.0/csc-ernie-1.0.pdparams).\n\nNetwork structure:\n\n<img src=\"https://user-images.githubusercontent.com/10826371/131974040-fc84ec04-566f-4310-9839-862bfb27172e.png\" width=\"500\" />\n\nFor a detailed tutorial, see [examples/ernie_csc/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/ernie_csc/README.md)\n\n\n\n#### Quick prediction with pycorrector\nexample: [examples/ernie_csc/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/ernie_csc/demo.py)\n```python\nfrom pycorrector import ErnieCscCorrector\n\nif __name__ == '__main__':\n    error_sentences = [\n        '\u771f\u9ebb\u70e6\u4f60\u4e86\u3002\u5e0c\u671b\u4f60\u4eec\u597d\u597d\u7684\u8df3\u65e0',\n        '\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750',\n    ]\n    m = ErnieCscCorrector()\n    batch_res = m.correct_batch(error_sentences)\n    for i in batch_res:\n        print(i)\n        print()\n```\n\noutput:\n\n```\n{'source': 
'\u771f\u9ebb\u70e6\u4f60\u4e86\u3002\u5e0c\u671b\u4f60\u4eec\u597d\u597d\u7684\u8df3\u65e0', 'target': '\u771f\u9ebb\u70e6\u4f60\u4e86\u3002\u5e0c\u671b\u4f60\u4eec\u597d\u597d\u7684\u8df3\u821e', 'errors': [{'position': 14, 'correction': {'\u65e0': '\u821e'}}]}\n{'source': '\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750', 'target': '\u5c11\u5148\u961f\u5458\u5e94\u8be5\u4e3a\u8001\u4eba\u8ba9\u5ea7', 'errors': [{'position': 4, 'correction': {'\u56e0': '\u5e94'}}, {'position': 10, 'correction': {'\u5750': '\u5ea7'}}]}\n```\n\n\n\n\n### Bart model\n\nA Bart4CSC model trained on the SIGHAN+Wang271K Chinese correction dataset, released on HuggingFace Models: https://huggingface.co/shibing624/bart4csc-base-chinese\n\n```python\nfrom transformers import BertTokenizerFast\nfrom textgen import BartSeq2SeqModel\n\ntokenizer = BertTokenizerFast.from_pretrained('shibing624/bart4csc-base-chinese')\nmodel = BartSeq2SeqModel(\n    encoder_type='bart',\n    encoder_decoder_type='bart',\n    encoder_decoder_name='shibing624/bart4csc-base-chinese',\n    tokenizer=tokenizer,\n    args={\"max_length\": 128, \"eval_batch_size\": 128})\nsentences = [\"\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750\"]\nprint(model.predict(sentences))\n```\n\noutput:\n```shell\n['\u5c11\u5148\u961f\u5458\u5e94\u8be5\u4e3a\u8001\u4eba\u8ba9\u5ea7']\n```\n\nTo train the Bart model, see https://github.com/shibing624/textgen/blob/main/examples/seq2seq/training_bartseq2seq_zh_demo.py\n\n\n### 
MuCGECBart model\n\nOn first run, the model is downloaded automatically into a subdirectory of \"~/.cache/modelscope/hub/\".\nNote: the model was tested under python=3.8.19; other versions of the dependencies may cause problems.\n\n#### Install dependencies\n```shell\npip install pycorrector modelscope==1.16.0 fairseq==0.12.2\n```\n\n#### Usage example\n```python\nfrom pycorrector.mucgec_bart.mucgec_bart_corrector import MuCGECBartCorrector\n\n\nif __name__ == \"__main__\":\n    m = MuCGECBartCorrector()\n    result = m.correct_batch(['\u8fd9\u6d0b\u7684\u8bdd\uff0c\u4e0b\u4e00\u5e74\u7684\u798f\u6c14\u6765\u5230\u81ea\u5df1\u8eab\u4e0a\u3002', \n                               '\u5728\u62e5\u6324\u65f6\u95f4\uff0c\u4e3a\u4e86\u8ba9\u4eba\u4eec\u5c0a\u5b88\u4ea4\u901a\u89c4\u5f8b\uff0c\u6d3e\u81f3\u5c11\u4e24\u4e2a\u8b66\u5bdf\u6216\u8005\u4ea4\u901a\u7ba1\u7406\u8005\u3002', \n                               '\u968f\u7740\u4e2d\u56fd\u7ecf\u6d4e\u7a81\u98de\u731b\u8fd1\uff0c\u5efa\u9020\u5de5\u4e1a\u4e0e\u65e5\u4ff1\u589e', \n                               \"\u5317\u4eac\u662f\u4e2d\u56fd\u7684\u90fd\u3002\", \n                               \"\u4ed6\u8bf4\uff1a\u201d\u6211\u6700\u7231\u7684\u8fd0\u52a8\u662f\u6253\u84dd\u7403\u201c\", \n                               \"\u6211\u6bcf\u5929\u5927\u7ea6\u559d5\u6b21\u6c34\u5de6\u53f3\u3002\", \n                               \"\u4eca\u5929\uff0c\u6211\u975e\u5e38\u5f00\u5f00\u5fc3\u3002\"])\n    print(result)\n```\n\noutput:\n```shell\n[{'source': '\u8fd9\u6d0b\u7684\u8bdd\uff0c\u4e0b\u4e00\u5e74\u7684\u798f\u6c14\u6765\u5230\u81ea\u5df1\u8eab\u4e0a\u3002', 'target': '\u8fd9\u6837\u7684\u8bdd\uff0c\u4e0b\u4e00\u5e74\u7684\u798f\u6c14\u5c31\u4f1a\u6765\u5230\u81ea\u5df1\u8eab\u4e0a\u3002', 'errors': [('\u6d0b', '\u6837', 1), ('', '\u5c31\u4f1a', 11)]},\n{'source': 
'\u5728\u62e5\u6324\u65f6\u95f4\uff0c\u4e3a\u4e86\u8ba9\u4eba\u4eec\u5c0a\u5b88\u4ea4\u901a\u89c4\u5f8b\uff0c\u6d3e\u81f3\u5c11\u4e24\u4e2a\u8b66\u5bdf\u6216\u8005\u4ea4\u901a\u7ba1\u7406\u8005\u3002', 'target': '\u5728\u62e5\u6324\u65f6\u95f4\uff0c\u4e3a\u4e86\u8ba9\u4eba\u4eec\u9075\u5b88\u4ea4\u901a\u89c4\u5219\uff0c\u5e94\u8be5\u6d3e\u81f3\u5c11\u4e24\u4e2a\u8b66\u5bdf\u6216\u8005\u4ea4\u901a\u7ba1\u7406\u8005\u3002', 'errors': [('\u5c0a', '\u9075', 11), ('\u5f8b', '\u5219', 16), ('', '\u5e94\u8be5', 18)]},\n{'source': '\u968f\u7740\u4e2d\u56fd\u7ecf\u6d4e\u7a81\u98de\u731b\u8fd1\uff0c\u5efa\u9020\u5de5\u4e1a\u4e0e\u65e5\u4ff1\u589e', 'target': '\u968f\u7740\u4e2d\u56fd\u7ecf\u6d4e\u7a81\u98de\u731b\u8fdb\uff0c\u5efa\u9020\u5de5\u4e1a\u4e0e\u65e5\u4ff1\u589e', 'errors': [('\u8fd1', '\u8fdb', 9)]},\n{'source': '\u5317\u4eac\u662f\u4e2d\u56fd\u7684\u90fd\u3002', 'target': '\u5317\u4eac\u662f\u4e2d\u56fd\u7684\u9996\u90fd\u3002', 'errors': [('', '\u9996', 6)]},\n{'source': '\u4ed6\u8bf4\uff1a\u201d\u6211\u6700\u7231\u7684\u8fd0\u52a8\u662f\u6253\u84dd\u7403\u201c', 'target': '\u4ed6\u8bf4\uff1a\u201c\u6211\u6700\u7231\u7684\u8fd0\u52a8\u662f\u6253\u7bee\u7403\u201d', 'errors': [('\u201d', '\u201c', 3), ('\u84dd', '\u7bee', 12), ('\u201c', '\u201d', 14)]},\n{'source': '\u6211\u6bcf\u5929\u5927\u7ea6\u559d5\u6b21\u6c34\u5de6\u53f3\u3002', 'target': '\u6211\u6bcf\u5929\u5927\u7ea6\u559d5\u676f\u6c34\u5de6\u53f3\u3002', 'errors': [('\u6b21', '\u676f', 7)]},\n{'source': '\u4eca\u5929\uff0c\u6211\u975e\u5e38\u5f00\u5f00\u5fc3\u3002', 'target': '\u4eca\u5929\uff0c\u6211\u975e\u5e38\u5f00\u5fc3\u3002', 'errors': [('\u5f00', '', 7)]}]\n```\n\n\n\n## Dataset\n\n| \u6570\u636e\u96c6                          | \u8bed\u6599                           |                                                                                \u4e0b\u8f7d\u94fe\u63a5                                                                                 | \u538b\u7f29\u5305\u5927\u5c0f 
|\n|:-----------------------------|:-----------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----:|\n| **`SIGHAN+Wang271K Chinese Correction Dataset`** | SIGHAN+Wang271K (270K samples) | [Baidu Netdisk (password: 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ) <br/> [shibing624/CSC](https://huggingface.co/datasets/shibing624/CSC) | 106M  |\n| **`Original SIGHAN Dataset`** | SIGHAN13 14 15 | [Official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html) | 339K  |\n| **`Original Wang271K Dataset`** | Wang271K | [Automatic-Corpus-Generation, provided by dimmywang](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml) |  93M  |\n| **`People's Daily 2014 Corpus`** | People's Daily 2014 | [Feishu (password: cHcu)](https://l6pmn3b1eo.feishu.cn/file/boxcnKpildqIseq1D4IrLwlir7c?from=from_qr_code) | 383M  |\n| **`NLPCC 2018 GEC Official Dataset`** | NLPCC2018-GEC | [Official trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) | 114M  |\n| **`NLPCC 2018+HSK Processed Corpus`** | nlpcc2018+hsk+CGED | [Baidu Netdisk (password: m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) <br/> 
[Feishu (password: gl9y)](https://l6pmn3b1eo.feishu.cn/file/boxcnudJgRs5GEMhZwe77YGTQfc?from=from_qr_code) | 215M  |\n| **`NLPCC 2018+HSK Raw Corpus`** | HSK+Lang8 | [Baidu Netdisk (password: n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g) <br/> [Feishu (password: Q9LH)](https://l6pmn3b1eo.feishu.cn/file/boxcntebW3NI6OAaqzDUXlZHoDb?from=from_qr_code) |  81M  |\n| **`Chinese Correction Competition Data Collection`** | Chinese Text Correction (CTC) | [Chinese Correction Dataset Collection (Tianchi)](https://tianchi.aliyun.com/dataset/138195) |   -   |\n| **`NLPCC 2023 Chinese Grammatical Error Correction Dataset`** | NLPCC 2023 Sharedtask1 | [Task 1: Chinese Grammatical Error Correction (Training Set)](http://tcci.ccf.org.cn/conference/2023/taskdata.php) | 125M  |\n| **`Baidu Intelligent Text Proofreading Competition Dataset`** | Real-world Chinese error correction data | [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction) |  10M  |\n| **`2M Chinese Correction Dataset`** | Chinese grammar and spelling correction data | [twnlp/ChinseseErrorCorrectData](https://huggingface.co/datasets/twnlp/ChinseseErrorCorrectData) |  2M   |\n\n\n\nNotes:\n\n- 
The SIGHAN+Wang271K Chinese correction dataset (270K samples) was obtained by converting the original SIGHAN 2013, 2014, and 2015 datasets and the Wang271K dataset into a common format. It is in JSON format and includes the positions of erroneous characters; the SIGHAN data is test.json.\n  The macbert4csc model can be trained directly on this dataset to reproduce the precision and recall results from the paper; see [pycorrector/macbert/README.md](pycorrector/macbert/README.md) for details.\n- NLPCC 2018 GEC official dataset [NLPCC2018-GEC](http://tcci.ccf.org.cn/conference/2018/taskdata.php),\n  training set [trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) [114.5MB after decompression]. The data is raw text with no word segmentation applied.\n- Original parallel corpus from the Hanyu Shuiping Kaoshi (HSK) and lang8, [HSK+Lang8][Baidu Netdisk (password: n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g). This dataset is already word-segmented and can be used for data augmentation.\n- Data from NLPCC 2018 + HSK + CGED16, 17, and 18, preprocessed by splitting into characters, converting Traditional Chinese to Simplified Chinese, and shuffling, yields the processed corpus used for correction (nlpcc2018+hsk),\n  [Baidu Netdisk (password: m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) [1.3 million sentence pairs, 215MB]\n\nData format of the SIGHAN+Wang271K Chinese correction dataset:\n```json\n[\n    {\n        \"id\": \"B2-4029-3\",\n        \"original_text\": 
\"\u665a\u95f4\u4f1a\u542c\u5230\u55d3\u97f3\uff0c\u767d\u5929\u7684\u65f6\u5019\u5927\u5bb6\u90fd\u4e0d\u4f1a\u592a\u5728\u610f\uff0c\u4f46\u662f\u5728\u7761\u89c9\u7684\u65f6\u5019\u8fd9\u55d3\u97f3\u6210\u4e3a\u5927\u5bb6\u7684\u6076\u68a6\u3002\",\n        \"wrong_ids\": [\n            5,\n            31\n        ],\n        \"correct_text\": \"\u665a\u95f4\u4f1a\u542c\u5230\u566a\u97f3\uff0c\u767d\u5929\u7684\u65f6\u5019\u5927\u5bb6\u90fd\u4e0d\u4f1a\u592a\u5728\u610f\uff0c\u4f46\u662f\u5728\u7761\u89c9\u7684\u65f6\u5019\u8fd9\u566a\u97f3\u6210\u4e3a\u5927\u5bb6\u7684\u6076\u68a6\u3002\"\n    }\n]\n```\n\nField descriptions:\n- id: unique identifier with no semantic meaning\n- original_text: the original text containing errors\n- wrong_ids: positions of the erroneous characters, counted from 0\n- correct_text: the corrected text\n\n#### Custom Dataset\n\nYou can train a correction model on your own dataset: annotate it, save it in the same JSON format as the training set, then load the data and train the model.\n\n1. If you already have a large number of business-related error samples, mainly annotate the error positions (wrong_ids) and the corrected sentences (correct_text)\n2. 
If you have no ready-made error samples, you can write a script to generate them (original_text): based on features such as phonetic or visual similarity, replace the characters at the specified positions (wrong_ids) of correct sentences with wrong characters. For reference, here is a\nthird-party homophone generation script: [homophone replacement](https://github.com/dongrixinyu/JioNLP/wiki/%E6%95%B0%E6%8D%AE%E5%A2%9E%E5%BC%BA-%E8%AF%B4%E6%98%8E%E6%96%87%E6%A1%A3#%E5%90%8C%E9%9F%B3%E8%AF%8D%E6%9B%BF%E6%8D%A2)\n\n\n### Language Model\n\n[What is a language model? (wiki)](https://github.com/shibing624/pycorrector/wiki/%E7%BB%9F%E8%AE%A1%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B%E5%8E%9F%E7%90%86)\n\nThe language model is critical to the correction step. The current default is a Chinese language model trained on gigabyte-scale Chinese text, [zh_giga.no_cna_cmn.prune01244.klm(2.8G)](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm).\nA lightweight language model trained on the People's Daily 2014 corpus is also provided: [people2014corpus_chars.klm (password: o5e9)](https://pan.baidu.com/s/1I2GElyHy_MAdek3YaziFYw).\n\nYou can train a general-purpose language model on corpora such as Chinese Wikipedia (after converting Traditional Chinese to Simplified Chinese, which pycorrector.utils.text_utils supports), or train a more specialized language model on domain-specific corpora. A better-suited language model noticeably improves correction results.\n\n1. 
For how to use the kenlm language model training tool, see this blog post: http://blog.csdn.net/mingzai624/article/details/79560063\n2. 16GB of Chinese and English unsupervised and parallel corpora: [Linly-AI/Chinese-pretraining-dataset](https://huggingface.co/datasets/Linly-AI/Chinese-pretraining-dataset)\n3. 524MB Chinese Wikipedia corpus: [wikipedia-cn-20230720-filtered](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)\n\n\n\n## Contact\n\n- GitHub Issues (recommended): [![GitHub issues](https://img.shields.io/github/issues/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/issues)\n- GitHub Discussions: feel free to chat in the discussion area [![GitHub discussions](https://img.shields.io/github/discussions/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/discussions) (this does not disturb the developers) and to discuss correction techniques and problems openly\n- Email: xuming: xuming624@qq.com\n- WeChat: add *WeChat ID: xuming624* to join the Python-NLP discussion group, with the note *name-company-NLP*\n\n\n<img src=\"https://github.com/shibing624/pycorrector/blob/master/docs/git_image/wechat.jpeg\" width=\"200\" />\n\n<img src=\"https://github.com/shibing624/pycorrector/blob/master/docs/git_image/wechat_group.jpg\" width=\"200\" />\n\n## Citation\n\nIf you use pycorrector in your research, please cite it in the following format:\n\nAPA:\n```latex\nXu, M. Pycorrector: Text error correction tool (Version 0.4.2) [Computer software]. 
https://github.com/shibing624/pycorrector\n```\n\nBibTeX:\n```latex\n@misc{Xu_Pycorrector_Text_error,\n  title={Pycorrector: Text error correction tool},\n  author={Ming Xu},\n  year={2023},\n  howpublished={\\url{https://github.com/shibing624/pycorrector}},\n}\n```\n\n\n\n## License\n\npycorrector is licensed under the **Apache License 2.0** and is free for commercial use. Please include a link to pycorrector and its license in your product documentation.\n\n## Contribute\n\nThe project code is still rough. If you have improved the code, you are welcome to contribute it back to this project. Before submitting, please note the following two points:\n\n- Add corresponding unit tests in `tests`\n- Run all unit tests with `python -m pytest` and make sure they all pass\n\nAfter that, you can submit a PR.\n\n## References\n\n* [\u57fa\u4e8e\u6587\u6cd5\u6a21\u578b\u7684\u4e2d\u6587\u7ea0\u9519\u7cfb\u7edf](https://blog.csdn.net/mingzai624/article/details/82390382)\n* [Norvig\u2019s spelling corrector](http://norvig.com/spell-correct.html)\n* [Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape[Yu, 2013]](http://www.aclweb.org/anthology/W/W14/W14-6835.pdf)\n* [Chinese Spelling Checker Based on Statistical Machine Translation[Chiu, 2013]](http://www.aclweb.org/anthology/O/O13/O13-1005.pdf)\n* [Chinese Word Spelling Correction Based on Rule Induction[Yeh, 2014]](http://aclweb.org/anthology/W14-6822)\n* [Neural Language Correction with Character-Based Attention[Ziang Xie, 2016]](https://arxiv.org/pdf/1603.09727.pdf)\n* [Chinese Spelling Check System Based on Tri-gram Model[Qiang Huang, 
2014]](http://www.anthology.aclweb.org/W/W14/W14-6827.pdf)\n* [Neural Abstractive Text Summarization with Sequence-to-Sequence Models[Tian Shi, 2018]](https://arxiv.org/abs/1812.02303)\n* [\u57fa\u4e8e\u6df1\u5ea6\u5b66\u4e60\u7684\u4e2d\u6587\u6587\u672c\u81ea\u52a8\u6821\u5bf9\u7814\u7a76\u4e0e\u5b9e\u73b0[\u6768\u5b97\u9716, 2019]](https://github.com/shibing624/pycorrector/blob/master/docs/\u57fa\u4e8e\u6df1\u5ea6\u5b66\u4e60\u7684\u4e2d\u6587\u6587\u672c\u81ea\u52a8\u6821\u5bf9\u7814\u7a76\u4e0e\u5b9e\u73b0.pdf)\n* [A Sequence to Sequence Learning for Chinese Grammatical Error Correction[Hongkai Ren, 2018]](https://link.springer.com/chapter/10.1007/978-3-319-99501-4_36)\n* [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB)\n* [Revisiting Pre-trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922)\n* Ruiqing Zhang, Chao Pang et al. \"Correcting Chinese Spelling Errors with Phonetic Pre-training\", ACL, 2021\n* Dingmin Wang et al. \"A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check\", EMNLP, 2018\n* [MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction](https://aclanthology.org/2022.naacl-main.227) (Zhang et al., NAACL 2022)\n",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Chinese Text Error Corrector",
    "version": "1.1.3",
    "project_urls": {
        "Homepage": "https://github.com/shibing624/pycorrector"
    },
    "split_keywords": [
        "pycorrector",
        " correction",
        " chinese error correction",
        " nlp"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3db363187b2a3fdebc37eaa1a01f0bf9ff81ddd290cbfd6f0b82785126e908df",
                "md5": "b19fac470d4f26dfee103c5eda2fd7be",
                "sha256": "e9364d2920d53a16b3e9c5823f52c691296ab1ee7acd98c992482ff58714c5a8"
            },
            "downloads": -1,
            "filename": "pycorrector-1.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "b19fac470d4f26dfee103c5eda2fd7be",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 4369779,
            "upload_time": "2025-07-09T02:40:33",
            "upload_time_iso_8601": "2025-07-09T02:40:33.531371Z",
            "url": "https://files.pythonhosted.org/packages/3d/b3/63187b2a3fdebc37eaa1a01f0bf9ff81ddd290cbfd6f0b82785126e908df/pycorrector-1.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-09 02:40:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "shibing624",
    "github_project": "pycorrector",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "jieba",
            "specs": []
        },
        {
            "name": "pypinyin",
            "specs": []
        },
        {
            "name": "transformers",
            "specs": []
        },
        {
            "name": "datasets",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "six",
            "specs": []
        },
        {
            "name": "loguru",
            "specs": []
        },
        {
            "name": "pyahocorasick",
            "specs": []
        }
    ],
    "lcname": "pycorrector"
}

XuMing