## Text Normalization & Inverse Text Normalization
### 0. Brief Introduction
```diff
- **Must Read Doc** (In Chinese): https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ
```
[WeTextProcessing: Production First & Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ)
#### 0.1 Text Normalization
<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png" alt="Cover" width="50%"/></div>
#### 0.2 Inverse Text Normalization
<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png" alt="Cover" width="50%"/></div>
### 1. How To Use
#### 1.1 Quick Start:
```bash
# install
pip install WeTextProcessing
```
Command-line usage:
```bash
wetn --text "2.5平方电线"
weitn --text "二点五平方电线"
```
Python usage:
```py
from itn.chinese.inverse_normalizer import InverseNormalizer
from tn.chinese.normalizer import Normalizer as ZhNormalizer
from tn.english.normalizer import Normalizer as EnNormalizer
# NOTE(xcsong): If your parameters differ from the defaults, the graph must be rebuilt.
# To rebuild the graph, make sure to pass `overwrite_cache=True`.
zh_tn_text = "你好 WeTextProcessing 1.0,船新版本儿,船新体验儿,简直666,9和10"
zh_itn_text = "你好 WeTextProcessing 一点零,船新版本儿,船新体验儿,简直六六六,九和十"
en_tn_text = "Hello WeTextProcessing 1.0, life is short, just use wetext, 666, 9 and 10"
zh_tn_model = ZhNormalizer(remove_erhua=True, overwrite_cache=True)
zh_itn_model = InverseNormalizer(enable_0_to_9=False, overwrite_cache=True)
en_tn_model = EnNormalizer(overwrite_cache=True)
print("中文 TN (去除儿化音,重新在线构图):\n\t{} => {}".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))
print("中文ITN (小于10的单独数字不转换,重新在线构图):\n\t{} => {}".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))
print("英文 TN (暂时还没有可控的选项,后面会加...):\n\t{} => {}\n".format(en_tn_text, en_tn_model.normalize(en_tn_text)))
zh_tn_model = ZhNormalizer(overwrite_cache=False)
zh_itn_model = InverseNormalizer(overwrite_cache=False)
en_tn_model = EnNormalizer(overwrite_cache=False)
print("中文 TN (复用之前编译好的图):\n\t{} => {}".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))
print("中文ITN (复用之前编译好的图):\n\t{} => {}".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))
print("英文 TN (复用之前编译好的图):\n\t{} => {}\n".format(en_tn_text, en_tn_model.normalize(en_tn_text)))
zh_tn_model = ZhNormalizer(remove_erhua=False, overwrite_cache=True)
zh_itn_model = InverseNormalizer(enable_0_to_9=True, overwrite_cache=True)
print("中文 TN (不去除儿化音,重新在线构图):\n\t{} => {}".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))
print("中文ITN (小于10的单独数字也进行转换,重新在线构图):\n\t{} => {}\n".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))
```
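Graph compilation is the expensive step, so in practice you usually build a normalizer once and reuse it across many sentences. A minimal sketch of this pattern (the `normalize_batch` helper and the sample inputs are ours, not part of the package):

```py
from tn.chinese.normalizer import Normalizer as ZhNormalizer

def normalize_batch(normalizer, texts):
    """Normalize a list of sentences with a single pre-built normalizer."""
    return [normalizer.normalize(text) for text in texts]

zh_tn_model = ZhNormalizer()  # reuses the cached graph when available
print(normalize_batch(zh_tn_model, ["共有12个", "3.5平方电线"]))
```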
#### 1.2 Advanced Usage:
DIY your own rules and deploy WeTextProcessing with the C++ runtime!
If you want to modify the tn/itn rules to fix bad cases, try:
``` bash
git clone https://github.com/wenet-e2e/WeTextProcessing.git
cd WeTextProcessing
pip install -r requirements.txt
pre-commit install # for clean and tidy code
# `overwrite_cache` will rebuild all rules according to
# your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
# After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`.
python -m tn --text "2.5平方电线" --overwrite_cache
python -m itn --text "二点五平方电线" --overwrite_cache
```
Once you have successfully rebuilt your rules, you can deploy them either with the installed PyPI package:
```py
# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
>>> normalizer.normalize("2.5平方电线")
# itn usage
>>> from itn.chinese.inverse_normalizer import InverseNormalizer
>>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
>>> invnormalizer.normalize("二点五平方电线")
```
Or with the C++ runtime:
```bash
cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
cmake --build build
# tn usage
cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn
./build/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线"
# itn usage
cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn
./build/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线"
```
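If you would rather drive the compiled binary from Python (e.g. inside a preprocessing script), a thin subprocess wrapper like the following may be handy; `run_processor` is our own hypothetical helper, not part of the toolkit:

```py
import subprocess

def run_processor(binary, cache_dir, text, task="tn"):
    """Run processor_main with the tagger/verbalizer FSTs found in cache_dir."""
    cmd = [
        binary,
        "--tagger", f"{cache_dir}/zh_{task}_tagger.fst",
        "--verbalizer", f"{cache_dir}/zh_{task}_verbalizer.fst",
        "--text", text,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

# Example (paths are placeholders):
# print(run_processor("./build/processor_main", "WeTextProcessing/tn", "2.5平方电线"))
```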
### 2. TN Pipeline
Please refer to [TN.README](tn/README.md)
### 3. ITN Pipeline
Please refer to [ITN.README](itn/README.md)
## Discussion & Communication
Chinese users can also scan the QR code on the left to follow the official WeNet account.
We have created a WeChat group for better discussion and quicker responses.
Please scan the personal QR code on the right; its owner will invite you to the chat group.
| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg" width="250px"> |
| ---- | ---- |
Or you can discuss directly on [GitHub Issues](https://github.com/wenet-e2e/WeTextProcessing/issues).
## Acknowledgements
1. Thanks to the authors of foundational libraries such as [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) & [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini).
2. Thanks to the [NeMo](https://github.com/NVIDIA/NeMo) team & the NeMo open-source community.
3. Thanks to [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and the [SpeechColab](https://github.com/SpeechColab) organization.
4. We referred to [Pynini](https://github.com/kylebgorman/pynini) for reading FAR files and printing the shortest path of a lattice in the C++ runtime.
5. We referred to [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data used to build the tagger graph.
6. We referred to [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data used to build the tagger graph.