WeTextProcessing


| Field | Value |
| ---- | ---- |
| Name | WeTextProcessing |
| Version | 1.0.3 |
| Home page | https://github.com/wenet-e2e/WeTextProcessing |
| Summary | WeTextProcessing, including TN & ITN |
| Upload time | 2024-07-04 09:25:31 |
| Author | Zhendong Peng, Xingchen Song |
| Requirements | flake8, importlib_resources, pynini, pytest, pre-commit |

## Text Normalization & Inverse Text Normalization

### 0. Brief Introduction

```diff
- **Must Read Doc** (In Chinese): https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ
```

[WeTextProcessing: Production First & Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ)

#### 0.1 Text Normalization

<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png" alt="Cover" width="50%"/></div>

#### 0.2 Inverse Text Normalization

<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png" alt="Cover" width="50%"/></div>

### 1. How To Use

#### 1.1 Quick Start:
```bash
# install
pip install WeTextProcessing
```

Command-line usage:

```bash
wetn --text "2.5平方电线"
weitn --text "二点五平方电线"
```

Python usage:

```py
from itn.chinese.inverse_normalizer import InverseNormalizer
from tn.chinese.normalizer import Normalizer as ZhNormalizer
from tn.english.normalizer import Normalizer as EnNormalizer

# NOTE(xcsong): When the parameters differ from the defaults, the graph must be rebuilt.
#               To rebuild it, be sure to pass `overwrite_cache=True`.

zh_tn_text = "你好 WeTextProcessing 1.0,船新版本儿,船新体验儿,简直666,9和10"
zh_itn_text = "你好 WeTextProcessing 一点零,船新版本儿,船新体验儿,简直六六六,九和六"
en_tn_text = "Hello WeTextProcessing 1.0, life is short, just use wetext, 666, 9 and 10"
zh_tn_model = ZhNormalizer(remove_erhua=True, overwrite_cache=True)
zh_itn_model = InverseNormalizer(enable_0_to_9=False, overwrite_cache=True)
en_tn_model = EnNormalizer(overwrite_cache=True)
print("Chinese TN (erhua removed, graph rebuilt online):\n\t{} => {}".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))
print("Chinese ITN (standalone digits below 10 not converted, graph rebuilt online):\n\t{} => {}".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))
print("English TN (no configurable options yet, more to come...):\n\t{} => {}\n".format(en_tn_text, en_tn_model.normalize(en_tn_text)))

zh_tn_model = ZhNormalizer(overwrite_cache=False)
zh_itn_model = InverseNormalizer(overwrite_cache=False)
en_tn_model = EnNormalizer(overwrite_cache=False)
print("Chinese TN (reusing the previously compiled graph):\n\t{} => {}".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))
print("Chinese ITN (reusing the previously compiled graph):\n\t{} => {}".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))
print("English TN (reusing the previously compiled graph):\n\t{} => {}\n".format(en_tn_text, en_tn_model.normalize(en_tn_text)))

zh_tn_model = ZhNormalizer(remove_erhua=False, overwrite_cache=True)
zh_itn_model = InverseNormalizer(enable_0_to_9=True, overwrite_cache=True)
print("Chinese TN (erhua kept, graph rebuilt online):\n\t{} => {}".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))
print("Chinese ITN (standalone digits below 10 also converted, graph rebuilt online):\n\t{} => {}\n".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))
```
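
Because `overwrite_cache=False` reuses whatever graph was compiled last, it can help to remember which options that graph was built with. The following is a minimal sketch of that idea, assuming the cached graph does not itself record its build options. The helpers `needs_rebuild` and `record_build` and the sidecar file name `build_options.json` are hypothetical and not part of WeTextProcessing; the demo only writes a small JSON file and compiles nothing.

```python
import json
import tempfile
from pathlib import Path

STAMP = "build_options.json"  # hypothetical sidecar file, not created by the library


def needs_rebuild(cache_dir, options):
    """Return True when `options` differ from those recorded at the last build."""
    stamp = Path(cache_dir) / STAMP
    if not stamp.exists():
        return True  # nothing recorded yet, so assume a rebuild is required
    return json.loads(stamp.read_text(encoding="utf-8")) != options


def record_build(cache_dir, options):
    """Remember the options used for the most recent build."""
    stamp = Path(cache_dir) / STAMP
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(json.dumps(options, sort_keys=True), encoding="utf-8")


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as cache_dir:
    opts = {"remove_erhua": True}
    first = needs_rebuild(cache_dir, opts)    # no stamp yet
    record_build(cache_dir, opts)
    second = needs_rebuild(cache_dir, opts)   # same options as the last build
    third = needs_rebuild(cache_dir, {"remove_erhua": False})  # options changed
print(first, second, third)
```

In real use, the result of `needs_rebuild(...)` could be passed as `overwrite_cache` when constructing a normalizer, followed by `record_build(...)` once construction succeeds. Combining `cache_dir` and `overwrite_cache` in one call this way is an assumption to verify against the installed version.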

#### 1.2 Advanced Usage:

Customize your own rules and deploy WeTextProcessing with the C++ runtime.

To modify or adapt the TN/ITN rules to fix bad cases, try:

``` bash
git clone https://github.com/wenet-e2e/WeTextProcessing.git
cd WeTextProcessing
pip install -r requirements.txt
pre-commit install # for clean and tidy code
# `overwrite_cache` will rebuild all rules according to
#   your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
#   After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`.
python -m tn --text "2.5平方电线" --overwrite_cache
python -m itn --text "二点五平方电线" --overwrite_cache
```

Once the rules have been rebuilt successfully, you can deploy them either with the installed PyPI package:

```py
# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
>>> normalizer.normalize("2.5平方电线")
# itn usage
>>> from itn.chinese.inverse_normalizer import InverseNormalizer
>>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
>>> invnormalizer.normalize("二点五平方电线")
```
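
Before pointing `cache_dir` at a rebuilt directory, a quick check that the graph files are actually there can save a confusing failure later. This is a hedged sketch: the file names follow the `zh_tn_tagger.fst` / `zh_tn_verbalizer.fst` pattern used by the C++ runtime commands in this README, and whether the Python package looks for exactly these names is an assumption. The helper `missing_graph_files` is hypothetical.

```python
import tempfile
from pathlib import Path


def missing_graph_files(cache_dir, prefix):
    """List the expected graph files that are absent from `cache_dir`.

    `prefix` is e.g. "zh_tn" or "zh_itn" (naming assumed from the C++ commands).
    """
    expected = [f"{prefix}_tagger.fst", f"{prefix}_verbalizer.fst"]
    return [name for name in expected if not (Path(cache_dir) / name).is_file()]


# Demo: a directory that contains only the tagger graph.
with tempfile.TemporaryDirectory() as cache_dir:
    (Path(cache_dir) / "zh_tn_tagger.fst").touch()
    missing = missing_graph_files(cache_dir, "zh_tn")
print(missing)
```

An empty list means both files were found; anything else suggests rerunning the rebuild step with `--overwrite_cache`.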

Or with the C++ runtime:

```bash
cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
cmake --build build
# tn usage
cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn
./build/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线"
# itn usage
cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn
./build/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线"
```
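
When the C++ binary is driven from a script, building the argument list in one place avoids quoting mistakes with non-ASCII text. The sketch below only assembles the command shown above and does not run it; the helper name `processor_command` is hypothetical, while the `--tagger`, `--verbalizer`, and `--text` flags are the ones used in this README.

```python
import shlex
from pathlib import Path


def processor_command(build_dir, cache_dir, prefix, text):
    """Build the argv list for `processor_main` without executing it."""
    cache = Path(cache_dir)
    return [
        str(Path(build_dir) / "processor_main"),
        "--tagger", str(cache / f"{prefix}_tagger.fst"),
        "--verbalizer", str(cache / f"{prefix}_verbalizer.fst"),
        "--text", text,
    ]


cmd = processor_command("build", "tn", "zh_tn", "2.5平方电线")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the binary has been built.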

### 2. TN Pipeline

Please refer to [TN.README](tn/README.md)

### 3. ITN Pipeline

Please refer to [ITN.README](itn/README.md)

## Discussion & Communication

Chinese users can also scan the QR code on the left to follow the official WeNet account.
We have created a WeChat group for better discussion and quicker responses.
Please scan the personal QR code on the right; its owner will invite you to the chat group.

| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg" width="250px"> |
| ---- | ---- |

You can also start a discussion directly on [GitHub Issues](https://github.com/wenet-e2e/WeTextProcessing/issues).

## Acknowledgements

1. Thanks to the authors of foundational libraries like [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) & [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini).
2. Thanks to the [NeMo](https://github.com/NVIDIA/NeMo) team & the NeMo open-source community.
3. Thanks to [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and the [SpeechColab](https://github.com/SpeechColab) organization.
4. We referred to [Pynini](https://github.com/kylebgorman/pynini) for reading the FAR and printing the shortest path of a lattice in the C++ runtime.
5. We referred to [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data to build the tagger graph.
6. We referred to [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data to build the tagger graph.



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/wenet-e2e/WeTextProcessing",
    "name": "WeTextProcessing",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "Zhendong Peng, Xingchen Song",
    "author_email": "pzd17@tsinghua.org.cn, sxc19@tsinghua.org.cn",
    "download_url": "https://files.pythonhosted.org/packages/9b/28/edcb748b327b33c18ff5b4ea80e9e86b4ebd250c95de6448b98f7bee9659/WeTextProcessing-1.0.3.tar.gz",
    "platform": null,
    "description": "(README text, identical to the rendered README above)",
    "bugtrack_url": null,
    "license": null,
    "summary": "WeTextProcessing, including TN & ITN",
    "version": "1.0.3",
    "project_urls": {
        "Homepage": "https://github.com/wenet-e2e/WeTextProcessing"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ed8fc0fa549d04759fd48e27c55f8c2e69470c0228cbfba96ef3fef6a53bf383",
                "md5": "f8019db79bc2f8bf4a5451417b3b31b1",
                "sha256": "e84e8a28827cec06750e9d5f4229d80e438f4a7d7feb783e33abe0bbdb83385d"
            },
            "downloads": -1,
            "filename": "WeTextProcessing-1.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f8019db79bc2f8bf4a5451417b3b31b1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 1970527,
            "upload_time": "2024-07-04T09:25:28",
            "upload_time_iso_8601": "2024-07-04T09:25:28.871283Z",
            "url": "https://files.pythonhosted.org/packages/ed/8f/c0fa549d04759fd48e27c55f8c2e69470c0228cbfba96ef3fef6a53bf383/WeTextProcessing-1.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9b28edcb748b327b33c18ff5b4ea80e9e86b4ebd250c95de6448b98f7bee9659",
                "md5": "639c9cf0655f259c989935ac789fd105",
                "sha256": "c29649a74de1de0ee6a8603d4fec150c71f099230389b48d71e3de6463b7832e"
            },
            "downloads": -1,
            "filename": "WeTextProcessing-1.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "639c9cf0655f259c989935ac789fd105",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 1844620,
            "upload_time": "2024-07-04T09:25:31",
            "upload_time_iso_8601": "2024-07-04T09:25:31.423878Z",
            "url": "https://files.pythonhosted.org/packages/9b/28/edcb748b327b33c18ff5b4ea80e9e86b4ebd250c95de6448b98f7bee9659/WeTextProcessing-1.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-04 09:25:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wenet-e2e",
    "github_project": "WeTextProcessing",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "flake8",
            "specs": []
        },
        {
            "name": "importlib_resources",
            "specs": []
        },
        {
            "name": "pynini",
            "specs": [
                [
                    "==",
                    "2.1.6"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": []
        },
        {
            "name": "pre-commit",
            "specs": [
                [
                    "==",
                    "3.5.0"
                ]
            ]
        }
    ],
    "lcname": "wetextprocessing"
}
        