# pypinyin-g2pW
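Improves the accuracy of [pypinyin](https://github.com/mozillazg/python-pinyin) by using [g2pW](https://github.com/GitYCC/g2pW/).
Features:
* Pinyin accuracy can be improved further by training the model.
* Functionality and usage stay largely consistent with pypinyin, and multiple pinyin styles are supported.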
## Usage
### Install dependencies
1. Install [PyTorch](https://pytorch.org/get-started/locally/).
2. Download and extract G2PWModel:
```
wget https://storage.googleapis.com/esun-ai/g2pW/G2PWModel-v2-onnx.zip
unzip G2PWModel-v2-onnx.zip
```
3. Install [git-lfs](https://git-lfs.github.com/).
4. Download [bert-base-chinese](https://huggingface.co/bert-base-chinese):
```
git lfs install
git clone https://huggingface.co/bert-base-chinese
```
5. Install this project:
```
pip install pypinyin-g2pw
```
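Before running the example, it can help to confirm that both downloads ended up where the code expects them. The following is a minimal sketch, not part of pypinyin-g2pW: the helper name `missing_model_dirs` is hypothetical, and the default directory names are assumed from the usage example in this README (`G2PWModel/` and `bert-base-chinese/` in the current working directory).

```python
from pathlib import Path


def missing_model_dirs(model_dir="G2PWModel", model_source="bert-base-chinese"):
    """Return the expected model directories that do not exist yet.

    Hypothetical helper; the default names are assumptions taken from
    the usage example, not values defined by pypinyin-g2pW itself.
    """
    candidates = [Path(model_dir), Path(model_source)]
    return [str(p) for p in candidates if not p.is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing model directories:", ", ".join(missing))
    else:
        print("Both model directories are present.")
```

If a directory is reported as missing, re-run the corresponding download step or pass the actual paths to the helper.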
### Usage example
```python
>>> from pypinyin import Style
>>> from pypinyin_g2pw import G2PWPinyin
# Point model_dir and model_source at the directories of the downloaded model data
>>> g2pw = G2PWPinyin(model_dir='G2PWModel/',
...                   model_source='bert-base-chinese/',
...                   v_to_u=False, neutral_tone_with_five=True)
>>> han = '然而,他红了20年以后,他竟退出了大家的视线。'
# def lazy_pinyin(self, hans, style=Style.NORMAL, errors='default', strict=True, **kwargs)
# Call lazy_pinyin to get the pinyin; its parameters have the same meaning and effect as in pypinyin.
# v_to_u and neutral_tone_with_five can only be set when initializing G2PWPinyin.
>>> g2pw.lazy_pinyin(han)
['ran', 'er', ',', 'ta', 'hong', 'le', '20', 'nian', 'yi', 'hou', ',', 'ta', 'jing', 'tui', 'chu', 'le', 'da', 'jia', 'de', 'shi', 'xian', '。']
>>> g2pw.lazy_pinyin(han, style=Style.TONE)
['rán', 'ér', ',', 'tā', 'hóng', 'le', '20', 'nián', 'yǐ', 'hòu', ',', 'tā', 'jìng', 'tuì', 'chū', 'le', 'dà', 'jiā', 'de', 'shì', 'xiàn', '。']
>>> g2pw.lazy_pinyin(han, style=Style.TONE3)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
```
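The `Style.TONE3` output mixes pinyin syllables with tokens that are passed through unchanged, such as punctuation and `'20'`. If downstream code needs the syllable and the tone number separately, something like the sketch below may be useful. The helper `split_tone3` is hypothetical and not part of pypinyin or pypinyin-g2pW; it assumes tones are written as a trailing digit from 1 to 5, as in the output above.

```python
import re

# A TONE3 syllable: lowercase letters (ü allowed) followed by one tone digit 1-5.
_TONE3 = re.compile(r"^([a-zü]+)([1-5])$")


def split_tone3(token):
    """Split a TONE3 token into (syllable, tone).

    Tokens that are not pinyin syllables (punctuation, numbers)
    are returned unchanged with a tone of None.
    """
    match = _TONE3.match(token)
    if match is None:
        return token, None
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    sample = ['ran2', 'er2', ',', 'ta1', '20', 'le5']
    print([split_tone3(t) for t in sample])
```

Requiring letters before the digit is what keeps a numeric token such as `'20'` from being misread as a syllable with tone 0.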
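## Offline usage
By default, even when local model data is used, the transformers module that this package depends on still downloads some model metadata from huggingface.co.
To disable these metadata requests and run fully offline, set the environment variables `TRANSFORMERS_OFFLINE=1` and `HF_DATASETS_OFFLINE=1`.
See the [official transformers documentation](https://huggingface.co/docs/transformers/v4.21.2/en/installation#offline-mode) for details.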
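The two variables can also be set from Python instead of the shell. The sketch below assumes they should be in place before `transformers` (and therefore `pypinyin_g2pw`) is imported, which is the safe order; the helper name `enable_offline_mode` is hypothetical.

```python
import os


def enable_offline_mode(env=os.environ):
    """Set the two offline-mode variables named in this README.

    Hypothetical helper; `env` defaults to the real process environment
    but accepts any mutable mapping.
    """
    env["TRANSFORMERS_OFFLINE"] = "1"
    env["HF_DATASETS_OFFLINE"] = "1"
    return env


# Set the variables first, then import the package that uses transformers.
enable_offline_mode()
# from pypinyin_g2pw import G2PWPinyin  # import only after the variables are set
```

Setting the variables in the shell before starting Python has the same effect and avoids any dependence on import order.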
## Model training
See the instructions in the official [g2pW](https://github.com/GitYCC/g2pW/#train-model) documentation.