pypinyin-g2pw


* Name: pypinyin-g2pw
* Version: 0.4.0
* Home page: https://github.com/mozillazg/pypinyin-g2pW
* Summary: Improves the accuracy of pypinyin using g2pW.
* Upload time: 2023-06-24 08:08:12
* Author: mozillazg
* Requires Python: >=3.6, <4
* Requirements: none recorded

# pypinyin-g2pW

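Improves the accuracy of [pypinyin](https://github.com/mozillazg/python-pinyin) using [g2pW](https://github.com/GitYCC/g2pW/).

Features:

* Pinyin accuracy can be improved by training the model.
* Features and usage largely match pypinyin, and multiple pinyin styles are supported.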


## Usage

### Installing dependencies

1. Install [PyTorch](https://pytorch.org/get-started/locally/).
2. Download and unpack G2PWModel:

   ```
   wget https://storage.googleapis.com/esun-ai/g2pW/G2PWModel-v2-onnx.zip
   unzip G2PWModel-v2-onnx.zip
   ```
3. Install [git-lfs](https://git-lfs.github.com/).
4. Download [bert-base-chinese](https://huggingface.co/bert-base-chinese):

   ```
   git lfs install
   git clone https://huggingface.co/bert-base-chinese
   ```
5. Install this project:

   ```
   pip install pypinyin-g2pw
   ```
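
Before constructing `G2PWPinyin`, it can help to confirm that both downloads are where the code will look for them. The following is a minimal sketch, assuming the model data sits in `G2PWModel/` and `bert-base-chinese/` under the working directory (the paths used in this README's usage example; the actual unpacked directory name may differ). `missing_model_dirs` is a hypothetical helper and not part of pypinyin-g2pw.

```python
from pathlib import Path


def missing_model_dirs(base_dir=".", names=("G2PWModel", "bert-base-chinese")):
    """Return the expected model directories that are absent or empty.

    The default names are assumptions taken from the usage example.
    """
    base = Path(base_dir)
    missing = []
    for name in names:
        path = base / name
        # Count a directory as present only if it exists and holds at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


absent = missing_model_dirs()
if absent:
    print("Missing model data:", ", ".join(absent))
else:
    print("Model data found.")
```

If a directory is reported missing, re-running the corresponding download step is usually simpler than debugging a load failure later.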

### Usage example

   ```python
   >>> from pypinyin import Style
   >>> from pypinyin_g2pw import G2PWPinyin

   # Point model_dir and model_source at the directories holding the downloaded model data.
   >>> g2pw = G2PWPinyin(model_dir='G2PWModel/',
   ...                   model_source='bert-base-chinese/',
   ...                   v_to_u=False, neutral_tone_with_five=True)
   >>> han = '然而,他红了20年以后,他竟退出了大家的视线。'

   # def lazy_pinyin(self, hans, style=Style.NORMAL, errors='default', strict=True, **kwargs)
   # Call lazy_pinyin to get the pinyin. Its parameters mean and behave the same as in pypinyin;
   # v_to_u and neutral_tone_with_five can only be set when initializing G2PWPinyin.

   >>> g2pw.lazy_pinyin(han)
   ['ran', 'er', ',', 'ta', 'hong', 'le', '20', 'nian', 'yi', 'hou', ',', 'ta', 'jing', 'tui', 'chu', 'le', 'da', 'jia', 'de', 'shi', 'xian', '。']

   >>> g2pw.lazy_pinyin(han, style=Style.TONE)
   ['rán', 'ér', ',', 'tā', 'hóng', 'le', '20', 'nián', 'yǐ', 'hòu', ',', 'tā', 'jìng', 'tuì', 'chū', 'le', 'dà', 'jiā', 'de', 'shì', 'xiàn', '。']

   >>> g2pw.lazy_pinyin(han, style=Style.TONE3)
   ['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
   ```
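
The `Style.TONE3` output mixes pinyin syllables with tokens passed through unchanged (punctuation, digits), so downstream code often needs to tell them apart. The sketch below is one possible post-processing step, not part of pypinyin-g2pw: `split_tone3` is a hypothetical helper, and the sample list is copied from the `Style.TONE3` output above. It assumes tones are written as a trailing digit from 1 to 5, which for the neutral tone holds only when `neutral_tone_with_five=True`.

```python
import re

# One pinyin syllable followed by a tone digit 1-5, e.g. 'ran2' or 'le5'.
_TONE3 = re.compile(r"[a-zü]+[1-5]")


def split_tone3(tokens):
    """Split Style.TONE3 tokens into (syllable, tone) pairs.

    Tokens that are not pinyin (punctuation, numbers) get a tone of None.
    """
    pairs = []
    for token in tokens:
        if _TONE3.fullmatch(token):
            # The last character is the tone digit; the rest is the syllable.
            pairs.append((token[:-1], int(token[-1])))
        else:
            pairs.append((token, None))
    return pairs


# Sample taken from the start of the Style.TONE3 output in the usage example.
sample = ['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20', 'nian2']
for syllable, tone in split_tone3(sample):
    print(syllable, tone)
```

Note that `'20'` is left untouched because it contains no letters, so it is not mistaken for a syllable with tone digit 0.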

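## Offline use

By default, even when the model data is stored locally, the transformers module used by the program still downloads some model metadata from huggingface.co.
To run fully offline, set the environment variables `TRANSFORMERS_OFFLINE=1` and `HF_DATASETS_OFFLINE=1`, which disables the metadata fetch.
See the [transformers documentation](https://huggingface.co/docs/transformers/v4.21.2/en/installation#offline-mode) for details.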
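
The two variables can also be set from Python instead of the shell. The following is a minimal sketch, assuming the flags are read when transformers is first imported, so they must be set before importing `pypinyin_g2pw`; `enable_offline_mode` is a hypothetical helper name, and the commented-out lines reuse the paths from the usage example.

```python
import os


def enable_offline_mode(env=os.environ):
    """Set the two offline flags named in this section and return the mapping."""
    env["TRANSFORMERS_OFFLINE"] = "1"
    env["HF_DATASETS_OFFLINE"] = "1"
    return env


# Call this before importing pypinyin_g2pw (and therefore transformers).
enable_offline_mode()

# With the flags set, the converter can then be created as usual:
# from pypinyin_g2pw import G2PWPinyin
# g2pw = G2PWPinyin(model_dir='G2PWModel/', model_source='bert-base-chinese/')
```

Setting the variables in the shell (or in the service's environment configuration) has the same effect and avoids any dependence on import order.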


## Model training

See the instructions in the [g2pW](https://github.com/GitYCC/g2pW/#train-model) documentation.



            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/mozillazg/pypinyin-g2pW",
    "name": "pypinyin-g2pw",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6, <4",
    "maintainer_email": "",
    "keywords": "",
    "author": "mozillazg",
    "author_email": "mozillazg101@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/d8/67/35dca3acbe790143520f839260e8bc56d2c99b094a1c730ff62f24ba4443/pypinyin-g2pw-0.4.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "\u57fa\u4e8e g2pW \u63d0\u5347 pypinyin \u7684\u51c6\u786e\u6027\u3002",
    "version": "0.4.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/mozillazg/pypinyin-g2pW/issues",
        "Homepage": "https://github.com/mozillazg/pypinyin-g2pW"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2e0ccfa11790f5b7a78d6f0b1c4e6210aa8e5b89a3367d87150fd5255061167b",
                "md5": "d60bb94d6576d55a4d5ddb839d5392d5",
                "sha256": "bb71613b9b6cabe190e50ecc6aed1729f4027196b4ac82e6feb617b667d8380c"
            },
            "downloads": -1,
            "filename": "pypinyin_g2pw-0.4.0-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d60bb94d6576d55a4d5ddb839d5392d5",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.6, <4",
            "size": 4916,
            "upload_time": "2023-06-24T08:08:11",
            "upload_time_iso_8601": "2023-06-24T08:08:11.234249Z",
            "url": "https://files.pythonhosted.org/packages/2e/0c/cfa11790f5b7a78d6f0b1c4e6210aa8e5b89a3367d87150fd5255061167b/pypinyin_g2pw-0.4.0-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d86735dca3acbe790143520f839260e8bc56d2c99b094a1c730ff62f24ba4443",
                "md5": "656f3857f158167f4b3e5ecc0a6a40be",
                "sha256": "fb284b5ff4119b32db0d56935ba1da01d95e5e30666be9a0521a20d881d799f4"
            },
            "downloads": -1,
            "filename": "pypinyin-g2pw-0.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "656f3857f158167f4b3e5ecc0a6a40be",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6, <4",
            "size": 4217,
            "upload_time": "2023-06-24T08:08:12",
            "upload_time_iso_8601": "2023-06-24T08:08:12.834519Z",
            "url": "https://files.pythonhosted.org/packages/d8/67/35dca3acbe790143520f839260e8bc56d2c99b094a1c730ff62f24ba4443/pypinyin-g2pw-0.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-24 08:08:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mozillazg",
    "github_project": "pypinyin-g2pW",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pypinyin-g2pw"
}
        