[![PyPI version](https://badge.fury.io/py/pinyintokenizer.svg)](https://badge.fury.io/py/pinyintokenizer)
[![Downloads](https://pepy.tech/badge/pinyintokenizer)](https://pepy.tech/project/pinyintokenizer)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.5%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/issues)
[![Wechat Group](http://vlog.sfyc.ltd/wechat_everyday/wxgroup_logo.png?imageView2/0/w/60/h/20)](#Contact)
# Pinyin Tokenizer
Pinyin tokenizer: splits a continuous pinyin string into a list of single-syllable pinyins. Works out of the box. Written in Python 3.
**Guide**
- [Feature](#Feature)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [Citation](#Citation)
- [Related-Projects](#Related-Projects)
# Feature
- Efficiently splits continuous pinyin into a list of single-syllable pinyins using a prefix tree (PyTrie), which is convenient for downstream processing such as pinyin-to-hanzi conversion.
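The core idea can be sketched as longest-match segmentation against a syllable table. The snippet below is illustrative only: `VALID` is a tiny stand-in for the full pinyin syllable table the library stores in its prefix tree, and `segment` is a hypothetical helper, not the library's API. Note that a plain greedy match can mis-split ambiguous inputs such as `xian` (`xi an` vs. `xian`), which is why a trie plus extra handling is used in practice.

```python
# Illustrative sketch of longest-match pinyin segmentation.
# VALID is a tiny stand-in for the full pinyin syllable table;
# segment() is hypothetical, not the library's API.
VALID = {"ni", "hao", "liu", "de", "hua", "wo", "men"}

def segment(text, max_len=6):
    """Return (syllables, leftovers) by greedy longest match."""
    syllables, leftovers = [], []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one char at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in VALID:
                syllables.append(text[i:j])
                i = j
                break
        else:  # no syllable starts here; record the char as invalid
            leftovers.append(text[i])
            i += 1
    return syllables, leftovers

print(segment("nihao"))     # (['ni', 'hao'], [])
print(segment("liudehua"))  # (['liu', 'de', 'hua'], [])
```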
# Install
- Requirements and Installation
```shell
pip install pinyintokenizer
```
or
```shell
git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install
```
# Usage
## Pinyin Tokenization
example: [examples/pinyin_tokenize_demo.py](examples/pinyin_tokenize_demo.py):
```python
import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学

    # English is not supported
    print(f"{m.tokenize('good luck')}")
```
output:
```shell
(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])
```
- The `tokenize` method returns a tuple of two lists: the first contains the recognized pinyin syllables, the second contains the characters that are not valid pinyin.
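A caller can use the second list, for example, to check whether an input is clean pinyin. This is a hedged sketch over the documented return shape; `is_clean_pinyin` is a hypothetical helper, not part of the library:

```python
def is_clean_pinyin(tokenize_result):
    """True if the non-pinyin leftovers are at most spaces.

    Operates on the (pinyins, invalid_chars) tuple shape that
    tokenize() returns; this helper is illustrative only.
    """
    pinyins, invalid_chars = tokenize_result
    return bool(pinyins) and all(ch == ' ' for ch in invalid_chars)

print(is_clean_pinyin((['liu', 'de', 'hua'], [' ', ' '])))  # True
print(is_clean_pinyin((['wo'], ['3'])))                     # False (tone digit left over)
```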
## Continuous Pinyin to Hanzi (Pinyin2Hanzi)
First use this library ([pinyintokenizer](https://pypi.org/project/pinyintokenizer/)) to split the continuous pinyin, then use the [Pinyin2Hanzi](https://pypi.org/project/Pinyin2Hanzi/) library to convert the pinyin to hanzi.
example: [examples/pinyin2hanzi_demo.py](examples/pinyin2hanzi_demo.py):
```python
import sys

from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()


def pinyin2hanzi(pinyin_sentence):
    pinyin_list, _ = PinyinTokenizer().tokenize(pinyin_sentence)
    result = dag(dagparams, pinyin_list, path_num=1)
    return ''.join(result[0].path)


if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")
```
output:
```shell
我
今天天气不错
刘德华
```
# Contact
- Issues and suggestions: [![GitHub issues](https://img.shields.io/github/issues/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/issues)
- Email: xuming624@qq.com
- WeChat: add *WeChat ID: xuming624* with the note *Name-Company-NLP* to join the Python-NLP discussion group.
<img src="docs/wechat.jpeg" width="200" />
# Citation
If you use pinyin-tokenizer in your research, please cite it as follows:
APA:
```
Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer
```
BibTeX:
```latex
@misc{pinyin-tokenizer,
title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
author={Xu Ming},
year={2022},
howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}
```
# License
Licensed under [The Apache License 2.0](LICENSE); free for commercial use. Please include a link to **pinyin-tokenizer** and the license in your product documentation.
# Contribute
The code is still rough. Improvements are welcome; before submitting a PR, please:
- add corresponding unit tests under `tests`
- run all unit tests with `python -m pytest` and make sure they pass

Then submit your PR.
# Related Projects
- Hanzi to pinyin: [pypinyin](https://github.com/mozillazg/python-pinyin)
- Pinyin to hanzi: [Pinyin2Hanzi](https://github.com/letiantian/Pinyin2Hanzi)