[![PyPI version](https://badge.fury.io/py/pinyintokenizer.svg)](https://badge.fury.io/py/pinyintokenizer)
[![Downloads](https://pepy.tech/badge/pinyintokenizer)](https://pepy.tech/project/pinyintokenizer)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.5%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/issues)
[![Wechat Group](http://vlog.sfyc.ltd/wechat_everyday/wxgroup_logo.png?imageView2/0/w/60/h/20)](#Contact)
# Pinyin Tokenizer
Pinyin tokenizer: splits a continuous pinyin string into a list of single-syllable pinyins. Works out of the box. Written in Python 3.
**Guide**
- [Feature](#Feature)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [Citation](#Citation)
- [Related-Projects](#Related-Projects)
# Feature
- Efficiently splits continuous pinyin into a list of single-syllable pinyins using a prefix tree (PyTrie), which is convenient for downstream processing such as pinyin-to-hanzi conversion.
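The core idea can be sketched as longest-match segmentation against a syllable table. The snippet below is illustrative only: `VALID` is a tiny stand-in for the full pinyin syllable table the library stores in its prefix tree, and `segment` is a hypothetical helper, not the library's API. Note that a plain greedy match can mis-split ambiguous inputs such as `xian` (`xi an` vs. `xian`), which is why a trie plus extra handling is used in practice.

```python
# Illustrative sketch of longest-match pinyin segmentation.
# VALID is a tiny stand-in for the full pinyin syllable table;
# segment() is hypothetical, not the library's API.
VALID = {"ni", "hao", "liu", "de", "hua", "wo", "men"}

def segment(text, max_len=6):
    """Return (syllables, leftovers) by greedy longest match."""
    syllables, leftovers = [], []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one char at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in VALID:
                syllables.append(text[i:j])
                i = j
                break
        else:  # no syllable starts here; record the char as invalid
            leftovers.append(text[i])
            i += 1
    return syllables, leftovers

print(segment("nihao"))     # (['ni', 'hao'], [])
print(segment("liudehua"))  # (['liu', 'de', 'hua'], [])
```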
# Install
- Requirements and Installation
```shell
pip install pinyintokenizer
```
or
```shell
git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install
```
# Usage
## Pinyin Tokenization
example: [examples/pinyin_tokenize_demo.py](examples/pinyin_tokenize_demo.py):
```python
import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学

    # English is not supported
    print(f"{m.tokenize('good luck')}")
```
output:
```shell
(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])
```
- The `tokenize` method returns a tuple of two lists: the first contains the recognized pinyin syllables, the second contains the characters that are not valid pinyin.
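A caller can use the second list, for example, to check whether an input is clean pinyin. This is a hedged sketch over the documented return shape; `is_clean_pinyin` is a hypothetical helper, not part of the library:

```python
def is_clean_pinyin(tokenize_result):
    """True if the non-pinyin leftovers are at most spaces.

    Operates on the (pinyins, invalid_chars) tuple shape that
    tokenize() returns; this helper is illustrative only.
    """
    pinyins, invalid_chars = tokenize_result
    return bool(pinyins) and all(ch == ' ' for ch in invalid_chars)

print(is_clean_pinyin((['liu', 'de', 'hua'], [' ', ' '])))  # True
print(is_clean_pinyin((['wo'], ['3'])))                     # False (tone digit left over)
```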
## Continuous Pinyin to Hanzi (Pinyin2Hanzi)
First use this library ([pinyintokenizer](https://pypi.org/project/pinyintokenizer/)) to split the continuous pinyin, then use the [Pinyin2Hanzi](https://pypi.org/project/Pinyin2Hanzi/) library to convert the pinyin to hanzi.
example: [examples/pinyin2hanzi_demo.py](examples/pinyin2hanzi_demo.py):
```python
import sys

from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()


def pinyin2hanzi(pinyin_sentence):
    pinyin_list, _ = PinyinTokenizer().tokenize(pinyin_sentence)
    result = dag(dagparams, pinyin_list, path_num=1)
    return ''.join(result[0].path)


if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")
```
output:
```shell
我
今天天气不错
刘德华
```
# Contact
- Issues and suggestions: [![GitHub issues](https://img.shields.io/github/issues/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/issues)
- Email: xuming624@qq.com
- WeChat: add *WeChat ID: xuming624* with the note *Name-Company-NLP* to join the Python-NLP discussion group.
<img src="docs/wechat.jpeg" width="200" />
# Citation
If you use pinyin-tokenizer in your research, please cite it as follows:
APA:
```
Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer
```
BibTeX:
```latex
@misc{pinyin-tokenizer,
title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
author={Xu Ming},
year={2022},
howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}
```
# License
Licensed under [The Apache License 2.0](LICENSE); free for commercial use. Please include a link to **pinyin-tokenizer** and the license in your product documentation.
# Contribute
The code is still rough. Improvements are welcome; before submitting a PR, please:
- add corresponding unit tests under `tests`
- run all unit tests with `python -m pytest` and make sure they pass

Then submit your PR.
# Related Projects
- Hanzi to pinyin: [pypinyin](https://github.com/mozillazg/python-pinyin)
- Pinyin to hanzi: [Pinyin2Hanzi](https://github.com/letiantian/Pinyin2Hanzi)