pinyintokenizer

- Name: pinyintokenizer
- Version: 0.0.2
- Home page: https://github.com/shibing624/pinyin-tokenizer
- Summary: Pinyin Tokenizer, Chinese pinyin tokenizer
- Upload time: 2023-06-08 08:04:40
- Author: XuMing
- License: Apache 2.0
- Keywords: pinyin-tokenizer, pinyin, pinyintokenizer, tokenizer

[![PyPI version](https://badge.fury.io/py/pinyintokenizer.svg)](https://badge.fury.io/py/pinyin-tokenizer)
[![Downloads](https://pepy.tech/badge/pinyintokenizer)](https://pepy.tech/project/pinyin-tokenizer)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.5%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/issues)
[![Wechat Group](http://vlog.sfyc.ltd/wechat_everyday/wxgroup_logo.png?imageView2/0/w/60/h/20)](#Contact)

# Pinyin Tokenizer
Pinyin tokenizer: splits continuous pinyin text into a list of single-syllable pinyin, ready to use out of the box. Developed for Python 3.


**Guide**

- [Feature](#Feature)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [Citation](#Citation)
- [Related-Projects](#Related-Projects)

# Feature

- Based on a prefix tree (PyTrie), it quickly and efficiently splits continuous pinyin into a list of single-syllable pinyin, which simplifies downstream processing such as pinyin-to-Hanzi conversion (see the sketch below).
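
The segmentation idea is easy to sketch. The following is a minimal, hypothetical illustration of trie-based longest-match splitting, not the library's actual code; the real implementation builds its trie from the full table of valid pinyin syllables, while `SYLLABLES` here is a tiny made-up subset:

```python
# Minimal sketch only: hypothetical syllable subset, not the real pinyin table.
SYLLABLES = {"ni", "hao", "liu", "de", "hua", "wo", "xi", "an"}

# Build a character trie; the empty-string key marks the end of a valid syllable.
trie = {}
for syllable in SYLLABLES:
    node = trie
    for ch in syllable:
        node = node.setdefault(ch, {})
    node[""] = True


def segment(text):
    """Greedy longest-match walk over the trie; unmatched characters go to `errors`."""
    pinyins, errors = [], []
    i = 0
    while i < len(text):
        node, j, last_end = trie, i, -1
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "" in node:
                last_end = j  # remember the longest valid syllable ending here
        if last_end > 0:
            pinyins.append(text[i:last_end])
            i = last_end
        else:
            errors.append(text[i])
            i += 1
    return pinyins, errors


print(segment("nihao"))     # (['ni', 'hao'], [])
print(segment("liudehua"))  # (['liu', 'de', 'hua'], [])
```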

# Install

- Requirements and Installation

```shell
pip install pinyintokenizer
```

or

```shell
git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install
```


# Usage

## Pinyin Segmentation (Pinyin Tokenizer)

example: [examples/pinyin_tokenize_demo.py](examples/pinyin_tokenize_demo.py):


```python
import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学

    # not support english
    print(f"{m.tokenize('good luck')}")
```

output:

```shell
(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])
```
- The `tokenize` method returns two results: the first is the list of pinyin syllables, and the second is the list of invalid (non-pinyin) characters.
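
Both return values can be unpacked directly; for example (values match the output shown above):

```python
from pinyintokenizer import PinyinTokenizer

syllables, invalid = PinyinTokenizer().tokenize('xi anjiaotongdaxue')
print(syllables)  # ['xi', 'an', 'jiao', 'tong', 'da', 'xue']
print(invalid)    # [' ']
```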


## Continuous Pinyin to Hanzi (Pinyin2Hanzi)
First segment the continuous pinyin with this library, [pinyintokenizer](https://pypi.org/project/pinyintokenizer/), then convert the pinyin to Chinese characters with the [Pinyin2Hanzi](https://pypi.org/project/Pinyin2Hanzi/) library.

example: [examples/pinyin2hanzi_demo.py](examples/pinyin2hanzi_demo.py):


```python
import sys
from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()


def pinyin2hanzi(pinyin_sentence):
    # Segment the continuous pinyin, then let the DAG model pick the single
    # most likely Hanzi path (path_num=1).
    pinyin_list, _ = PinyinTokenizer().tokenize(pinyin_sentence)
    result = dag(dagparams, pinyin_list, path_num=1)
    return ''.join(result[0].path)


if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")
```

output:

```shell
我
今天天气不错
刘德华
```



# Contact

- Issues (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/pinyin-tokenizer.svg)](https://github.com/shibing624/pinyin-tokenizer/issues)
- Email: xuming624@qq.com
- WeChat: add me (*WeChat ID: xuming624*) to join the Python-NLP discussion group; include the note *Name-Company-NLP*.
<img src="docs/wechat.jpeg" width="200" />


# Citation

If you use pinyin-tokenizer in your research, please cite it as follows:

APA:
```latex
Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer
```

BibTeX:
```latex
@misc{pinyin-tokenizer,
  title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}
```


# License


Licensed under [The Apache License 2.0](LICENSE), free for commercial use. Please include a link to **pinyin-tokenizer** and the license in your product documentation.


# Contribute
The project code is still rough; if you can improve it, you are welcome to contribute the changes back to this project. Before submitting, please note two things:

 - Add corresponding unit tests under `tests`
 - Run all unit tests with `python -m pytest` and make sure they all pass

Then you can open a PR.


# Related Projects

- Hanzi to pinyin: [pypinyin](https://github.com/mozillazg/python-pinyin)
- Pinyin to Hanzi: [Pinyin2Hanzi](https://github.com/letiantian/Pinyin2Hanzi)
            
