| Field | Value |
| --- | --- |
| Name | py-pinyin-split |
| Version | 5.0.0 |
| home_page | None |
| Summary | Library for splitting Hanyu Pinyin phrases into all valid syllable combinations |
| upload_time | 2024-12-09 09:12:25 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | <3.13,>=3.8 |
| license | MIT |
| keywords | chinese, pinyin |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# py-pinyin-split
A Python library for splitting Hanyu Pinyin words into syllables. Built on [NLTK's](https://github.com/nltk/nltk) [tokenizer interface](https://www.nltk.org/api/nltk.tokenize.html), it handles standard syllables defined in the [Pinyin Table](https://en.wikipedia.org/wiki/Pinyin_table) and supports tone marks.
Based originally on [pinyinsplit](https://github.com/throput/pinyinsplit) by [@tomlee](https://github.com/tomlee).
PyPI: https://pypi.org/project/py-pinyin-split/
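Because the tokenizer is built on NLTK's tokenizer interface, it should plug in wherever an NLTK tokenizer is expected. A minimal sketch, assuming `PinyinTokenizer` subclasses `nltk.tokenize.api.TokenizerI` (the README's claim suggests this, but it is not confirmed here), in which case the inherited `tokenize_sents` helper, implemented in terms of `tokenize`, would also work:

```python
from py_pinyin_split import PinyinTokenizer

tokenizer = PinyinTokenizer()

# TokenizerI.tokenize_sents simply maps tokenize over a list of strings,
# so it should be available if PinyinTokenizer subclasses TokenizerI
# (an assumption here, not something the README states explicitly).
phrases = ["nǐhǎo", "Běijīng"]
print(tokenizer.tokenize_sents(phrases))  # [['nǐ', 'hǎo'], ['Běi', 'jīng']]
```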
## Installation
```bash
pip install py-pinyin-split
```
## Usage
Instantiate a tokenizer and split away.
The tokenizer handles standard Hanyu Pinyin with whitespace and punctuation; invalid pinyin syllables raise a `ValueError`.
To choose the most likely split when several are possible, it applies a few simple heuristics: syllable count, presence of vowels, and syllable frequency data.
```python
from py_pinyin_split import PinyinTokenizer
tokenizer = PinyinTokenizer()
# Basic splitting
tokenizer.tokenize("nǐhǎo") # ['nǐ', 'hǎo']
tokenizer.tokenize("Běijīng") # ['Běi', 'jīng']
# Handles whitespace and punctuation
tokenizer.tokenize("Nǐ hǎo ma?") # ['Nǐ', 'hǎo', 'ma', '?']
tokenizer.tokenize("Wǒ hěn hǎo!") # ['Wǒ', 'hěn', 'hǎo', '!']
# Handles ambiguous splits using heuristics
assert tokenizer.tokenize("kěnéng") == ["kě", "néng"]
assert tokenizer.tokenize("rènào") == ["rè", "nào"]
assert tokenizer.tokenize("xīan") == ["xī", "an"]
assert tokenizer.tokenize("xián") == ["xián"]
assert tokenizer.tokenize("wǎn'ān") == ["wǎn", "'", "ān"]
# Tone marks or punctuation help resolve ambiguity
tokenizer.tokenize("xīān") # ['xī', 'ān']
tokenizer.tokenize("xián") # ['xián']
tokenizer.tokenize("Xī'ān") # ["Xī", "'", "ān"]
# Raises ValueError for invalid pinyin
tokenizer.tokenize("hello") # ValueError
# Optional support for non-standard syllables
tokenizer = PinyinTokenizer(include_nonstandard=True)
tokenizer.tokenize("duang") # ['duang']
```
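
Since invalid syllables raise `ValueError`, the tokenizer doubles as a validity check. A small sketch using only the documented `tokenize` call; the helper name `is_valid_pinyin` is hypothetical, not part of the library:

```python
from py_pinyin_split import PinyinTokenizer

tokenizer = PinyinTokenizer()

def is_valid_pinyin(text: str) -> bool:
    """Hypothetical helper: True if `text` splits into valid pinyin syllables."""
    try:
        tokenizer.tokenize(text)
        return True
    except ValueError:
        return False

print(is_valid_pinyin("nǐhǎo"))  # True
print(is_valid_pinyin("hello"))  # False (per the ValueError example above)
```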
## Related Projects
- https://pypi.org/project/pinyintokenizer/
- https://pypi.org/project/pypinyin/
- https://github.com/throput/pinyinsplit
## Raw data

```json
{
  "_id": null,
  "home_page": null,
  "name": "py-pinyin-split",
  "maintainer": null,
  "docs_url": null,
  "requires_python": "<3.13,>=3.8",
  "maintainer_email": "lstrobel <mail@lstrobel.com>",
  "keywords": "chinese, pinyin",
  "author": null,
  "author_email": "lstrobel <mail@lstrobel.com>, Thomas Lee <thomaslee@throput.com>",
  "download_url": "https://files.pythonhosted.org/packages/31/9e/4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd/py_pinyin_split-5.0.0.tar.gz",
  "platform": null,
  "description": "# py-pinyin-split\n\nA Python library for splitting Hanyu Pinyin words into syllables. Built on [NLTK's](https://github.com/nltk/nltk) [tokenizer interface](https://www.nltk.org/api/nltk.tokenize.html), it handles standard syllables defined in the [Pinyin Table](https://en.wikipedia.org/wiki/Pinyin_table) and supports tone marks.\n\n\nBased originally on [pinyinsplit](https://github.com/throput/pinyinsplit) by [@tomlee](https://github.com/tomlee).\n\nPyPI: https://pypi.org/project/py-pinyin-split/\n\n## Installation\n\n```bash\npip install py-pinyin-split\n```\n\n## Usage\n\nInstantiate a tokenizer and split away.\n\nThe tokenizer can handle standard Hanyu Pinyin with whitespaces and punctuation. However, invalid pinyin syllables will raise a `ValueError`\n\nThe tokenizer uses some basic heuristics to determine the most likely split - number of syllables, presence of vowels, and syllable frequency data.\n\n```python\nfrom py_pinyin_split import PinyinTokenizer\n\ntokenizer = PinyinTokenizer()\n\n# Basic splitting\ntokenizer.tokenize(\"n\u01d0h\u01ceo\") # ['n\u01d0', 'h\u01ceo']\ntokenizer.tokenize(\"B\u011bij\u012bng\") # ['B\u011bi', 'j\u012bng']\n\n# Handles whitespace and punctuation\ntokenizer.tokenize(\"N\u01d0 h\u01ceo ma?\") # ['N\u01d0', 'h\u01ceo', 'ma', '?']\ntokenizer.tokenize(\"W\u01d2 h\u011bn h\u01ceo!\") # ['W\u01d2', 'h\u011bn', 'h\u01ceo', '!']\n\n# Handles ambiguous splits using heuristics\ntokenizer.tokenize(\"k\u011bn\u00e9ng\") == [\"k\u011b\", \"n\u00e9ng\"]\ntokenizer.tokenize(\"r\u00e8n\u00e0o\") == [\"r\u00e8\", \"n\u00e0o\"]\ntokenizer.tokenize(\"x\u012ban\") == [\"x\u012b\", \"an\"]\ntokenizer.tokenize(\"xi\u00e1n\") == [\"xi\u00e1n\"]\ntokenizer.tokenize(\"w\u01cen'\u0101n\") == [\"w\u01cen\", \"'\", \"\u0101n\"]\n\n# Tone marks or punctuation help resolve ambiguity\ntokenizer.tokenize(\"x\u012b\u0101n\") # ['x\u012b', '\u0101n']\ntokenizer.tokenize(\"xi\u00e1n\") # ['xi\u00e1n']\ntokenizer.tokenize(\"X\u012b'\u0101n\") # [\"X\u012b\", \"'\", \"\u0101n\"]\n\n# Raises ValueError for invalid pinyin\ntokenizer.tokenize(\"hello\") # ValueError\n\n# Optional support for non-standard syllables\ntokenizer = PinyinTokenizer(include_nonstandard=True)\ntokenizer.tokenize(\"duang\") # ['duang']\n```\n\n## Related Projects\n- https://pypi.org/project/pinyintokenizer/\n- https://pypi.org/project/pypinyin/\n- https://github.com/throput/pinyinsplit\n",
  "bugtrack_url": null,
  "license": "MIT",
  "summary": "Library for splitting Hanyu Pinyin phrases into all valid syllable combinations",
  "version": "5.0.0",
  "project_urls": {
    "Documentation": "https://github.com/lstrobel/py-pinyin-split#readme",
    "Issues": "https://github.com/lstrobel/py-pinyin-split/issues",
    "Source": "https://github.com/lstrobel/py-pinyin-split"
  },
  "split_keywords": [
    "chinese",
    " pinyin"
  ],
  "urls": [
    {
      "comment_text": "",
      "digests": {
        "blake2b_256": "c7b64e068cff1bdf59625b7691c8c8fceb33dda0e01cd3facb6b69fe42f6e7e9",
        "md5": "d707d599cbf54689543ac6ab08be182b",
        "sha256": "05b1f74ad50a27f43977be1aab0570028146213d3ec86b2b40403d1f8f040fb9"
      },
      "downloads": -1,
      "filename": "py_pinyin_split-5.0.0-py3-none-any.whl",
      "has_sig": false,
      "md5_digest": "d707d599cbf54689543ac6ab08be182b",
      "packagetype": "bdist_wheel",
      "python_version": "py3",
      "requires_python": "<3.13,>=3.8",
      "size": 10296,
      "upload_time": "2024-12-09T09:12:23",
      "upload_time_iso_8601": "2024-12-09T09:12:23.352416Z",
      "url": "https://files.pythonhosted.org/packages/c7/b6/4e068cff1bdf59625b7691c8c8fceb33dda0e01cd3facb6b69fe42f6e7e9/py_pinyin_split-5.0.0-py3-none-any.whl",
      "yanked": false,
      "yanked_reason": null
    },
    {
      "comment_text": "",
      "digests": {
        "blake2b_256": "319e4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd",
        "md5": "584718c4b8198ebd509a2fabccefd553",
        "sha256": "19ebd71af5bc136ea78d8b124d1155fca91498809be464861f7273e848719e97"
      },
      "downloads": -1,
      "filename": "py_pinyin_split-5.0.0.tar.gz",
      "has_sig": false,
      "md5_digest": "584718c4b8198ebd509a2fabccefd553",
      "packagetype": "sdist",
      "python_version": "source",
      "requires_python": "<3.13,>=3.8",
      "size": 34740,
      "upload_time": "2024-12-09T09:12:25",
      "upload_time_iso_8601": "2024-12-09T09:12:25.072306Z",
      "url": "https://files.pythonhosted.org/packages/31/9e/4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd/py_pinyin_split-5.0.0.tar.gz",
      "yanked": false,
      "yanked_reason": null
    }
  ],
  "upload_time": "2024-12-09 09:12:25",
  "github": true,
  "gitlab": false,
  "bitbucket": false,
  "codeberg": false,
  "github_user": "lstrobel",
  "github_project": "py-pinyin-split#readme",
  "travis_ci": false,
  "coveralls": false,
  "github_actions": true,
  "lcname": "py-pinyin-split"
}
```
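
The `urls` entries above record sha256 digests for the published wheel and sdist. A standard-library sketch that re-checks the sdist against the digest listed in the raw data (URL and hash copied verbatim from the `urls` entry):

```python
import hashlib
import urllib.request

# Values copied from the sdist entry in the "urls" list above.
SDIST_URL = (
    "https://files.pythonhosted.org/packages/31/9e/"
    "4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd/"
    "py_pinyin_split-5.0.0.tar.gz"
)
EXPECTED_SHA256 = "19ebd71af5bc136ea78d8b124d1155fca91498809be464861f7273e848719e97"

with urllib.request.urlopen(SDIST_URL) as resp:
    data = resp.read()

digest = hashlib.sha256(data).hexdigest()
assert digest == EXPECTED_SHA256, f"digest mismatch: {digest}"
print("sdist sha256 verified")
```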