| Field | Value |
| --- | --- |
| Name | py-pinyin-split |
| Version | 5.0.0 |
| home_page | None |
| Summary | Library for splitting Hanyu Pinyin phrases into all valid syllable combinations |
| upload_time | 2024-12-09 09:12:25 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | <3.13,>=3.8 |
| license | MIT |
| keywords | chinese, pinyin |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# py-pinyin-split
A Python library for splitting Hanyu Pinyin words into syllables. Built on [NLTK's](https://github.com/nltk/nltk) [tokenizer interface](https://www.nltk.org/api/nltk.tokenize.html), it handles standard syllables defined in the [Pinyin Table](https://en.wikipedia.org/wiki/Pinyin_table) and supports tone marks.
Based originally on [pinyinsplit](https://github.com/throput/pinyinsplit) by [@tomlee](https://github.com/tomlee).
PyPI: https://pypi.org/project/py-pinyin-split/
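Because the tokenizer is built on NLTK's tokenizer interface, it should plug in wherever an NLTK tokenizer is expected. A minimal sketch, assuming `PinyinTokenizer` subclasses `nltk.tokenize.api.TokenizerI` (the README's claim suggests this, but it is not confirmed here), in which case the inherited `tokenize_sents` helper, implemented in terms of `tokenize`, would also work:

```python
from py_pinyin_split import PinyinTokenizer

tokenizer = PinyinTokenizer()

# TokenizerI.tokenize_sents simply maps tokenize over a list of strings,
# so it should be available if PinyinTokenizer subclasses TokenizerI
# (an assumption here, not something the README states explicitly).
phrases = ["nǐhǎo", "Běijīng"]
print(tokenizer.tokenize_sents(phrases))  # [['nǐ', 'hǎo'], ['Běi', 'jīng']]
```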
## Installation
```bash
pip install py-pinyin-split
```
## Usage
Instantiate a tokenizer and split away.
The tokenizer handles standard Hanyu Pinyin with whitespace and punctuation; invalid pinyin syllables raise a `ValueError`.
To choose the most likely split when several are possible, it applies a few simple heuristics: syllable count, presence of vowels, and syllable frequency data.
```python
from py_pinyin_split import PinyinTokenizer
tokenizer = PinyinTokenizer()
# Basic splitting
tokenizer.tokenize("nǐhǎo") # ['nǐ', 'hǎo']
tokenizer.tokenize("Běijīng") # ['Běi', 'jīng']
# Handles whitespace and punctuation
tokenizer.tokenize("Nǐ hǎo ma?") # ['Nǐ', 'hǎo', 'ma', '?']
tokenizer.tokenize("Wǒ hěn hǎo!") # ['Wǒ', 'hěn', 'hǎo', '!']
# Handles ambiguous splits using heuristics
assert tokenizer.tokenize("kěnéng") == ["kě", "néng"]
assert tokenizer.tokenize("rènào") == ["rè", "nào"]
assert tokenizer.tokenize("xīan") == ["xī", "an"]
assert tokenizer.tokenize("xián") == ["xián"]
assert tokenizer.tokenize("wǎn'ān") == ["wǎn", "'", "ān"]
# Tone marks or punctuation help resolve ambiguity
tokenizer.tokenize("xīān") # ['xī', 'ān']
tokenizer.tokenize("xián") # ['xián']
tokenizer.tokenize("Xī'ān") # ["Xī", "'", "ān"]
# Raises ValueError for invalid pinyin
tokenizer.tokenize("hello") # ValueError
# Optional support for non-standard syllables
tokenizer = PinyinTokenizer(include_nonstandard=True)
tokenizer.tokenize("duang") # ['duang']
```
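
Since invalid syllables raise `ValueError`, the tokenizer doubles as a validity check. A small sketch using only the documented `tokenize` call; the helper name `is_valid_pinyin` is hypothetical, not part of the library:

```python
from py_pinyin_split import PinyinTokenizer

tokenizer = PinyinTokenizer()

def is_valid_pinyin(text: str) -> bool:
    """Hypothetical helper: True if `text` splits into valid pinyin syllables."""
    try:
        tokenizer.tokenize(text)
        return True
    except ValueError:
        return False

print(is_valid_pinyin("nǐhǎo"))  # True
print(is_valid_pinyin("hello"))  # False (per the ValueError example above)
```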
## Related Projects
- https://pypi.org/project/pinyintokenizer/
- https://pypi.org/project/pypinyin/
- https://github.com/throput/pinyinsplit
## Raw data

```json
{
  "_id": null,
  "home_page": null,
  "name": "py-pinyin-split",
  "maintainer": null,
  "docs_url": null,
  "requires_python": "<3.13,>=3.8",
  "maintainer_email": "lstrobel <mail@lstrobel.com>",
  "keywords": "chinese, pinyin",
  "author": null,
  "author_email": "lstrobel <mail@lstrobel.com>, Thomas Lee <thomaslee@throput.com>",
  "download_url": "https://files.pythonhosted.org/packages/31/9e/4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd/py_pinyin_split-5.0.0.tar.gz",
  "platform": null,
  "description": "# py-pinyin-split\n\nA Python library for splitting Hanyu Pinyin words into syllables. Built on [NLTK's](https://github.com/nltk/nltk) [tokenizer interface](https://www.nltk.org/api/nltk.tokenize.html), it handles standard syllables defined in the [Pinyin Table](https://en.wikipedia.org/wiki/Pinyin_table) and supports tone marks.\n\n\nBased originally on [pinyinsplit](https://github.com/throput/pinyinsplit) by [@tomlee](https://github.com/tomlee).\n\nPyPI: https://pypi.org/project/py-pinyin-split/\n\n## Installation\n\n```bash\npip install py-pinyin-split\n```\n\n## Usage\n\nInstantiate a tokenizer and split away.\n\nThe tokenizer can handle standard Hanyu Pinyin with whitespaces and punctuation. However, invalid pinyin syllables will raise a `ValueError`\n\nThe tokenizer uses some basic heuristics to determine the most likely split - number of syllables, presence of vowels, and syllable frequency data.\n\n```python\nfrom py_pinyin_split import PinyinTokenizer\n\ntokenizer = PinyinTokenizer()\n\n# Basic splitting\ntokenizer.tokenize(\"n\u01d0h\u01ceo\") # ['n\u01d0', 'h\u01ceo']\ntokenizer.tokenize(\"B\u011bij\u012bng\") # ['B\u011bi', 'j\u012bng']\n\n# Handles whitespace and punctuation\ntokenizer.tokenize(\"N\u01d0 h\u01ceo ma?\") # ['N\u01d0', 'h\u01ceo', 'ma', '?']\ntokenizer.tokenize(\"W\u01d2 h\u011bn h\u01ceo!\") # ['W\u01d2', 'h\u011bn', 'h\u01ceo', '!']\n\n# Handles ambiguous splits using heuristics\ntokenizer.tokenize(\"k\u011bn\u00e9ng\") == [\"k\u011b\", \"n\u00e9ng\"]\ntokenizer.tokenize(\"r\u00e8n\u00e0o\") == [\"r\u00e8\", \"n\u00e0o\"]\ntokenizer.tokenize(\"x\u012ban\") == [\"x\u012b\", \"an\"]\ntokenizer.tokenize(\"xi\u00e1n\") == [\"xi\u00e1n\"]\ntokenizer.tokenize(\"w\u01cen'\u0101n\") == [\"w\u01cen\", \"'\", \"\u0101n\"]\n\n# Tone marks or punctuation help resolve ambiguity\ntokenizer.tokenize(\"x\u012b\u0101n\") # ['x\u012b', '\u0101n']\ntokenizer.tokenize(\"xi\u00e1n\") # ['xi\u00e1n']\ntokenizer.tokenize(\"X\u012b'\u0101n\") # [\"X\u012b\", \"'\", \"\u0101n\"]\n\n# Raises ValueError for invalid pinyin\ntokenizer.tokenize(\"hello\") # ValueError\n\n# Optional support for non-standard syllables\ntokenizer = PinyinTokenizer(include_nonstandard=True)\ntokenizer.tokenize(\"duang\") # ['duang']\n```\n\n## Related Projects\n- https://pypi.org/project/pinyintokenizer/\n- https://pypi.org/project/pypinyin/\n- https://github.com/throput/pinyinsplit\n",
  "bugtrack_url": null,
  "license": "MIT",
  "summary": "Library for splitting Hanyu Pinyin phrases into all valid syllable combinations",
  "version": "5.0.0",
  "project_urls": {
    "Documentation": "https://github.com/lstrobel/py-pinyin-split#readme",
    "Issues": "https://github.com/lstrobel/py-pinyin-split/issues",
    "Source": "https://github.com/lstrobel/py-pinyin-split"
  },
  "split_keywords": [
    "chinese",
    " pinyin"
  ],
  "urls": [
    {
      "comment_text": "",
      "digests": {
        "blake2b_256": "c7b64e068cff1bdf59625b7691c8c8fceb33dda0e01cd3facb6b69fe42f6e7e9",
        "md5": "d707d599cbf54689543ac6ab08be182b",
        "sha256": "05b1f74ad50a27f43977be1aab0570028146213d3ec86b2b40403d1f8f040fb9"
      },
      "downloads": -1,
      "filename": "py_pinyin_split-5.0.0-py3-none-any.whl",
      "has_sig": false,
      "md5_digest": "d707d599cbf54689543ac6ab08be182b",
      "packagetype": "bdist_wheel",
      "python_version": "py3",
      "requires_python": "<3.13,>=3.8",
      "size": 10296,
      "upload_time": "2024-12-09T09:12:23",
      "upload_time_iso_8601": "2024-12-09T09:12:23.352416Z",
      "url": "https://files.pythonhosted.org/packages/c7/b6/4e068cff1bdf59625b7691c8c8fceb33dda0e01cd3facb6b69fe42f6e7e9/py_pinyin_split-5.0.0-py3-none-any.whl",
      "yanked": false,
      "yanked_reason": null
    },
    {
      "comment_text": "",
      "digests": {
        "blake2b_256": "319e4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd",
        "md5": "584718c4b8198ebd509a2fabccefd553",
        "sha256": "19ebd71af5bc136ea78d8b124d1155fca91498809be464861f7273e848719e97"
      },
      "downloads": -1,
      "filename": "py_pinyin_split-5.0.0.tar.gz",
      "has_sig": false,
      "md5_digest": "584718c4b8198ebd509a2fabccefd553",
      "packagetype": "sdist",
      "python_version": "source",
      "requires_python": "<3.13,>=3.8",
      "size": 34740,
      "upload_time": "2024-12-09T09:12:25",
      "upload_time_iso_8601": "2024-12-09T09:12:25.072306Z",
      "url": "https://files.pythonhosted.org/packages/31/9e/4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd/py_pinyin_split-5.0.0.tar.gz",
      "yanked": false,
      "yanked_reason": null
    }
  ],
  "upload_time": "2024-12-09 09:12:25",
  "github": true,
  "gitlab": false,
  "bitbucket": false,
  "codeberg": false,
  "github_user": "lstrobel",
  "github_project": "py-pinyin-split#readme",
  "travis_ci": false,
  "coveralls": false,
  "github_actions": true,
  "lcname": "py-pinyin-split"
}
```
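
The `urls` entries above record sha256 digests for the published wheel and sdist. A standard-library sketch that re-checks the sdist against the digest listed in the raw data (URL and hash copied verbatim from the `urls` entry):

```python
import hashlib
import urllib.request

# Values copied from the sdist entry in the "urls" list above.
SDIST_URL = (
    "https://files.pythonhosted.org/packages/31/9e/"
    "4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd/"
    "py_pinyin_split-5.0.0.tar.gz"
)
EXPECTED_SHA256 = "19ebd71af5bc136ea78d8b124d1155fca91498809be464861f7273e848719e97"

with urllib.request.urlopen(SDIST_URL) as resp:
    data = resp.read()

digest = hashlib.sha256(data).hexdigest()
assert digest == EXPECTED_SHA256, f"digest mismatch: {digest}"
print("sdist sha256 verified")
```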