# pyuegc
A pure Python implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into **extended grapheme clusters** (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation.” This package conforms to version 16.0 of the Unicode standard, released in September 2024, and has been tested against the official [Unicode test file](https://www.unicode.org/Public/16.0.0/ucd/auxiliary/GraphemeBreakTest.txt), `GraphemeBreakTest.txt`.
### Installation and updates
To install the package, run:
```shell
pip install pyuegc
```
To upgrade to the latest version, run:
```shell
pip install pyuegc --upgrade
```
### Unicode character database (UCD) version
To retrieve the version of the Unicode character database in use:
```python
>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'16.0.0'
```
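The Unicode data bundled with the Python interpreter (exposed by the standard `unicodedata` module as `unidata_version`) may be older or newer than the one used by this package. The following is a minimal sketch of how the two could be compared; the helper `version_tuple` is hypothetical, and the fallback value is simply the version string shown above, used only when `pyuegc` is not installed.

```python
import unicodedata

def version_tuple(version):
    """Turn a dotted version string such as '16.0.0' into (16, 0, 0)."""
    return tuple(int(part) for part in version.split("."))

try:
    from pyuegc import UCD_VERSION
except ImportError:
    # Assumption for illustration: fall back to the value documented above.
    UCD_VERSION = "16.0.0"

package_version = version_tuple(UCD_VERSION)
interpreter_version = version_tuple(unicodedata.unidata_version)

if package_version != interpreter_version:
    print(
        f"pyuegc uses UCD {UCD_VERSION}, "
        f"the interpreter uses UCD {unicodedata.unidata_version}"
    )
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as `'9.0.0' > '16.0.0'`.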
### Example usage
```python
from pyuegc import EGC

def _output(unistr, egc):
    # Format a string and its grapheme clusters as a commented report.
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""

unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Python
# Length of string: 6
# EGC: ['P', 'y', 't', 'h', 'o', 'n']
# Length of EGC: 6
unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: élève
# Length of string: 7
# EGC: ['é', 'l', 'è', 'v', 'e']
# Length of EGC: 5
unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒
# Length of string: 20
# EGC: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# Length of EGC: 5
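# Hangul syllables written as conjoining jamo (15 code points, 6 syllables)
unistr = "\u1100\u1175\u110b\u116e\u11ab\u110e\u1161\u11af\u1106\u1161\u11ab\u1112\u1161\u1103\u1161"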
egc = EGC(unistr)
print(_output(unistr, egc))
# String: 기운찰만하다
# Length of string: 15
# EGC: ['기', '운', '찰', '만', '하', '다']
# Length of EGC: 6
unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: পৌষসংক্রান্তির
# Length of string: 14
# EGC: ['পৌ', 'ষ', 'সং', 'ক্রা', 'ন্তি', 'র']
# Length of EGC: 6
```
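As the outputs above suggest, `EGC` returns a list of clusters, so ordinary list operations apply. The following is a small sketch of truncating text to a given number of user-perceived characters; the helper name `truncate_clusters` is hypothetical, and the cluster list is copied from the `élève` output above so that the snippet runs without the package installed.

```python
def truncate_clusters(clusters, limit):
    """Join at most `limit` grapheme clusters back into a string."""
    return "".join(clusters[:limit])

# Clusters for "e\u0301le\u0300ve", as shown in the output above.
# With pyuegc installed, this list would come from EGC("e\u0301le\u0300ve").
clusters = ["e\u0301", "l", "e\u0300", "v", "e"]

truncated = truncate_clusters(clusters, 3)

# Slicing the raw string at index 3 instead would cut through the second
# accented letter: "e\u0301le\u0300ve"[:3] ends with a bare "l".
print(truncated)
```

Slicing the cluster list rather than the string keeps every combining mark attached to its base character.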
Reversing a string code point by code point can detach combining marks from their base characters, whereas reversing its extended grapheme clusters keeps every user-perceived character intact, regardless of the Unicode normalization form:
```python
unistr = "ai\u0302ne\u0301e" # aînée
print(f"# Reversed string: {''.join(reversed(unistr))!r}")
# Reversed string: 'éen̂ia'
print(f"# EGC processed and reversed: {''.join(reversed(EGC(unistr)))!r}")
# EGC processed and reversed: 'eénîa'
```
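To see why the plain reversal goes wrong, it can help to group the code points by hand. The sketch below uses only the standard `unicodedata` module and a deliberately simplified rule (attach every combining mark to the preceding code point); `naive_clusters` is a hypothetical stand-in for illustration, not the UAX #29 algorithm that `EGC` implements, and it would mishandle Hangul jamo, emoji sequences, and many other cases.

```python
import unicodedata

def naive_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified illustration only, not the UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # mark joins the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

unistr = "ai\u0302ne\u0301e"

# Code point reversal: each mark ends up after the wrong letter.
plain = "".join(reversed(unistr))

# Cluster reversal: each mark travels with its own base letter.
grouped = "".join(reversed(naive_clusters(unistr)))

# The precomposed (NFC) form has fewer code points but the same clusters.
nfc = unicodedata.normalize("NFC", unistr)
print(len(unistr), len(nfc), len(naive_clusters(unistr)), len(naive_clusters(nfc)))
```

For this particular string the simplified grouping happens to agree with the cluster boundaries shown above; for general text, the package's `EGC` function is the appropriate tool.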
### Related resources
This implementation is based on the following resources:
- [“Grapheme Clusters,” in the Unicode core specification, version 16.0.0](https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G52443)
- [Unicode Standard Annex #29: Unicode Text Segmentation, revision 45](https://www.unicode.org/reports/tr29/tr29-45.html)
### Licenses
The code is licensed under the [MIT license](https://github.com/mlodewijck/pyuegc/blob/main/LICENSE).
Usage of Unicode data files is governed by the [UNICODE TERMS OF USE](https://www.unicode.org/copyright.html). Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the [Unicode Data Files and Software License](https://www.unicode.org/license.txt), a copy of which is included as [UNICODE-LICENSE](https://github.com/mlodewijck/pyuegc/blob/main/UNICODE-LICENSE).