# pyuegc
An implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into **extended grapheme clusters** (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation”. This package supports version 15.1 of the Unicode standard (released in September 2023). It has been thoroughly tested against the [Unicode test file](https://www.unicode.org/Public/15.1.0/ucd/auxiliary/GraphemeBreakTest.txt).
### Installation
The easiest way to install the package is with pip:
```shell
pip install pyuegc
```
### UCD version
To get the version of the Unicode Character Database (UCD) that the package currently uses:
```python
>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'15.1.0'
```
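The version string can also be compared with the Unicode data bundled in the Python interpreter, which may lag behind or run ahead of this package. The following is a minimal sketch, not part of pyuegc: `parse_version` and `same_unicode_release` are hypothetical helper names, and only the standard-library attribute `unicodedata.unidata_version` is relied upon.

```python
import unicodedata
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '15.1.0' into (15, 1, 0)."""
    return tuple(int(part) for part in version.split("."))


def same_unicode_release(package_version: str) -> bool:
    """Return True if `package_version` and the interpreter's own
    Unicode data share the same major.minor release."""
    interpreter = parse_version(unicodedata.unidata_version)
    return parse_version(package_version)[:2] == interpreter[:2]
```

With the package installed, `same_unicode_release(UCD_VERSION)` indicates whether segmentation and the interpreter's `unicodedata` functions are based on the same Unicode release.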
### Example usage
```python
from pyuegc import EGC


def _output(unistr, egc):
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""
unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Python
# Length of string: 6
# EGC: ['P', 'y', 't', 'h', 'o', 'n']
# Length of EGC: 6
unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: élève
# Length of string: 7
# EGC: ['é', 'l', 'è', 'v', 'e']
# Length of EGC: 5
unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒
# Length of string: 20
# EGC: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# Length of EGC: 5
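unistr = "\u1100\u1175\u110b\u116e\u11ab\u110e\u1161\u11af\u1106\u1161\u11ab\u1112\u1161\u1103\u1161"  # 기운찰만하다, written with conjoining jamo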
egc = EGC(unistr)
print(_output(unistr, egc))
# String: 기운찰만하다
# Length of string: 15
# EGC: ['기', '운', '찰', '만', '하', '다']
# Length of EGC: 6
unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: পৌষসংক্রান্তির
# Length of string: 14
# EGC: ['পৌ', 'ষ', 'সং', 'ক্রা', 'ন্তি', 'র']
# Length of EGC: 6
unistr = "ai\u0302ne\u0301e" # aînée
print(f"# Reversed string:\n# {''.join(reversed(unistr))}")
print(f"# Reversed EGC: \n# {''.join(reversed(EGC(unistr)))}")
# Reversed string:
# éen̂ia -> wrong (diacritics are messed up)
# Reversed EGC:
# eénîa -> right (regardless of the Unicode normalization form)
```
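Another common use of the cluster list is truncating text without separating a base letter from its combining marks. The sketch below assumes only what the examples above show, namely that the segmenter returns a list of strings. `truncate` and its `segment` parameter are hypothetical names introduced for illustration and are not part of pyuegc.

```python
from typing import Callable, List


def truncate(text: str, max_clusters: int,
             segment: Callable[[str], List[str]]) -> str:
    """Keep at most `max_clusters` user-perceived characters of `text`.

    `segment` is any callable that splits a string into a list of
    grapheme clusters (assumption: pyuegc's EGC can be passed here).
    """
    if max_clusters < 0:
        raise ValueError("max_clusters must be non-negative")
    clusters = segment(text)
    # Slicing the cluster list, not the string, keeps each base
    # character together with its combining marks.
    return "".join(clusters[:max_clusters])
```

With the package installed, `truncate("e\u0301le\u0300ve", 2, EGC)` keeps the first two clusters, `"e\u0301l"`, whereas the plain slice `"e\u0301le\u0300ve"[:2]` would stop after the first accented letter.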
### Related resources
This implementation is based on the following resources:
- [“Grapheme Clusters”, in the Unicode core specification, version 15.1.0](https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G52443)
- [Unicode Standard Annex #29: Unicode Text Segmentation, version 43](https://www.unicode.org/reports/tr29/tr29-43.html)
### Licenses
The code is available under the [MIT license](https://github.com/mlodewijck/pyuegc/blob/main/LICENSE).
Usage of Unicode data files is governed by the [UNICODE TERMS OF USE](https://www.unicode.org/copyright.html). Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the [Unicode Data Files and Software License](https://www.unicode.org/license.txt), a copy of which is included as [UNICODE-LICENSE](https://github.com/mlodewijck/pyunormalize/blob/master/UNICODE-LICENSE).
Raw data
{
"_id": null,
"home_page": "https://github.com/mlodewijck/pyuegc",
"name": "pyuegc",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "Unicode,Unicode grapheme clusters,extended grapheme cluster,EGC,grapheme cluster,graphemes,segmentation",
"author": "Marc Lodewijck",
"author_email": "mlodewijck@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/05/c2/fa08cb7a50bfee73015d8c286f9c622fc7ecc8e7c5810c6627f3884d0a44/pyuegc-15.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "An implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29",
"version": "15.1.0",
"project_urls": {
"Bug Reports": "https://github.com/mlodewijck/pyuegc/issues",
"Homepage": "https://github.com/mlodewijck/pyuegc",
"Source": "https://github.com/mlodewijck/pyuegc/"
},
"split_keywords": [
"unicode",
"unicode grapheme clusters",
"extended grapheme cluster",
"egc",
"grapheme cluster",
"graphemes",
"segmentation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "05c2fa08cb7a50bfee73015d8c286f9c622fc7ecc8e7c5810c6627f3884d0a44",
"md5": "19c0baaad7bb38523b6deee578f3cd13",
"sha256": "0786d5ca191997f183ffa7413240cac86100bb746117a1c0502314436db38e0b"
},
"downloads": -1,
"filename": "pyuegc-15.1.0.tar.gz",
"has_sig": false,
"md5_digest": "19c0baaad7bb38523b6deee578f3cd13",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 81309,
"upload_time": "2023-11-11T17:46:43",
"upload_time_iso_8601": "2023-11-11T17:46:43.114745Z",
"url": "https://files.pythonhosted.org/packages/05/c2/fa08cb7a50bfee73015d8c286f9c622fc7ecc8e7c5810c6627f3884d0a44/pyuegc-15.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-11 17:46:43",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "mlodewijck",
"github_project": "pyuegc",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pyuegc"
}