pyuegc


- Name: pyuegc
- Version: 16.0.0
- Home page: https://github.com/mlodewijck/pyuegc
- Summary: An implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29.
- Upload time: 2024-09-17 16:35:38
- Author: Marc Lodewijck
- Requires Python: >=3.6
- License: MIT
- Keywords: Unicode, Unicode grapheme clusters, extended grapheme clusters, EGC, grapheme clusters, graphemes, segmentation

# pyuegc
A pure Python implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into **extended grapheme clusters** (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation.” This package conforms to version 16.0 of the Unicode standard, released in September 2024, and has been tested against the official [Unicode test file](https://www.unicode.org/Public/16.0.0/ucd/auxiliary/GraphemeBreakTest.txt) for grapheme cluster breaks.

### Installation and updates
To install the package, run:
```shell
pip install pyuegc
```

To upgrade to the latest version, run:
```shell
pip install pyuegc --upgrade
```

### Unicode Character Database (UCD) version
To retrieve the version of the Unicode Character Database used by the package:
```python
>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'16.0.0'
```
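Python's standard library ships its own copy of the Unicode Character Database, exposed as `unicodedata.unidata_version`, and that copy may be older than the one bundled with pyuegc. The sketch below compares the two versions. It makes two assumptions: `PACKAGE_UCD_VERSION` is a hand-copied stand-in for `pyuegc.UCD_VERSION`, so that the snippet runs without the package installed, and `version_tuple` is a hypothetical helper that is not part of pyuegc.

```python
import unicodedata


def version_tuple(version):
    """Turn a dotted version string such as '16.0.0' into (16, 0, 0)."""
    return tuple(int(part) for part in version.split("."))


# Assumption: copied by hand from the UCD_VERSION value shown above.
# With pyuegc installed, `from pyuegc import UCD_VERSION` could be used instead.
PACKAGE_UCD_VERSION = "16.0.0"

stdlib_version = version_tuple(unicodedata.unidata_version)
package_version = version_tuple(PACKAGE_UCD_VERSION)

if stdlib_version < package_version:
    print("The interpreter's unicodedata module is older than the package data.")
elif stdlib_version > package_version:
    print("The interpreter's unicodedata module is newer than the package data.")
else:
    print("Both use the same Unicode Character Database version.")
```

Comparing tuples of integers rather than raw strings avoids the pitfall that `"9.0.0" > "16.0.0"` when compared as text.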

### Example usage
```python
from pyuegc import EGC

def _output(unistr, egc):
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""

unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Python
# Length of string: 6
# EGC: ['P', 'y', 't', 'h', 'o', 'n']
# Length of EGC: 6

unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: élève
# Length of string: 7
# EGC: ['é', 'l', 'è', 'v', 'e']
# Length of EGC: 5

unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒
# Length of string: 20
# EGC: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# Length of EGC: 5

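unistr = "\u1100\u1175\u110b\u116e\u11ab\u110e\u1161\u11af\u1106\u1161\u11ab\u1112\u1161\u1103\u1161"  # 기운찰만하다 written with conjoining jamo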
egc = EGC(unistr)
print(_output(unistr, egc))
# String: 기운찰만하다
# Length of string: 15
# EGC: ['기', '운', '찰', '만', '하', '다']
# Length of EGC: 6

unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: পৌষসংক্রান্তির
# Length of string: 14
# EGC: ['পৌ', 'ষ', 'সং', 'ক্রা', 'ন্তি', 'র']
# Length of EGC: 6
```
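To see why the string length and the cluster count differ, it can help to look at a deliberately naive approximation built only on the standard library. The sketch below is not the UAX #29 algorithm and not how pyuegc works; `naive_clusters` is a hypothetical helper that simply attaches every combining mark (general categories `Mn`, `Me`, `Mc`) to the code point before it. It gets the accented Latin example right but leaves conjoining Hangul jamo unsegmented, which is one reason a full implementation of the specified rules is needed.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Me", "Mc")  # nonspacing, enclosing, spacing marks


def naive_clusters(text):
    """Attach each combining mark to the preceding code point.

    Illustration only: this ignores the UAX #29 rules for Hangul
    syllables, emoji sequences, conjuncts, and more.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in MARK_CATEGORIES:
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


latin = "e\u0301le\u0300ve"
print(len(latin), len(naive_clusters(latin)))    # 7 5

# Conjoining jamo are letters (category Lo), so the naive rule never joins them.
hangul = "\u1100\u1175\u110b\u116e\u11ab"
print(len(hangul), len(naive_clusters(hangul)))  # 5 5
```

According to the example output above, `EGC` groups the same jamo into syllable clusters, which the naive rule cannot do.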

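Reversing a string code point by code point can detach combining marks from their base characters and attach them to the wrong ones. Reversing the list of extended grapheme clusters keeps each cluster intact, so the visual appearance of every character is preserved regardless of the Unicode normalization form: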
```python
from pyuegc import EGC

unistr = "ai\u0302ne\u0301e"  # aînée, with combining circumflex and acute accents

print(f"# Reversed string: {''.join(reversed(unistr))!r}")
# Reversed string: 'éen̂ia'

print(f"# EGC processed and reversed: {''.join(reversed(EGC(unistr)))!r}")
# EGC processed and reversed: 'eénîa'
```

### Related resources
This implementation is based on the following resources:
- [“Grapheme Clusters,” in the Unicode core specification, version 16.0.0](https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G52443)
- [Unicode Standard Annex #29: Unicode Text Segmentation, revision 45](https://www.unicode.org/reports/tr29/tr29-45.html)

### Licenses
The code is licensed under the [MIT license](https://github.com/mlodewijck/pyuegc/blob/main/LICENSE).

Usage of Unicode data files is governed by the [UNICODE TERMS OF USE](https://www.unicode.org/copyright.html). Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the [Unicode Data Files and Software License](https://www.unicode.org/license.txt), a copy of which is included as [UNICODE-LICENSE](https://github.com/mlodewijck/pyuegc/blob/main/UNICODE-LICENSE).

            
