pyuegc

Name	pyuegc
Version	15.1.0
home_page	https://github.com/mlodewijck/pyuegc
Summary	An implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29
upload_time	2023-11-11 17:46:43
maintainer
docs_url	None
author	Marc Lodewijck
requires_python	>=3.6
license	MIT
keywords	unicode unicode grapheme clusters extended grapheme cluster egc grapheme cluster graphemes segmentation
requirements	No requirements were recorded.

# pyuegc
An implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into **extended grapheme clusters** (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation”. This package supports version&nbsp;15.1 of the Unicode standard (released in September&nbsp;2023). It has been thoroughly tested against the [Unicode test file](https://www.unicode.org/Public/15.1.0/ucd/auxiliary/GraphemeBreakTest.txt).
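
The test file linked above lists code points in hexadecimal, with `÷` marking a cluster boundary and `×` marking the absence of one; anything after `#` is a comment. As a rough illustration of how a segmenter could be checked against that format, here is a minimal sketch. The helper names `parse_break_test_line` and `failing_lines` are hypothetical, and this is not the package's own test harness.

```python
def parse_break_test_line(line):
    """Return (text, expected_clusters) for a data line, or None otherwise."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None  # blank line or comment-only line
    clusters = []
    current = ""
    for token in data.split():
        if token == "÷":  # boundary: close the cluster built so far
            if current:
                clusters.append(current)
                current = ""
        elif token == "×":  # no boundary: keep extending the cluster
            continue
        else:  # hexadecimal code point
            current += chr(int(token, 16))
    if current:
        clusters.append(current)
    return "".join(clusters), clusters


def failing_lines(segment, lines):
    """Yield (text, expected, actual) wherever `segment` disagrees."""
    for line in lines:
        parsed = parse_break_test_line(line)
        if parsed is None:
            continue
        text, expected = parsed
        actual = list(segment(text))
        if actual != expected:
            yield text, expected, actual


# A line written in the file's format (the comment part is simplified here).
sample = "÷ 0020 × 0308 ÷ 0020 ÷\t# SPACE, COMBINING DIAERESIS, SPACE"
print(parse_break_test_line(sample))

# Splitting per code point (plain `list`) breaks the first cluster apart.
print(list(failing_lines(list, [sample])))
```

With the package installed, passing `EGC` as `segment` and the lines of the downloaded test file as `lines` should, given the claim above, yield no failures.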

### Installation
The easiest way to install the package is with pip:
```shell
pip install pyuegc
```

### UCD version
To get the version of the Unicode character database currently used:
```python
>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'15.1.0'
```
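
The Python interpreter ships its own copy of the Unicode character database, exposed as `unicodedata.unidata_version`, and it may lag behind or run ahead of the one bundled here. The sketch below compares the two, assuming `UCD_VERSION` is a dotted string as shown above; the value is hard-coded so the snippet runs without the package, and `version_tuple` is a hypothetical helper.

```python
import unicodedata


def version_tuple(version):
    """Turn a dotted version string such as '15.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Assumption: this mirrors the value of pyuegc.UCD_VERSION shown above.
package_ucd = "15.1.0"
interpreter_ucd = unicodedata.unidata_version

# Compare major and minor only; the update number rarely matters here.
if version_tuple(package_ucd)[:2] == version_tuple(interpreter_ucd)[:2]:
    print("Same Unicode version:", package_ucd)
else:
    print(f"Package uses {package_ucd}, interpreter uses {interpreter_ucd}")
```

A mismatch is not an error, but it is worth knowing about when segmentation results are combined with `unicodedata` lookups.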

### Example usage
```python
from pyuegc import EGC


def _output(unistr, egc):
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""

unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Python
# Length of string: 6
# EGC: ['P', 'y', 't', 'h', 'o', 'n']
# Length of EGC: 6

unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: élève
# Length of string: 7
# EGC: ['é', 'l', 'è', 'v', 'e']
# Length of EGC: 5

unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒
# Length of string: 20
# EGC: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# Length of EGC: 5

unistr = "기운찰만하다"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: 기운찰만하다
# Length of string: 15
# EGC: ['기', '운', '찰', '만', '하', '다']
# Length of EGC: 6

unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: পৌষসংক্রান্তির
# Length of string: 14
# EGC: ['পৌ', 'ষ', 'সং', 'ক্রা', 'ন্তি', 'র']
# Length of EGC: 6


unistr = "ai\u0302ne\u0301e"  # aînée
print(f"# Reversed string:\n#   {''.join(reversed(unistr))}")
print(f"# Reversed EGC:   \n#   {''.join(reversed(EGC(unistr)))}")
# Reversed string:
#   éen̂ia -> wrong (diacritics are messed up)
# Reversed EGC:
#   eénîa -> right (regardless of the Unicode normalization form)
```
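
Truncation is another case where working on clusters rather than code points matters. The following is a minimal sketch, not part of the package: `truncate_clusters` is a hypothetical helper, and the cluster list is copied from the "élève" example above so that the snippet runs without the package installed.

```python
def truncate_clusters(clusters, limit, ellipsis="…"):
    """Keep at most `limit` clusters; append `ellipsis` if anything was cut."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    if len(clusters) <= limit:
        return "".join(clusters)
    return "".join(clusters[:limit]) + ellipsis


# Same clusters as EGC("e\u0301le\u0300ve") in the example above.
clusters = ["e\u0301", "l", "e\u0300", "v", "e"]
text = "".join(clusters)

# Cluster-aware: three user-perceived characters survive intact.
print(truncate_clusters(clusters, 3))

# Code-point slicing: only two perceived characters fit in three code points.
print(text[:3] + "…")
```

In real use, the list would come from `EGC(text)` rather than being written out by hand.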

### Related resources
This implementation is based on the following resources:
- [“Grapheme Clusters”, in the Unicode core specification, version&nbsp;15.1.0](https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G52443)
- [Unicode Standard Annex #29: Unicode Text Segmentation, version&nbsp;43](https://www.unicode.org/reports/tr29/tr29-43.html)

### Licenses
The code is available under the [MIT license](https://github.com/mlodewijck/pyuegc/blob/main/LICENSE).

Usage of Unicode data files is governed by the [UNICODE TERMS OF USE](https://www.unicode.org/copyright.html). Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the [Unicode Data Files and Software License](https://www.unicode.org/license.txt), a copy of which is included as [UNICODE-LICENSE](https://github.com/mlodewijck/pyunormalize/blob/master/UNICODE-LICENSE).

Raw data

{
    "_id": null,
    "home_page": "https://github.com/mlodewijck/pyuegc",
    "name": "pyuegc",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "Unicode,Unicode grapheme clusters,extended grapheme cluster,EGC,grapheme cluster,graphemes,segmentation",
    "author": "Marc Lodewijck",
    "author_email": "mlodewijck@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/05/c2/fa08cb7a50bfee73015d8c286f9c622fc7ecc8e7c5810c6627f3884d0a44/pyuegc-15.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "An implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29",
    "version": "15.1.0",
    "project_urls": {
        "Bug Reports": "https://github.com/mlodewijck/pyuegc/issues",
        "Homepage": "https://github.com/mlodewijck/pyuegc",
        "Source": "https://github.com/mlodewijck/pyuegc/"
    },
    "split_keywords": [
        "unicode",
        "unicode grapheme clusters",
        "extended grapheme cluster",
        "egc",
        "grapheme cluster",
        "graphemes",
        "segmentation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "05c2fa08cb7a50bfee73015d8c286f9c622fc7ecc8e7c5810c6627f3884d0a44",
                "md5": "19c0baaad7bb38523b6deee578f3cd13",
                "sha256": "0786d5ca191997f183ffa7413240cac86100bb746117a1c0502314436db38e0b"
            },
            "downloads": -1,
            "filename": "pyuegc-15.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "19c0baaad7bb38523b6deee578f3cd13",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 81309,
            "upload_time": "2023-11-11T17:46:43",
            "upload_time_iso_8601": "2023-11-11T17:46:43.114745Z",
            "url": "https://files.pythonhosted.org/packages/05/c2/fa08cb7a50bfee73015d8c286f9c622fc7ecc8e7c5810c6627f3884d0a44/pyuegc-15.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-11 17:46:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mlodewijck",
    "github_project": "pyuegc",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pyuegc"
}

Marc Lodewijck