ZiCutter


| Field | Value |
|---|---|
| Name | ZiCutter |
| Version | 0.0.8 |
| Home page | https://github.com/laohur/ZiCutter |
| Summary | ZiCutter: cut character smaller |
| Upload time | 2023-01-10 18:10:33 |
| Docs URL | None |
| Author | laohur |
| Requires Python | >=3.0 |
| License | [Anti-996 License](https://github.com/996icu/996.ICU/blob/master/LICENSE) |
| Keywords | zicutter unicodetokenizer zitokenizer tokenizer unicode laohur |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# ZiCutter

ZiCutter: cut characters into smaller pieces

## use
> pip install ZiCutter

```python
from ZiCutter import ZiCutter

line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"

# build
cutter = ZiCutter(dir="")
cutter.build()

# use: cut each character of the line and print the resulting tokens
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))

```

## background
Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About 2/3 of them are HanZi. To shrink the vocabulary size, ZiCutter cuts each character into smaller pieces.

## vocab
minimum vocabulary:

| part | size |
|---|---|
| az | 26 |
| number | 10 |
| bigram | 1296 |
| index | 26 |
| YuanZi | 2365 |
| total | 3723 |
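
As a quick check, the part sizes listed above do add up to the stated total. The sketch below only re-adds those numbers; the name `vocab_parts` is illustrative and not part of ZiCutter. The observation that 1296 equals 36 × 36 (which would match bigrams over 26 letters plus 10 digits) is an inference from the numbers, not something the source states.

```python
# Part sizes copied from the vocab list above (the dict name is illustrative).
vocab_parts = {
    "az": 26,
    "number": 10,
    "bigram": 1296,
    "index": 26,
    "YuanZi": 2365,
}

# Sum of all parts; compare with the stated total of 3723.
total = sum(vocab_parts.values())
print(total)

# Check whether the bigram count equals (26 letters + 10 digits) squared.
print(vocab_parts["bigram"] == (26 + 10) ** 2)
```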

## cut rare characters by name
name = Unicode name of the character 'x'    
tokens = [name[:2], "#" + name[-1]]    
base: bigrams, [a~z][a~z], [0~9][0~9], #[a~z], #[0~9]    


    '😀' : name is 'GRINNING FACE'
    '😀' -> ["##gr","ce"]
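
The rule above can be sketched with Python's standard `unicodedata` module. This is a literal reading of `tokens = [name[:2], "#" + name[-1]]`, not ZiCutter's own code: the helper `cut_by_name`, the lowercasing, and the fallback for unnamed characters are all assumptions. The example above shows `["##gr","ce"]`, so the package's actual prefix placement and suffix length evidently differ from this literal reading.

```python
import unicodedata


def cut_by_name(ch):
    """Hypothetical helper: literal reading of [name[:2], "#" + name[-1]]."""
    # unicodedata.name returns the given default ("") for unnamed code points.
    name = unicodedata.name(ch, "").lower()  # lowercasing is an assumption
    if not name:
        # Assumption: fall back to the character itself when it has no name.
        return [ch]
    return [name[:2], "#" + name[-1]]


print(unicodedata.name("😀"))
print(cut_by_name("😀"))
```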


## cut IDS for HanZi
IDS = Ideographic Description Sequence    
base: YuanZi (minimum)

    熇	⿰火高    
    '熇' -> ['⿰','火','高']    
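
A minimal sketch of the lookup shown above, assuming the cutter holds a table that maps each HanZi to its IDS string. The table `IDS_TABLE` and the function `cut_ids` are hypothetical names with a single entry taken from the example; they are not ZiCutter's data or API, and whether the package decomposes components further is not covered here.

```python
# Hypothetical one-entry table: HanZi -> IDS string (taken from the example above).
IDS_TABLE = {"熇": "⿰火高"}


def cut_ids(ch):
    """Split a HanZi into the symbols of its IDS (one level only)."""
    ids = IDS_TABLE.get(ch)
    if ids is None:
        # Assumption: characters missing from the table are returned unchanged.
        return [ch]
    # Each code point of the IDS string becomes one token.
    return list(ids)


print(cut_ids("熇"))
```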





            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/laohur/ZiCutter",
    "name": "ZiCutter",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.0",
    "maintainer_email": "",
    "keywords": "ZiCutter,UnicodeTokenizer,ZiTokenizer,Tokenizer,Unicode,laohur",
    "author": "laohur",
    "author_email": "laohur@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/4e/bf/266a317ba4c6477050ce6b0d438fbb8f66c2e45dd74ee4b42994529292e0/ZiCutter-0.0.8.tar.gz",
    "platform": null,
    "description": "# ZiCutter\n\nZiCutter: cut character smaller\n\n## use\n> pip install ZiCutter\n\n```python\nfrom ZiCutter import ZiCutter\n\nline = \"\uf87f'\u3007\u33a1[\u0e04\u0e38\u0e13\u0e08\u0e30\u0e08\u0e31\u0e14\u0e1e\u0e34\u0e18\u0e35\u0e41\u0e15\u0e48\u0e07\u0e07\u0e32\u0e19\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e44\u0e23\u0e04\u0e30\u0e31\u0e35\u0e34\u0e4c\u0e37\u0e47\u0e4d\u0e36]\u2167pays-g[ran]d-blanc-\u00e9lev\u00e9 \u00bb (\u767d\u9ad8\u5927\u590f\u570b)\ud83d\ude00\u7187'\"\n\n# build\ncutter = ZiCutter(dir=\"\")\ncutter.build()\n\n# use\ncutter = ZiCutter(dir=\"\")\nfor c in line:\n    print(cutter.cutChar(c))\n\n```\n\n## background\nUnicode 14.0 adds 838 characters, for a total of 144,697 characters. (https://www.unicode.org/versions/Unicode14.0.0/) About 2/3 of them are HanZi. To shrink vocab size, we cut character to smaller.\n\n## vocab\nminium \naz 26 \nnumber 10\nbigram 1296\nindex 26\nYuanZi 2365\ntotal 3723\n\n## cut name rare character\nname = name of 'x'    \ntokens=[name[:2],\"#\"+name[-1]]    \nbase: bigrams, [a~z][a~z],[0~9][0~9],#[a~z],#[0~9]    \n\n\n    '\ud83d\ude00' : name is 'GRINNING FACE'\n    '\ud83d\ude00' -> [\"##gr\",\"ce\"]\n\n\n## cut ids for HanZi\nbase: YuanZi (minium)\n\n    \u7187\t\u2ff0\u706b\u9ad8    \n    '\u7187' -> ['\u2ff0','\u706b','\u9ad8']    \n\n\n\n\n",
    "bugtrack_url": null,
    "license": "[Anti-996 License](https: // github.com/996icu/996.ICU/blob/master/LICENSE)",
    "summary": "ZiCutter: cut character smaller",
    "version": "0.0.8",
    "split_keywords": [
        "zicutter",
        "unicodetokenizer",
        "zitokenizer",
        "tokenizer",
        "unicode",
        "laohur"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8c23298c9b07743c6fc649c53cb9ed985cb3ef3ce2a94702f588c681e9d1ed62",
                "md5": "b3a173d04c3b7de69f5365558a57ad07",
                "sha256": "4ddc54a929d9573c6359a176432a0eeef15f271d8a15f752cd520be562a1bad7"
            },
            "downloads": -1,
            "filename": "ZiCutter-0.0.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b3a173d04c3b7de69f5365558a57ad07",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.0",
            "size": 1421389,
            "upload_time": "2023-01-10T18:10:31",
            "upload_time_iso_8601": "2023-01-10T18:10:31.858520Z",
            "url": "https://files.pythonhosted.org/packages/8c/23/298c9b07743c6fc649c53cb9ed985cb3ef3ce2a94702f588c681e9d1ed62/ZiCutter-0.0.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4ebf266a317ba4c6477050ce6b0d438fbb8f66c2e45dd74ee4b42994529292e0",
                "md5": "6317c4f64eb3d4b924bdc37c1e6bd4b1",
                "sha256": "6a3fd4bab8d7f236a12d1cea117197a14738703611d73b908be0beb7b747c48b"
            },
            "downloads": -1,
            "filename": "ZiCutter-0.0.8.tar.gz",
            "has_sig": false,
            "md5_digest": "6317c4f64eb3d4b924bdc37c1e6bd4b1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.0",
            "size": 1405861,
            "upload_time": "2023-01-10T18:10:33",
            "upload_time_iso_8601": "2023-01-10T18:10:33.932680Z",
            "url": "https://files.pythonhosted.org/packages/4e/bf/266a317ba4c6477050ce6b0d438fbb8f66c2e45dd74ee4b42994529292e0/ZiCutter-0.0.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-10 18:10:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "laohur",
    "github_project": "ZiCutter",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "zicutter"
}
        