# ZiCutter
ZiCutter: cut characters into smaller pieces
## use
> pip install ZiCutter
```python
from ZiCutter import ZiCutter
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"
# build
cutter = ZiCutter(dir="")
cutter.build()
# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))
```
## background
Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About 2/3 of them are HanZi. To shrink the vocab size, ZiCutter cuts each character into smaller pieces.
## vocab
minimum
az 26
number 10
bigram 1296
index 26
YuanZi 2365
total 3723
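
The counts above can be checked with a little arithmetic. The sketch below assumes that the 1296 bigrams are all ordered pairs over the 36 symbols a-z and 0-9 (36 × 36 = 1296); that reading is inferred from the numbers and is not stated in this README. The `index` and `YuanZi` counts are copied from the list as given, and all variable names are illustrative.

```python
import string

# Counts taken from the vocab list; variable names are illustrative only.
letters = len(string.ascii_lowercase)  # az: 26
digits = len(string.digits)            # number: 10

# Assumption: a bigram is any ordered pair drawn from a-z and 0-9.
bigrams = (letters + digits) ** 2      # 36 * 36

index = 26     # copied from the list as given
yuanzi = 2365  # copied from the list as given

total = letters + digits + bigrams + index + yuanzi
print(bigrams, total)
```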
## cut rare characters by name
name = Unicode name of the character 'x'
tokens=[name[:2],"#"+name[-1]]
base: bigrams, [a~z][a~z],[0~9][0~9],#[a~z],#[0~9]
'😀' : name is 'GRINNING FACE'
'😀' -> ["##gr","ce"]
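
A minimal sketch of the formula `tokens=[name[:2],"#"+name[-1]]`, using only the standard `unicodedata` module. The helper name `cut_by_name` is hypothetical and is not part of ZiCutter. Lowercasing is an assumption based on the `[a~z]` base. The sketch follows the formula as written, so its tokens are shaped differently from the `"##gr","ce"` example; it does not claim to reproduce the library's actual output.

```python
import unicodedata


def cut_by_name(char):
    """Hypothetical helper: apply tokens=[name[:2], "#"+name[-1]] to one character."""
    # unicodedata.name raises ValueError for unnamed characters unless a
    # default is given, so fall back to "" and return no tokens in that case.
    name = unicodedata.name(char, "").lower()  # assumption: lowercase to match [a~z]
    if not name:
        return []
    # First two letters of the name, then "#" plus the last letter.
    return [name[:2], "#" + name[-1]]


print(cut_by_name("😀"))  # name is 'GRINNING FACE'
```

Names whose first two or last characters are spaces or hyphens would fall outside the listed base; the sketch does not handle that case.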
## cut HanZi by IDS
base: YuanZi (minimum)
熇 ⿰火高
'熇' -> ['⿰','火','高']
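
The split above can be sketched as a table lookup followed by a character-by-character split. The one-entry `IDS_TABLE` and the helper `cut_ids` are hypothetical stand-ins for illustration. How ZiCutter stores its IDS data and whether it expands components further down to YuanZi is not shown here.

```python
# Hypothetical one-entry IDS table; a real table would cover every HanZi.
IDS_TABLE = {"熇": "⿰火高"}


def cut_ids(char, table=IDS_TABLE):
    """Hypothetical helper: split a character into the symbols of its IDS string."""
    ids = table.get(char)
    if ids is None:
        # Character not in the table: return it unchanged.
        return [char]
    # Each symbol in the IDS string becomes one token.
    return list(ids)


print(cut_ids("熇"))
```

This performs a single-level split only; it does not recurse into the components.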