# ZiTokenizer
Tokenize text in any language into Zi.
Supports 300+ languages from Wikipedia, including a "global" vocab.
## use
* Install: `pip install ZiTokenizer`
```python
from ZiTokenizer.ZiTokenizer import ZiTokenizer
# use
tokenizer = ZiTokenizer()
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'\x0000𧭏20222019\U0010ffff"
indexs = tokenizer.encode(line)        # text -> token indexes
tokens = tokenizer.decode(indexs)      # token indexes -> tokens
line2 = tokenizer.tokens2line(tokens)  # tokens -> text

# build: see demo/unit.py
```
## UnicodeTokenizer
Basic tokenizer.
## ZiCutter
Splits Chinese characters (Hanzi) into their components.
> '瞼' -> ['⿰', '目', '僉']
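
The decomposition above can be pictured as a table lookup. This is only a minimal sketch, not ZiCutter's actual implementation: `IDS_TABLE` and `cut_zi` are hypothetical names, and the table holds only the one entry shown in this README. The leading `⿰` is the Unicode ideographic description character for a left-to-right layout.

```python
# Hypothetical lookup table for illustration; ZiCutter's real data is not shown here.
IDS_TABLE = {
    "瞼": ["⿰", "目", "僉"],
}


def cut_zi(char, table=IDS_TABLE):
    """Return the components of `char`, or [char] if the table has no entry."""
    return table.get(char, [char])


print(cut_zi("瞼"))
print(cut_zi("a"))
```

Characters without an entry are returned unchanged in this sketch, which is an assumption rather than documented ZiCutter behavior.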
## ZiSegmenter
Splits a word into prefix + root + suffix pieces:
word => prefix + root + suffix
> 'modernbritishdo' -> ['mod--', 'er--', 'n--', 'british', '--do']
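
One way to read the example above: find a root inside the word, then cover the text before it with prefix pieces (marked with a trailing `--`) and the text after it with suffix pieces (marked with a leading `--`). The following is a rough sketch under that assumption, not ZiSegmenter's actual algorithm. The functions `longest_match` and `segment` and the tiny vocab sets are hypothetical and chosen only to reproduce the README example.

```python
def longest_match(text, vocab):
    """Greedily split `text` into the longest pieces found in `vocab`.

    Returns a list of pieces, or None if some part of `text` cannot be covered.
    """
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            return None
    return pieces


def segment(word, roots, prefixes, suffixes):
    """Split `word` into marked prefixes, one root, and marked suffixes."""
    # Prefer longer roots; fall back to the whole word if nothing fits.
    candidates = sorted((r for r in roots if r in word), key=len, reverse=True)
    for root in candidates:
        start = word.find(root)
        head = longest_match(word[:start], prefixes)
        tail = longest_match(word[start + len(root):], suffixes)
        if head is not None and tail is not None:
            return [p + "--" for p in head] + [root] + ["--" + s for s in tail]
    return [word]


# Hypothetical vocab, just large enough for the README example.
roots = {"british"}
prefixes = {"mod", "er", "n"}
suffixes = {"do"}
print(segment("modernbritishdo", roots, prefixes, suffixes))
```

With these sets the call returns the same pieces as the example above. A word that contains no known root is returned whole.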
## languages
The default is the "global" vocab. Vocabs for other languages are available from https://laohur.github.io/ZiTokenizer/index.html
> tokenizer = ZiTokenizer(vocab_dir)
Raw data
{
"_id": null,
"home_page": "https://github.com/laohur/ZiCutter",
"name": "ZiTokenizer",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.0",
"maintainer_email": "",
"keywords": "UnicodeTokenizer,ZiCutter,ZiTokenizer,Tokenizer,Unicode,laohur",
"author": "laohur",
"author_email": "laohur@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/48/f4/715a0232d01d4dddbe69da973313fcd34b71be165eee14624fb69c62bb2d/ZiTokenizer-0.0.8.tar.gz",
"platform": null,
"description": "# ZiTokenizer\r\n\r\nTokineze all languages text into Zi.\r\n\r\nsupport 300+ languages from wikipedia, including global \r\n\r\n\r\n## use\r\n* pip install ZiTokenizer\r\n\r\n```python\r\nfrom ZiTokenizer.ZiTokenizer import ZiTokenizer\r\n\r\n# use\r\ntokenizer = ZiTokenizer() \r\nline = \"\uf87f'\u3007\u33a1[\u0e04\u0e38\u0e13\u0e08\u0e30\u0e08\u0e31\u0e14\u0e1e\u0e34\u0e18\u0e35\u0e41\u0e15\u0e48\u0e07\u0e07\u0e32\u0e19\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e44\u0e23\u0e04\u0e30\u0e31\u0e35\u0e34\u0e4c\u0e37\u0e47\u0e4d\u0e36]\u2167pays-g[ran]d-blanc-\u00e9lev\u00e9 \u00bb (\u767d\u9ad8\u5927\u590f\u570b)\ud83d\ude00\u7187'\\x0000\ud85e\udf4f2022\uff12\uff10\uff11\uff19\\U0010ffff\"\r\nindexs = tokenizer.encode(line)\r\ntokens = tokenizer.decode(indexs)\r\nline2=tokenizer.tokens2line(tokens)\r\n\r\n# build\r\ndemo/unit.py\r\n```\r\n\r\n## UnicodeTokenizer\r\nbasic tokeinzer\r\n\r\n## ZiCutter\r\n\u6c49\u5b57\u62c6\u5b57\r\n> '\u77bc' -> ['\u2ff0', '\u76ee', '\u50c9']\r\n\r\n## ZiSegmenter\r\nword => prefix + root + suffix\r\n> 'modernbritishdo' -> ['mod--', 'er--', 'n--', 'british', '--do']\r\n\r\n## languages\r\ndefault using \"golabl\" vocob, others from https://laohur.github.io/ZiTokenizer/index.html\r\n> tokenizer = ZiTokenizer(vocab_dir) \r\n",
"bugtrack_url": null,
"license": "[Anti-996 License](https: // github.com/996icu/996.ICU/blob/master/LICENSE)",
"summary": "ZiTokenizer: tokenize world text as Zi",
"version": "0.0.8",
"split_keywords": [
"unicodetokenizer",
"zicutter",
"zitokenizer",
"tokenizer",
"unicode",
"laohur"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e36d3ac57c119e6e5a1927be0680746dfdc222be528962859ac0d9108ac58331",
"md5": "ada3fca6fb3ac8f52e31123edc54bc7e",
"sha256": "a10b259a4681cc961caf67d209ea8c7038db59f4db9872a904acc10de9d15786"
},
"downloads": -1,
"filename": "ZiTokenizer-0.0.8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ada3fca6fb3ac8f52e31123edc54bc7e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.0",
"size": 15207,
"upload_time": "2023-04-20T17:50:06",
"upload_time_iso_8601": "2023-04-20T17:50:06.991864Z",
"url": "https://files.pythonhosted.org/packages/e3/6d/3ac57c119e6e5a1927be0680746dfdc222be528962859ac0d9108ac58331/ZiTokenizer-0.0.8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "48f4715a0232d01d4dddbe69da973313fcd34b71be165eee14624fb69c62bb2d",
"md5": "d6e44056b89fae47fdef28e70bfc2258",
"sha256": "801983c24bc1c860b0c7a7337e958a93fa0867a5a16ada7176ddc19e5915e9df"
},
"downloads": -1,
"filename": "ZiTokenizer-0.0.8.tar.gz",
"has_sig": false,
"md5_digest": "d6e44056b89fae47fdef28e70bfc2258",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.0",
"size": 14826,
"upload_time": "2023-04-20T17:50:08",
"upload_time_iso_8601": "2023-04-20T17:50:08.882967Z",
"url": "https://files.pythonhosted.org/packages/48/f4/715a0232d01d4dddbe69da973313fcd34b71be165eee14624fb69c62bb2d/ZiTokenizer-0.0.8.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-04-20 17:50:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "laohur",
"github_project": "ZiCutter",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "zitokenizer"
}