# UnicodeTokenizer
UnicodeTokenizer: tokenize text in any Unicode script
## Tokenization Rules
* break on line boundaries
* isolate punctuation
* split at Unicode script boundaries
* `Split(" ?[^(\\s|[.,!?…。,、।۔،])]+")` (see the sketch after this list)
* apply word-break rules
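
The `Split` pattern above is easiest to understand in isolation. As written it does not compile under Python's `re` (the `)` inside the character class spills out as an unbalanced parenthesis), but its evident intent is: keep runs of characters that are neither whitespace nor the listed multi-script sentence punctuation, optionally attaching one leading space. Below is a minimal sketch of just that one rule, with the class simplified to valid Python syntax; line breaking, script boundaries, and word breaking are omitted, so this is not the library's full pipeline.

```python
import re

# Simplified rendering of the Split rule: match runs of characters that
# are neither whitespace nor the listed sentence punctuation (Latin,
# CJK, Devanagari, Urdu, Arabic); one leading space may attach to a run.
SPLIT = re.compile(r" ?[^\s.,!?…。，、।۔،]+")

print(SPLIT.findall("首先8.88设置 st。art_new_word=True"))
# -> ['首先8', '88设置', ' st', 'art_new_word=True']
```

Note how the ASCII period and the CJK full stop both act as separators, so `8.88` and `st。art` are each split apart.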
## Usage
> pip install UnicodeTokenizer
```python
from UnicodeTokenizer import UnicodeTokenizer

tokenizer = UnicodeTokenizer()

# Deliberately messy multilingual sample: CJK, Thai, Korean, Yi, Arabic,
# Tangut, plus special tokens such as [MASK] and [PAD].
line = """
首先8.88设置 st。art_new_word=True 和 output=[açaí],output 就是最终 no such name"
的输出คุณจะจัดพิธีแต่งงานเมื่อไรคะ탑승 수속해야pneumonoultramicroscopicsilicovolcanoconiosis"
하는데 카운터가 어디에 있어요ꆃꎭꆈꌠꊨꏦꏲꅉꆅꉚꅉꋍꂷꂶꌠلأحياء تمارين تتطلب من [MASK] [PAD] [CLS][SEP]
est 𗴂𗹭𘜶𗴲𗂧, ou "phiow-bjij-lhjij-lhjij", ce que l'on peut traduire par « pays-grand-blanc-élevé » (白高大夏國).
""".strip()

print(tokenizer.tokenize(line))      # token list for the whole string
print(tokenizer.split_lines(line))
```
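
Here `tokenize` returns the flat token list for the whole string, while `split_lines`, judging by its name, performs only the line-splitting stage before tokenization.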
Or install from source:
```bash
git clone https://github.com/laohur/UnicodeTokenizer
cd UnicodeTokenizer  # make local modifications here if desired
pip install -e .
```
## References
* [PyICU](https://gitlab.pyicu.org/main/pyicu)
* [tokenizers](https://github.com/huggingface/tokenizers)
* [ICU-tokenizer](https://github.com/mingruimingrui/ICU-tokenizer/tree/master)
## License
[Anti-996 License](https://github.com/996icu/996.ICU/blob/master/LICENSE)