# UnicodeTokenizer

UnicodeTokenizer: tokenize all Unicode text

- Version: 0.2.1
- Homepage: https://github.com/laohur/UnicodeTokenizer
- Author: laohur
- Requires Python: >=3.0
- Keywords: UnicodeTokenizer, Tokenizer, Unicode, ZiTokenizer, ZiCutter, laohur
- License: [Anti-996 License](https://github.com/996icu/996.ICU/blob/master/LICENSE)
- Released: 2023-09-20

## Tokenization Rules

Tokens are produced by the following cascade of breaks (a toy sketch of the first three rules follows this list):

* break at line boundaries
* break on punctuation
* break at Unicode script boundaries
* split with the pattern `" ?[^(\\s|[.,!?…。,、।۔،])]+"`
* break words
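The actual implementation lives in the repository above; what follows is a minimal, hypothetical sketch of how the first three rules interact, assuming a coarse script lookup via `unicodedata.name()` (a real implementation would use the Unicode Script property, e.g. via PyICU). The names `rough_script` and `toy_tokenize` are illustrative, not part of the package.

```python
# NOT the package's code: a toy illustration of breaking on whitespace,
# punctuation, and (approximate) Unicode script changes.
import unicodedata

def rough_script(ch):
    # Approximate a character's script by the first word of its Unicode
    # name, e.g. "CJK", "LATIN", "THAI", "HANGUL", "DIGIT".
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:          # unnamed characters (controls, some PUA)
        return "UNKNOWN"

def toy_tokenize(text):
    tokens, current, prev = [], "", None
    for ch in text:
        if ch.isspace():                    # rule 1: line/space boundaries
            if current:
                tokens.append(current)
            current, prev = "", None
        elif unicodedata.category(ch).startswith("P"):  # rule 2: punctuation
            if current:
                tokens.append(current)
            tokens.append(ch)               # punctuation is its own token
            current, prev = "", None
        else:
            script = rough_script(ch)
            if prev is not None and script != prev:     # rule 3: script change
                tokens.append(current)
                current = ""
            current += ch
            prev = script
    if current:
        tokens.append(current)
    return tokens

print(toy_tokenize("首先8.88设置 st。art"))
# -> ['首先', '8', '.', '88', '设置', 'st', '。', 'art']
```

The real tokenizer layers the remaining rules (the split pattern and word-level breaks) on top of this cascade.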

## Usage
> pip install UnicodeTokenizer

```python
from UnicodeTokenizer import UnicodeTokenizer

tokenizer = UnicodeTokenizer()

# A stress-test line mixing Chinese, Thai, Korean, Yi, Arabic, Tangut,
# French, special tokens ([MASK], [PAD], ...) and stray symbols:
line = """ 
        首先8.88设置 st。art_new_word=True 和 output=[açaí],output 就是最终‘ no such name"
        的输出คุณจะจัดพิธีแต่งงานเมื่อไรคะ탑승 수속해야pneumonoultramicroscopicsilicovolcanoconiosis"
        하는데 카운터가 어디에 있어요ꆃꎭꆈꌠꊨꏦꏲꅉꆅꉚꅉꋍꂷꂶꌠلأحياء تمارين تتطلب من [MASK] [PAD] [CLS][SEP]
        est 𗴂𗹭𘜶𗴲𗂧, ou "phiow-bjij-lhjij-lhjij", ce que l'on peut traduire par « pays-grand-blanc-élevé » (白高大夏國). 
    """.strip()
print(tokenizer.tokenize(line))
print(tokenizer.split_lines(line))

```
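To run the tokenizer over a whole file, the same `tokenize` call applies line by line. A hypothetical driver (the file name `corpus.txt` is illustrative):

```python
# Hypothetical batch driver; uses only the tokenize() method shown above.
from UnicodeTokenizer import UnicodeTokenizer

tokenizer = UnicodeTokenizer()
with open("corpus.txt", encoding="utf-8") as f:   # illustrative input file
    for line in f:
        tokens = tokenizer.tokenize(line.rstrip("\n"))
        print(" ".join(tokens))   # one space-joined token line per input
```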
Or, to install from source:
```bash
git clone https://github.com/laohur/UnicodeTokenizer
cd UnicodeTokenizer  # optionally modify the source
pip install -e .
```


## References

* PyICU: https://gitlab.pyicu.org/main/pyicu
* tokenizers: https://github.com/huggingface/tokenizers
* ICU-tokenizer: https://github.com/mingruimingrui/ICU-tokenizer/tree/master


## License
[Anti-996 License](https://github.com/996icu/996.ICU/blob/master/LICENSE)

            
