ZiTokenizer


Name: ZiTokenizer
Version: 0.0.8
Home page: https://github.com/laohur/ZiCutter
Summary: ZiTokenizer: tokenize world text as Zi
Upload time: 2023-04-20 17:50:08
Author: laohur
Requires Python: >=3.0
License: [Anti-996 License](https://github.com/996icu/996.ICU/blob/master/LICENSE)
Keywords: UnicodeTokenizer, ZiCutter, ZiTokenizer, Tokenizer, Unicode, laohur
Requirements: none recorded
# ZiTokenizer

Tokenize text in any language into Zi.

Supports 300+ languages from Wikipedia, including a "global" vocabulary that covers them all.


## Use
* pip install ZiTokenizer

```python
from ZiTokenizer.ZiTokenizer import ZiTokenizer

# use the default ("global") vocabulary
tokenizer = ZiTokenizer()
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'\x0000𧭏20222019\U0010ffff"
ids = tokenizer.encode(line)           # text -> token ids
tokens = tokenizer.decode(ids)         # ids -> tokens
line2 = tokenizer.tokens2line(tokens)  # tokens -> text

# to build a vocabulary, see demo/unit.py
```

## UnicodeTokenizer
The basic tokenizer.
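As a rough illustration of category-based splitting (an assumption about how the basic tokenizer behaves, not its actual implementation), a line can be broken wherever the top-level Unicode category changes:

```python
import unicodedata

def rough_split(text):
    # split on whitespace and on changes of top-level Unicode category
    # (L = letter, N = number, P = punctuation, ...)
    out, cur, prev = [], "", None
    for ch in text:
        cat = unicodedata.category(ch)[0]
        if cat == "Z":  # whitespace ends the current run
            if cur:
                out.append(cur)
            cur, prev = "", None
            continue
        if prev is not None and cat != prev:
            out.append(cur)
            cur = ""
        cur += ch
        prev = cat
    if cur:
        out.append(cur)
    return out

print(rough_split("pays 白2022!"))  # -> ['pays', '白', '2022', '!']
```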

## ZiCutter
Han character decomposition (汉字拆字):
> '瞼' -> ['⿰', '目', '僉']
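The lookup can be sketched with a hand-written one-entry table (the real ZiCutter ships full decomposition data; `IDS` and `cut` here are illustrative names, not the library's API):

```python
# Hypothetical one-entry decomposition table; ZiCutter bundles the full data.
IDS = {
    "瞼": ["⿰", "目", "僉"],  # ⿰ marks a left-right composition
}

def cut(char):
    # characters without a known decomposition stay atomic
    return IDS.get(char, [char])

print(cut("瞼"))  # -> ['⿰', '目', '僉']
```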

## ZiSegmenter
Splits a word into prefix + root + suffix:
> 'modernbritishdo' -> ['mod--', 'er--', 'n--', 'british', '--do']
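A minimal sketch of the idea, with a toy vocabulary chosen to reproduce the example above (`segment`, `chop`, and the vocab sets are illustrative, not the library's API):

```python
def chop(s, pieces):
    """Greedy longest-match split of s into known pieces."""
    out = []
    while s:
        for j in range(len(s), 0, -1):
            if s[:j] in pieces:
                out.append(s[:j])
                s = s[j:]
                break
        else:
            out.append(s)  # unknown remainder kept whole
            break
    return out

def segment(word, roots, prefixes, suffixes):
    # take the longest known root; the rest becomes prefix/suffix pieces,
    # marked "xx--" (prefix) and "--xx" (suffix) as in the README example
    for n in range(len(word), 0, -1):
        for i in range(len(word) - n + 1):
            root = word[i:i + n]
            if root in roots:
                pre = [p + "--" for p in chop(word[:i], prefixes)]
                suf = ["--" + s for s in chop(word[i + n:], suffixes)]
                return pre + [root] + suf
    return [word]

# toy vocabulary reproducing the README example
roots = {"british"}
prefixes = {"mod", "er", "n"}
suffixes = {"do"}
print(segment("modernbritishdo", roots, prefixes, suffixes))
# -> ['mod--', 'er--', 'n--', 'british', '--do']
```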

## Languages
By default the "global" vocab is used; per-language vocabs are available at https://laohur.github.io/ZiTokenizer/index.html
> tokenizer = ZiTokenizer(vocab_dir)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/laohur/ZiCutter",
    "name": "ZiTokenizer",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.0",
    "maintainer_email": "",
    "keywords": "UnicodeTokenizer,ZiCutter,ZiTokenizer,Tokenizer,Unicode,laohur",
    "author": "laohur",
    "author_email": "laohur@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/48/f4/715a0232d01d4dddbe69da973313fcd34b71be165eee14624fb69c62bb2d/ZiTokenizer-0.0.8.tar.gz",
    "platform": null,
    "description": "# ZiTokenizer\r\n\r\nTokineze all languages text into Zi.\r\n\r\nsupport 300+ languages from wikipedia, including global \r\n\r\n\r\n## use\r\n* pip install ZiTokenizer\r\n\r\n```python\r\nfrom ZiTokenizer.ZiTokenizer import ZiTokenizer\r\n\r\n# use\r\ntokenizer = ZiTokenizer()  \r\nline = \"\uf87f'\u3007\u33a1[\u0e04\u0e38\u0e13\u0e08\u0e30\u0e08\u0e31\u0e14\u0e1e\u0e34\u0e18\u0e35\u0e41\u0e15\u0e48\u0e07\u0e07\u0e32\u0e19\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e44\u0e23\u0e04\u0e30\u0e31\u0e35\u0e34\u0e4c\u0e37\u0e47\u0e4d\u0e36]\u2167pays-g[ran]d-blanc-\u00e9lev\u00e9 \u00bb (\u767d\u9ad8\u5927\u590f\u570b)\ud83d\ude00\u7187'\\x0000\ud85e\udf4f2022\uff12\uff10\uff11\uff19\\U0010ffff\"\r\nindexs = tokenizer.encode(line)\r\ntokens = tokenizer.decode(indexs)\r\nline2=tokenizer.tokens2line(tokens)\r\n\r\n# build\r\ndemo/unit.py\r\n```\r\n\r\n## UnicodeTokenizer\r\nbasic tokeinzer\r\n\r\n## ZiCutter\r\n\u6c49\u5b57\u62c6\u5b57\r\n> '\u77bc' -> ['\u2ff0', '\u76ee', '\u50c9']\r\n\r\n## ZiSegmenter\r\nword => prefix + root + suffix\r\n> 'modernbritishdo' -> ['mod--', 'er--', 'n--', 'british', '--do']\r\n\r\n## languages\r\ndefault using \"golabl\" vocob, others from https://laohur.github.io/ZiTokenizer/index.html\r\n> tokenizer = ZiTokenizer(vocab_dir)  \r\n",
    "bugtrack_url": null,
    "license": "[Anti-996 License](https: // github.com/996icu/996.ICU/blob/master/LICENSE)",
    "summary": "ZiTokenizer: tokenize world text as Zi",
    "version": "0.0.8",
    "split_keywords": [
        "unicodetokenizer",
        "zicutter",
        "zitokenizer",
        "tokenizer",
        "unicode",
        "laohur"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e36d3ac57c119e6e5a1927be0680746dfdc222be528962859ac0d9108ac58331",
                "md5": "ada3fca6fb3ac8f52e31123edc54bc7e",
                "sha256": "a10b259a4681cc961caf67d209ea8c7038db59f4db9872a904acc10de9d15786"
            },
            "downloads": -1,
            "filename": "ZiTokenizer-0.0.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ada3fca6fb3ac8f52e31123edc54bc7e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.0",
            "size": 15207,
            "upload_time": "2023-04-20T17:50:06",
            "upload_time_iso_8601": "2023-04-20T17:50:06.991864Z",
            "url": "https://files.pythonhosted.org/packages/e3/6d/3ac57c119e6e5a1927be0680746dfdc222be528962859ac0d9108ac58331/ZiTokenizer-0.0.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "48f4715a0232d01d4dddbe69da973313fcd34b71be165eee14624fb69c62bb2d",
                "md5": "d6e44056b89fae47fdef28e70bfc2258",
                "sha256": "801983c24bc1c860b0c7a7337e958a93fa0867a5a16ada7176ddc19e5915e9df"
            },
            "downloads": -1,
            "filename": "ZiTokenizer-0.0.8.tar.gz",
            "has_sig": false,
            "md5_digest": "d6e44056b89fae47fdef28e70bfc2258",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.0",
            "size": 14826,
            "upload_time": "2023-04-20T17:50:08",
            "upload_time_iso_8601": "2023-04-20T17:50:08.882967Z",
            "url": "https://files.pythonhosted.org/packages/48/f4/715a0232d01d4dddbe69da973313fcd34b71be165eee14624fb69c62bb2d/ZiTokenizer-0.0.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-20 17:50:08",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "laohur",
    "github_project": "ZiCutter",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "zitokenizer"
}
        