rs-bytepiece


| Field           | Value                                      |
| --------------- | ------------------------------------------ |
| Name            | rs-bytepiece                               |
| Version         | 0.2.2                                      |
| Summary         | bytepiece-rs Python binding                |
| Upload time     | 2023-11-12 08:52:37                        |
| Author          | Yam(长琴) <haoshaochun@gmail.com>          |
| Requires Python | >=3.7                                      |
| License         | MIT                                        |
| Keywords        | nlp, tokenizer, bytepiece, deep learning   |
| Repository      | https://github.com/hscspring/bytepiece-rs  |
| Requirements    | No requirements were recorded.             |

# rs-bytepiece

## Install

```bash
pip install rs_bytepiece
```

## Usage

```python
from rs_bytepiece import Tokenizer

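# Use the default model
tokenizer = Tokenizer()
# Or load a custom model from a path
tokenizer = Tokenizer("/path/to/model")
# Encode text into token ids ("今天天气不错" means "The weather is nice today")
ids = tokenizer.encode("今天天气不错")
# Decode the ids back into text
text = tokenizer.decode(ids)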
```
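
A quick way to sanity-check a tokenizer is to verify that `decode(encode(text))` gives back the original text. The sketch below is not part of `rs_bytepiece`: `round_trips` and `ByteTokenizer` are hypothetical names, and `ByteTokenizer` is only a stand-in (one id per UTF-8 byte) so the example runs without a model. It assumes `encode` returns a list of integer ids, as the snippet above suggests.

```python
from typing import List


class ByteTokenizer:
    """Hypothetical stand-in, NOT rs_bytepiece: one id per UTF-8 byte."""

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: List[int]) -> str:
        return bytes(ids).decode("utf-8")


def round_trips(tokenizer, text: str) -> bool:
    """Return True if decoding the encoded ids reproduces the input text.

    `tokenizer` can be any object with `encode` and `decode` methods.
    """
    return tokenizer.decode(tokenizer.encode(text)) == text


print(round_trips(ByteTokenizer(), "今天天气不错"))
```

If the assumption about `encode` and `decode` holds, passing `Tokenizer()` from `rs_bytepiece` instead of `ByteTokenizer()` should work the same way.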

## Performance

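It is faster than the original implementation: in the table below, the `aho_rs` column is about 5x faster than `aho_cy`, 12 to 13x faster than `aho_py`, and roughly 150x faster than `jieba`. I tested on my M2 (16 GB) with The Complete Works of Lu Xun (《鲁迅全集》), which has 625890 characters. All times are in milliseconds.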

| length | jieba    | aho_py  | aho_cy | aho_rs |
| ------ | -------- | ------- | ------ | ------ |
| 100    | 17062.12 | 1404.37 | 564.31 | 112.94 |
| 1000   | 17104.38 | 1424.6  | 573.32 | 113.18 |
| 10000  | 17432.58 | 1429.0  | 574.93 | 110.03 |
| 100000 | 17228.17 | 1401.01 | 574.5  | 110.44 |
| 625890 | 17305.95 | 1419.79 | 567.78 | 108.54 |
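
To make the relative speed explicit, the ratios can be computed from the table. This is a small sketch using only the last row (the full 625890-character text); `timings_ms` and `speedup_over` are names made up for this illustration.

```python
# Timings in milliseconds, copied from the last row of the table above.
timings_ms = {
    "jieba": 17305.95,
    "aho_py": 1419.79,
    "aho_cy": 567.78,
    "aho_rs": 108.54,
}


def speedup_over(baseline: str, reference: str = "aho_rs") -> float:
    """Return how many times faster `reference` is than `baseline`."""
    return timings_ms[baseline] / timings_ms[reference]


for name in ("jieba", "aho_py", "aho_cy"):
    print(f"aho_rs is {speedup_over(name):.1f}x faster than {name}")
```

The other rows give similar ratios, since the timings barely change with the `length` column.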



            
