simple-sentencepiece


Name: simple-sentencepiece
Version: 0.6
Summary: A simple sentencepiece encoder and decoder without any dependency.
Upload time: 2025-07-22 11:45:09
Requires Python: >=3.8
# simple-sentencepiece
A simple sentencepiece encoder and decoder.

Note: This is not a new sentencepiece toolkit. It takes a model trained with Google's
sentencepiece as input and either encodes strings to ids/pieces or decodes ids back to
strings. The advantage of this tool is that it has no dependencies (no protobuf), which
makes it easier to integrate into a C++ project.
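
To make the idea of encoding and decoding from a plain vocabulary file concrete, here is a minimal, dependency-free sketch. It is not this package's implementation: it assumes the `piece<TAB>score` line layout of a sentencepiece `.vocab` file, uses a made-up toy vocabulary, and picks the highest-scoring segmentation with a small Viterbi search. The names `TOY_VOCAB`, `UNK_SCORE`, `load_vocab`, `encode` and `decode` are hypothetical and exist only for this illustration.

```python
import math

# Made-up pieces and log-probability scores, laid out like a sentencepiece
# .vocab file: one "piece<TAB>score" pair per line, id = line index.
TOY_VOCAB = "\n".join(
    ["<unk>\t0", "▁HELLO\t-2.0", "▁WORLD\t-2.5", "▁HE\t-3.0", "LLO\t-3.5", "▁\t-4.0"]
    + [f"{c}\t-5.0" for c in "HELOWRD"]
)

UNK_ID = 0
UNK_SCORE = -10.0  # assumed penalty for a character missing from the vocab


def load_vocab(text):
    """Parse "piece<TAB>score" lines into parallel lists of pieces and scores."""
    pieces, scores = [], []
    for line in text.splitlines():
        piece, score = line.split("\t")
        pieces.append(piece)
        scores.append(float(score))
    return pieces, scores


def encode(sentence, pieces, scores):
    """Return the highest-scoring segmentation of `sentence` as piece ids."""
    piece_to_id = {p: i for i, p in enumerate(pieces) if i != UNK_ID}
    # Assumption: spaces become "▁" and a leading "▁" is prepended.
    text = "▁" + sentence.replace(" ", "▁")
    max_len = max(len(p) for p in piece_to_id)
    best = [-math.inf] * (len(text) + 1)  # best[i]: best score for text[:i]
    back = [None] * (len(text) + 1)       # back[i]: (start, piece id) of last piece
    best[0] = 0.0
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            sub = text[start:end]
            if sub in piece_to_id:
                pid = piece_to_id[sub]
                score = scores[pid]
            elif end - start == 1:
                pid, score = UNK_ID, UNK_SCORE  # unknown single character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, pid)
    # Walk the back-pointers from the end of the text to recover the ids.
    ids, pos = [], len(text)
    while pos > 0:
        pos, pid = back[pos]
        ids.append(pid)
    return ids[::-1]


def decode(ids, pieces):
    """Join pieces and turn the "▁" markers back into spaces."""
    return "".join(pieces[i] for i in ids).replace("▁", " ").lstrip(" ")


pieces, scores = load_vocab(TOY_VOCAB)
ids = encode("HELLO WORLD", pieces, scores)
print([pieces[i] for i in ids], decode(ids, pieces))
```

With this toy vocabulary the whole-word pieces score better than any combination of shorter pieces, so the sentence is split into two pieces and decodes back to the original string.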


## Installation

```
pip install simple-sentencepiece
```


## Usage

The usage is very similar to sentencepiece: it also provides `encode` and `decode` interfaces.

```python
from ssentencepiece import Ssentencepiece

# bpe.vocab comes from a trained BPE model; see Google's sentencepiece for details
ssp = Ssentencepiece("/path/to/bpe.vocab")

# Alternatively, use one of the default models provided by this package (see below)
# ssp = Ssentencepiece("gigaspeech-500")

# Encode a batch of strings to ids
ids = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"])

# Encode the same batch to string pieces
pieces = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"], out_type=str)

# Decode the ids back to strings
res = ssp.decode(ids)
```

## Default models

| Model Name | Description | Link |
|------------|-------------|------|
| alphabet-32 | `<blk>`, `<unk>`, `<sos>`, `<eos>`, `'`, `▁` and the 26 letters of the alphabet. | [alphabet-32](ssentencepiece/python/ssentencepiece/resources/alphabet-32.vocab) |
| librispeech-500 | 500 unigram pieces trained on LibriSpeech. | [librispeech-500](ssentencepiece/python/ssentencepiece/resources/librispeech-500.vocab) |
| librispeech-5000 | 5000 unigram pieces trained on LibriSpeech. | [librispeech-5000](ssentencepiece/python/ssentencepiece/resources/librispeech-5000.vocab) |
| gigaspeech-500 | 500 unigram pieces trained on GigaSpeech. | [gigaspeech-500](ssentencepiece/python/ssentencepiece/resources/gigaspeech-500.vocab) |
| gigaspeech-2000 | 2000 unigram pieces trained on GigaSpeech. | [gigaspeech-2000](ssentencepiece/python/ssentencepiece/resources/gigaspeech-2000.vocab) |
| gigaspeech-5000 | 5000 unigram pieces trained on GigaSpeech. | [gigaspeech-5000](ssentencepiece/python/ssentencepiece/resources/gigaspeech-5000.vocab) |
| zh-en-3876 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | [zh-en-3876](ssentencepiece/python/ssentencepiece/resources/zh-en-3876.vocab) |
| zh-en-6876 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | [zh-en-6876](ssentencepiece/python/ssentencepiece/resources/zh-en-6876.vocab) |
| zh-en-8481 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | [zh-en-8481](ssentencepiece/python/ssentencepiece/resources/zh-en-8481.vocab) |
| zh-en-5776 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | [zh-en-5776](ssentencepiece/python/ssentencepiece/resources/zh-en-5776.vocab) |
| zh-en-8776 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | [zh-en-8776](ssentencepiece/python/ssentencepiece/resources/zh-en-8776.vocab) |
| zh-en-10381 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | [zh-en-10381](ssentencepiece/python/ssentencepiece/resources/zh-en-10381.vocab) |
| zh-en-yue-9761 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | [zh-en-yue-9761](ssentencepiece/python/ssentencepiece/resources/zh-en-yue-9761.vocab) |
| zh-en-yue-11661 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | [zh-en-yue-11661](ssentencepiece/python/ssentencepiece/resources/zh-en-yue-11661.vocab) |

**Note**: The counts 3500, 6500 and 8105 come from the [通用规范汉字表](http://www.moe.gov.cn/jyb_sjzl/ziliao/A19/201306/t20130601_186002.html) (Table of General Standard Chinese Characters).
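
The numeric suffix of each `zh-en` model name matches the sum of the components listed in its description. The short check below works through that arithmetic using only the counts from the table; the names `MODELS` and `vocab_size` are hypothetical helpers for this illustration and are not part of the package.

```python
# Component counts shared by every zh-en model, copied from the table above.
FALLBACK_BYTES = 256
NUMBERS = 10
PUNCTUATION = 10

# model name -> (Chinese characters, English unigram pieces), from the table.
MODELS = {
    "zh-en-3876": (3500, 100),
    "zh-en-6876": (6500, 100),
    "zh-en-8481": (8105, 100),
    "zh-en-5776": (3500, 2000),
    "zh-en-8776": (6500, 2000),
    "zh-en-10381": (8105, 2000),
    "zh-en-yue-9761": (8105 + 1280, 100),
    "zh-en-yue-11661": (8105 + 1280, 2000),
}


def vocab_size(chinese_chars, english_pieces):
    """Sum the components listed in a model's description."""
    return chinese_chars + FALLBACK_BYTES + NUMBERS + PUNCTUATION + english_pieces


for name, (chars, english) in MODELS.items():
    suffix = int(name.rsplit("-", 1)[1])  # trailing number in the model name
    total = vocab_size(chars, english)
    print(f"{name}: {total} (matches name: {total == suffix})")
```

The same reading holds for `alphabet-32`: four special symbols, `'`, `▁` and 26 letters add up to 32.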


            

Author: Next-gen Kaldi development team <wkang@pku.edu.cn>
Homepage: https://github.com/pkufool/simple-sentencepiece
Bug Tracker: https://github.com/pkufool/simple-sentencepiece/issues