| Field | Value |
|-------|-------|
| Name | simple-sentencepiece |
| Version | 0.6 |
| home_page | None |
| Summary | A simple sentencepiece encoder and decoder without any dependency. |
| upload_time | 2025-07-22 11:45:09 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# simple-sentencepiece
A simple sentencepiece encoder and decoder.
Note: This is not a new sentencepiece toolkit. It just takes a model trained with Google's sentencepiece
as input and encodes strings to ids/pieces or decodes ids back to strings. The advantage of
this tool is that it has no dependencies (no protobuf), so it is easier to
integrate into a C++ project.
## Installation
```
pip install simple-sentencepiece
```
## Usage
The usage is very similar to sentencepiece: it also provides `encode` and `decode` interfaces.
```python
from ssentencepiece import Ssentencepiece
# you can get bpe.vocab from a trained bpe model, see google's sentencepiece for details
ssp = Ssentencepiece("/path/to/bpe.vocab")
# you can also use the default models provided by this package, see below for details
# ssp = Ssentencepiece("gigaspeech-500")
# output ids
ids = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"])
# output string pieces
pieces = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"], out_type=str)
# decode
res = ssp.decode(ids)
```
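To make the relationship between strings, pieces and ids more concrete, here is a self-contained toy sketch of an encode/decode round trip. It is only an illustration under stated assumptions: the vocabulary `TOY_VOCAB` and the helpers `toy_encode`, `toy_ids` and `toy_decode` are hypothetical, and the greedy longest-match strategy is a stand-in, not necessarily the algorithm this package uses internally.

```python
from typing import Dict, List

# Toy illustration only: NOT the algorithm or API of simple-sentencepiece.
SPACE = "\u2581"  # the "▁" word-boundary marker used by sentencepiece vocabularies

# Hypothetical tiny vocabulary mapping piece -> id.
TOY_VOCAB: Dict[str, int] = {
    "<unk>": 0,
    SPACE + "HE": 1,
    "LLO": 2,
    SPACE + "WOR": 3,
    "LD": 4,
}
ID_TO_PIECE: Dict[int, str] = {i: p for p, i in TOY_VOCAB.items()}


def toy_encode(text: str) -> List[str]:
    """Split text into pieces by greedy longest match against TOY_VOCAB."""
    s = SPACE + text.replace(" ", SPACE)
    pieces: List[str] = []
    i = 0
    while i < len(s):
        # Try the longest candidate substring first.
        for j in range(len(s), i, -1):
            if s[i:j] in TOY_VOCAB:
                pieces.append(s[i:j])
                i = j
                break
        else:
            # No piece matches at this position: emit <unk> and skip one character.
            pieces.append("<unk>")
            i += 1
    return pieces


def toy_ids(pieces: List[str]) -> List[int]:
    """Map pieces to their ids."""
    return [TOY_VOCAB[p] for p in pieces]


def toy_decode(ids: List[int]) -> str:
    """Join pieces and turn the boundary marker back into spaces."""
    text = "".join(ID_TO_PIECE[i] for i in ids)
    return text.replace(SPACE, " ").lstrip(" ")


pieces = toy_encode("HELLO WORLD")
ids = toy_ids(pieces)
print(pieces, ids, toy_decode(ids))
```

The point of the sketch is only that pieces carry the `▁` marker in place of spaces, so decoding can restore the original string by concatenating pieces and replacing the marker.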
## Default models
| Model Name | Description | Link |
|------------|-------------|------|
| alphabet-32 | `<blk>`, `<unk>`, `<sos>`, `<eos>`, `'`, `▁` and the 26 letters of the alphabet. | [alphabet-32](ssentencepiece/python/ssentencepiece/resources/alphabet-32.vocab) |
| librispeech-500 | 500 unigram pieces trained on LibriSpeech. | [librispeech-500](ssentencepiece/python/ssentencepiece/resources/librispeech-500.vocab) |
| librispeech-5000 | 5000 unigram pieces trained on LibriSpeech. | [librispeech-5000](ssentencepiece/python/ssentencepiece/resources/librispeech-5000.vocab) |
| gigaspeech-500 | 500 unigram pieces trained on GigaSpeech. | [gigaspeech-500](ssentencepiece/python/ssentencepiece/resources/gigaspeech-500.vocab) |
| gigaspeech-2000 | 2000 unigram pieces trained on GigaSpeech. | [gigaspeech-2000](ssentencepiece/python/ssentencepiece/resources/gigaspeech-2000.vocab) |
| gigaspeech-5000 | 5000 unigram pieces trained on GigaSpeech. | [gigaspeech-5000](ssentencepiece/python/ssentencepiece/resources/gigaspeech-5000.vocab) |
| zh-en-3876 | 3500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | [zh-en-3876](ssentencepiece/python/ssentencepiece/resources/zh-en-3876.vocab) |
| zh-en-6876 | 6500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | [zh-en-6876](ssentencepiece/python/ssentencepiece/resources/zh-en-6876.vocab) |
| zh-en-8481 | 8105 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | [zh-en-8481](ssentencepiece/python/ssentencepiece/resources/zh-en-8481.vocab) |
| zh-en-5776 | 3500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | [zh-en-5776](ssentencepiece/python/ssentencepiece/resources/zh-en-5776.vocab) |
| zh-en-8776 | 6500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | [zh-en-8776](ssentencepiece/python/ssentencepiece/resources/zh-en-8776.vocab) |
| zh-en-10381 | 8105 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | [zh-en-10381](ssentencepiece/python/ssentencepiece/resources/zh-en-10381.vocab) |
| zh-en-yue-9761 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | [zh-en-yue-9761](ssentencepiece/python/ssentencepiece/resources/zh-en-yue-9761.vocab) |
| zh-en-yue-11661 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | [zh-en-yue-11661](ssentencepiece/python/ssentencepiece/resources/zh-en-yue-11661.vocab) |
**Note**: The character counts 3500, 6500 and 8105 come from the Table of General Standard Chinese Characters ([通用规范汉字表](http://www.moe.gov.cn/jyb_sjzl/ziliao/A19/201306/t20130601_186002.html)).
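The number at the end of each `zh-en` model name matches the sum of the components listed in its description, which suggests it is the total vocabulary size. The short check below works through that arithmetic; the `COMPONENTS` dictionary is simply the table transcribed by hand, not something read from the package.

```python
# Components per model, transcribed from the table above:
# [Chinese characters..., fallback bytes, digits, punctuation marks, English pieces]
COMPONENTS = {
    "zh-en-3876": [3500, 256, 10, 10, 100],
    "zh-en-6876": [6500, 256, 10, 10, 100],
    "zh-en-8481": [8105, 256, 10, 10, 100],
    "zh-en-5776": [3500, 256, 10, 10, 2000],
    "zh-en-8776": [6500, 256, 10, 10, 2000],
    "zh-en-10381": [8105, 256, 10, 10, 2000],
    "zh-en-yue-9761": [8105, 1280, 256, 10, 10, 100],
    "zh-en-yue-11661": [8105, 1280, 256, 10, 10, 2000],
}

# True when the trailing number in the model name equals the component sum.
SIZE_MATCHES = {
    name: sum(parts) == int(name.rsplit("-", 1)[1])
    for name, parts in COMPONENTS.items()
}

for name, ok in SIZE_MATCHES.items():
    print(f"{name}: {'matches' if ok else 'MISMATCH'}")
```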
Raw data
{
"_id": null,
"home_page": null,
"name": "simple-sentencepiece",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "Next-gen Kaldi development team <wkang@pku.edu.cn>",
"download_url": "https://files.pythonhosted.org/packages/ef/f7/dd46656f2a31c7edbbcaa77224e5278ee850d53941561bcdeea6f2c6d80a/simple_sentencepiece-0.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A simple sentencepiece encoder and decoder without any dependency.",
"version": "0.6",
"project_urls": {
"Bug Tracker": "https://github.com/pkufool/simple-sentencepiece/issues",
"Homepage": "https://github.com/pkufool/simple-sentencepiece"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "773bcf245c6522738ba87f513a14d31fde9b0cc17d3652337d08b8ea70ba98d2",
"md5": "51c9762366564da54ed9c59b86ce9c2f",
"sha256": "6caff385efd41bb3cee2d49a710e8d2dfd8d89b50002fa6edfe925ddb6490b23"
},
"downloads": -1,
"filename": "simple_sentencepiece-0.6-cp310-cp310-macosx_10_9_x86_64.whl",
"has_sig": false,
"md5_digest": "51c9762366564da54ed9c59b86ce9c2f",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.8",
"size": 469954,
"upload_time": "2025-07-22T11:56:36",
"upload_time_iso_8601": "2025-07-22T11:56:36.097853Z",
"url": "https://files.pythonhosted.org/packages/77/3b/cf245c6522738ba87f513a14d31fde9b0cc17d3652337d08b8ea70ba98d2/simple_sentencepiece-0.6-cp310-cp310-macosx_10_9_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "734300126453be54f10ecae8ca85a4190848ae457fde65365471fbc1894bc306",
"md5": "ef9c7cf3ceb7f177069ab210b9ec0d74",
"sha256": "7584dfec633155f6670cef2cae5b4c38c9b164ad0133971bdac3bc6291c1a1bf"
},
"downloads": -1,
"filename": "simple_sentencepiece-0.6-cp311-cp311-macosx_10_9_x86_64.whl",
"has_sig": false,
"md5_digest": "ef9c7cf3ceb7f177069ab210b9ec0d74",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.8",
"size": 471488,
"upload_time": "2025-07-22T11:56:37",
"upload_time_iso_8601": "2025-07-22T11:56:37.169691Z",
"url": "https://files.pythonhosted.org/packages/73/43/00126453be54f10ecae8ca85a4190848ae457fde65365471fbc1894bc306/simple_sentencepiece-0.6-cp311-cp311-macosx_10_9_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "89700eac3bf4f8b56373cd75d77f28ea1049add9b1458e20be0120c69b8d04e3",
"md5": "bb314f1975c09bfef6aa419413413326",
"sha256": "f8c337e0cb97291512ffdad6ddb1fef2fecd25e0fed7cc38e412ee053f4160b2"
},
"downloads": -1,
"filename": "simple_sentencepiece-0.6-cp312-cp312-macosx_10_13_x86_64.whl",
"has_sig": false,
"md5_digest": "bb314f1975c09bfef6aa419413413326",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.8",
"size": 470109,
"upload_time": "2025-07-22T11:56:38",
"upload_time_iso_8601": "2025-07-22T11:56:38.153523Z",
"url": "https://files.pythonhosted.org/packages/89/70/0eac3bf4f8b56373cd75d77f28ea1049add9b1458e20be0120c69b8d04e3/simple_sentencepiece-0.6-cp312-cp312-macosx_10_13_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3db2efdaa217c90c518044a5a477a80d8595ff3c6e6482c30805a51ff0c41f3f",
"md5": "5f25a0296b14978407a3d09b8a4e8e13",
"sha256": "de33a6f7507c73fff156bf515f95d6c698e30b57c8c1a0d5cc14d93fb3ba3d96"
},
"downloads": -1,
"filename": "simple_sentencepiece-0.6-cp313-cp313-macosx_10_13_x86_64.whl",
"has_sig": false,
"md5_digest": "5f25a0296b14978407a3d09b8a4e8e13",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.8",
"size": 470122,
"upload_time": "2025-07-22T11:56:39",
"upload_time_iso_8601": "2025-07-22T11:56:39.066437Z",
"url": "https://files.pythonhosted.org/packages/3d/b2/efdaa217c90c518044a5a477a80d8595ff3c6e6482c30805a51ff0c41f3f/simple_sentencepiece-0.6-cp313-cp313-macosx_10_13_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "fc865827c509d6c7292769f13a992a6e90d2db0d5c90dce705af2a7283931f8e",
"md5": "53ac74885077a8a0dac0145b319be846",
"sha256": "2af2dd4ebc05cea829bba49e6117a4f96e9a7f77eda7452ee392e6e32007fbfe"
},
"downloads": -1,
"filename": "simple_sentencepiece-0.6-cp38-cp38-macosx_10_9_x86_64.whl",
"has_sig": false,
"md5_digest": "53ac74885077a8a0dac0145b319be846",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 469638,
"upload_time": "2025-07-22T11:56:40",
"upload_time_iso_8601": "2025-07-22T11:56:40.083961Z",
"url": "https://files.pythonhosted.org/packages/fc/86/5827c509d6c7292769f13a992a6e90d2db0d5c90dce705af2a7283931f8e/simple_sentencepiece-0.6-cp38-cp38-macosx_10_9_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "08e4e078f1176d44efd91817ee4ba345d49f3cbb74aa53c482c4d7bfdb1afb26",
"md5": "d91c8ffece3997ddc0528d6c300eb3cc",
"sha256": "907b34063572dfcb37842d20010164742f159bbd15fd87821c4ea567112f6418"
},
"downloads": -1,
"filename": "simple_sentencepiece-0.6-cp39-cp39-macosx_10_9_x86_64.whl",
"has_sig": false,
"md5_digest": "d91c8ffece3997ddc0528d6c300eb3cc",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.8",
"size": 470036,
"upload_time": "2025-07-22T11:56:41",
"upload_time_iso_8601": "2025-07-22T11:56:41.537732Z",
"url": "https://files.pythonhosted.org/packages/08/e4/e078f1176d44efd91817ee4ba345d49f3cbb74aa53c482c4d7bfdb1afb26/simple_sentencepiece-0.6-cp39-cp39-macosx_10_9_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "eff7dd46656f2a31c7edbbcaa77224e5278ee850d53941561bcdeea6f2c6d80a",
"md5": "8e7f7006e7e219c2825bae122b8f81a5",
"sha256": "e184c69bc04c1148bbdb8b787307afc19228aeca3a49cbc7cf561debb2fafb61"
},
"downloads": -1,
"filename": "simple_sentencepiece-0.6.tar.gz",
"has_sig": false,
"md5_digest": "8e7f7006e7e219c2825bae122b8f81a5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 849460,
"upload_time": "2025-07-22T11:45:09",
"upload_time_iso_8601": "2025-07-22T11:45:09.274445Z",
"url": "https://files.pythonhosted.org/packages/ef/f7/dd46656f2a31c7edbbcaa77224e5278ee850d53941561bcdeea6f2c6d80a/simple_sentencepiece-0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-22 11:45:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "pkufool",
"github_project": "simple-sentencepiece",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "simple-sentencepiece"
}
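The `digests` entries above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; the helper name `sha256_matches` and the local file path in the usage comment are hypothetical, while the expected hash is the `sha256` value recorded above for `simple_sentencepiece-0.6.tar.gz`.

```python
import hashlib

# sha256 recorded in the raw data above for simple_sentencepiece-0.6.tar.gz
SDIST_SHA256 = "e184c69bc04c1148bbdb8b787307afc19228aeca3a49cbc7cf561debb2fafb61"


def sha256_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the given SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


# Hypothetical usage, assuming the sdist was downloaded to the current directory:
# print(sha256_matches("simple_sentencepiece-0.6.tar.gz", SDIST_SHA256))
```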