# Burmese Tokenizer
Simple, fast Burmese text tokenization. No fancy stuff; it just gets the job done.
## Install
```bash
pip install burmese-tokenizer
```
## Quick Start
```python
from burmese_tokenizer import BurmeseTokenizer
tokenizer = BurmeseTokenizer()
text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"
# tokenize
tokens = tokenizer.encode(text)
print(tokens["pieces"])
# ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']
# decode back
text = tokenizer.decode(tokens["pieces"])
print(text)
# မင်္ဂလာပါ။ နေကောင်းပါသလား။
```
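`encode` returns a dict; besides `"pieces"`, it presumably also carries the numeric token ids (the `decode_ids` call in the API list below suggests an `"ids"` key, but check your installed version). A minimal sketch under that assumption:

```python
from burmese_tokenizer import BurmeseTokenizer

tokenizer = BurmeseTokenizer()
tokens = tokenizer.encode("မင်္ဂလာပါ။")

# "ids" is an assumed key, inferred from the decode_ids API; verify it exists
ids = tokens.get("ids")
if ids is not None:
    print(ids)                        # e.g. [1234, 56, 7] (illustrative values)
    print(tokenizer.decode_ids(ids))  # should round-trip to the original text
```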
## CLI
```bash
# tokenize
burmese-tokenizer "မင်္ဂလာပါ။"
# show details
burmese-tokenizer -v "မင်္ဂလာပါ။"
# decode tokens
burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
```
## API
- `encode(text)` - tokenize text
- `decode(pieces)` - convert token pieces back to text
- `decode_ids(ids)` - convert token ids back to text
- `get_vocab_size()` - vocabulary size
- `get_vocab()` - full vocabulary
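A hedged sketch tying the vocabulary helpers together (the dict-like, piece-to-id shape of `get_vocab()` is an assumption in the SentencePiece style, not documented behavior; return values shown are illustrative):

```python
from burmese_tokenizer import BurmeseTokenizer

tokenizer = BurmeseTokenizer()

# vocabulary size as an int; e.g. 8000 (illustrative, depends on the model)
print(tokenizer.get_vocab_size())

# assumed: get_vocab() returns a piece -> id mapping
vocab = tokenizer.get_vocab()
piece = "▁ပါ"
if piece in vocab:
    # look up one piece's id, then decode it back to text
    print(tokenizer.decode_ids([vocab[piece]]))  # ပါ
```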
## Links
- [PyPI](https://pypi.org/project/burmese-tokenizer/)
- [Contributing](docs/how_to_contribute.md)
## License
MIT - Do whatever you want with it.