mon-tokenizer


Name: mon-tokenizer
Version: 0.1.3
Home page: None
Summary: A simple tokenizer for Mon text
Upload time: 2025-08-23 15:45:38
Maintainer: None
Docs URL: None
Author: Code-Yay-Mal
Requires Python: >=3.11
License: MIT
Keywords: mon, tokenizer, nlp, myanmar, text-processing
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.
# Mon Tokenizer

Tokenize Mon text like a pro. No fancy stuff, just gets the job done.

## Quick Start

```bash
# Using pip
pip install mon-tokenizer

# Using uv (faster)
uv add mon-tokenizer
```

```python
from mon_tokenizer import MonTokenizer

tokenizer = MonTokenizer()
text = "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Tokenize
result = tokenizer.encode(text)
print(result["pieces"])  # ['▁ဂွံ', 'အခေါင်', 'အရာ', 'မွဲ', 'သ္ဂောံ', 'ဒုင်စသိုင်', 'ကၠာ', 'ကၠာ', 'ရ', '။']

# Decode
decoded = tokenizer.decode(result["pieces"])
print(decoded)  # ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။
```
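
If you need to tokenize more than one sentence, the same two calls cover it. A minimal sketch, using only the `encode`/`decode` behaviour shown above:

```python
from mon_tokenizer import MonTokenizer

tokenizer = MonTokenizer()

sentences = [
    "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။",
    # add more Mon sentences here
]

for sentence in sentences:
    # encode() returns a dict; "pieces" holds the subword tokens
    pieces = tokenizer.encode(sentence)["pieces"]
    print(len(pieces), pieces)

    # decode() reassembles the pieces back into text
    print(tokenizer.decode(pieces))
```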

## CLI

```bash
# Tokenize
mon-tokenizer "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Verbose mode (shows all the details)
mon-tokenizer -v "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Decode tokens back to text
mon-tokenizer -d -t "▁ဂွံ,အခေါင်,အရာ,မွဲ,သ္ဂောံ,ဒုင်စသိုင်,ကၠာ,ကၠာ,ရ,။"
```
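
To drive the CLI from a script instead of a shell, a plain `subprocess` call works. A minimal sketch, assuming `mon-tokenizer` is on your `PATH` and prints its result to stdout:

```python
import subprocess

def tokenize_via_cli(text: str) -> str:
    # Run the CLI exactly as in the shell examples above and capture stdout
    result = subprocess.run(
        ["mon-tokenizer", text],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

print(tokenize_via_cli("ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"))
```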

## API

- `encode(text)` - Chop text into tokens
- `decode(pieces)` - Glue tokens back together
- `decode_ids(ids)` - Convert IDs back to text
- `get_vocab_size()` - How many tokens we know
- `get_vocab()` - The whole vocabulary
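
A short sketch exercising the vocabulary helpers listed above. It assumes `get_vocab()` returns a piece-to-ID mapping and `decode_ids()` accepts a list of integer IDs; neither return type is shown in the quick start, so treat this as illustrative:

```python
from mon_tokenizer import MonTokenizer

tokenizer = MonTokenizer()

# Size of the vocabulary the tokenizer knows
print(tokenizer.get_vocab_size())

# Assumed: get_vocab() maps each piece to its integer ID
vocab = tokenizer.get_vocab()
for piece in list(vocab)[:5]:
    print(piece, vocab[piece])

# Look up IDs for a sentence's pieces, then decode the IDs back to text
pieces = tokenizer.encode("ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။")["pieces"]
ids = [vocab[p] for p in pieces]
print(tokenizer.decode_ids(ids))
```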

## Dev Setup

```bash
git clone git@github.com:janakhpon/mon_tokenizer.git
cd mon_tokenizer
uv sync --dev
uv run pytest

# Release workflow
uv version --bump patch
git add pyproject.toml
git commit -m "v0.1.1"
git tag v0.1.1
git push origin main --tags
```

## License

MIT - do whatever you want with it.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "mon-tokenizer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "mon, tokenizer, nlp, myanmar, text-processing",
    "author": "Code-Yay-Mal",
    "author_email": "Code-Yay-Mal <jnovaxer@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/12/d7/32bb8692578436bb1fd4e0e7c645f3e9f9fb964ccad0b58eb151e1590028/mon_tokenizer-0.1.3.tar.gz",
    "platform": null,
    "description": "# Mon Tokenizer\n\nTokenize Mon text like a pro. No fancy stuff, just gets the job done.\n\n## Quick Start\n\n```bash\n# Using pip\npip install mon-tokenizer\n\n# Using uv (faster)\nuv add mon-tokenizer\n```\n\n```python\nfrom mon_tokenizer import MonTokenizer\n\ntokenizer = MonTokenizer()\ntext = \"\u1002\u103d\u1036\u1021\u1001\u1031\u102b\u1004\u103a\u1021\u101b\u102c\u1019\u103d\u1032\u101e\u1039\u1002\u1031\u102c\u1036\u1012\u102f\u1004\u103a\u1005\u101e\u102d\u102f\u1004\u103a\u1000\u1060\u102c\u1000\u1060\u102c\u101b\u104b\"\n\n# Tokenize\nresult = tokenizer.encode(text)\nprint(result[\"pieces\"])  # ['\u2581\u1002\u103d\u1036', '\u1021\u1001\u1031\u102b\u1004\u103a', '\u1021\u101b\u102c', '\u1019\u103d\u1032', '\u101e\u1039\u1002\u1031\u102c\u1036', '\u1012\u102f\u1004\u103a\u1005\u101e\u102d\u102f\u1004\u103a', '\u1000\u1060\u102c', '\u1000\u1060\u102c', '\u101b', '\u104b']\n\n# Decode\ndecoded = tokenizer.decode(result[\"pieces\"])\nprint(decoded)  # \u1002\u103d\u1036\u1021\u1001\u1031\u102b\u1004\u103a\u1021\u101b\u102c\u1019\u103d\u1032\u101e\u1039\u1002\u1031\u102c\u1036\u1012\u102f\u1004\u103a\u1005\u101e\u102d\u102f\u1004\u103a\u1000\u1060\u102c\u1000\u1060\u102c\u101b\u104b\n```\n\n## CLI\n\n```bash\n# Tokenize\nmon-tokenizer \"\u1002\u103d\u1036\u1021\u1001\u1031\u102b\u1004\u103a\u1021\u101b\u102c\u1019\u103d\u1032\u101e\u1039\u1002\u1031\u102c\u1036\u1012\u102f\u1004\u103a\u1005\u101e\u102d\u102f\u1004\u103a\u1000\u1060\u102c\u1000\u1060\u102c\u101b\u104b\"\n\n# Verbose mode (shows all the details)\nmon-tokenizer -v \"\u1002\u103d\u1036\u1021\u1001\u1031\u102b\u1004\u103a\u1021\u101b\u102c\u1019\u103d\u1032\u101e\u1039\u1002\u1031\u102c\u1036\u1012\u102f\u1004\u103a\u1005\u101e\u102d\u102f\u1004\u103a\u1000\u1060\u102c\u1000\u1060\u102c\u101b\u104b\"\n\n# Decode tokens back to text\nmon-tokenizer -d -t \"\u2581\u1002\u103d\u1036,\u1021\u1001\u1031\u102b\u1004\u103a,\u1021\u101b\u102c,\u1019\u103d\u1032,\u101e\u1039\u1002\u1031\u102c\u1036,\u1012\u102f\u1004\u103a\u1005\u101e\u102d\u102f\u1004\u103a,\u1000\u1060\u102c,\u1000\u1060\u102c,\u101b,\u104b\"\n```\n\n## API\n\n- `encode(text)` - Chop text into tokens\n- `decode(pieces)` - Glue tokens back together\n- `decode_ids(ids)` - Convert IDs back to text\n- `get_vocab_size()` - How many tokens we know\n- `get_vocab()` - The whole vocabulary\n\n## Dev Setup\n\n```bash\ngit clone git@github.com:janakhpon/mon_tokenizer.git\ncd mon_tokenizer\nuv sync --dev\nuv run pytest\n\n# Release workflow\nuv version --bump patch\ngit add pyproject.toml\ngit commit -m \"v0.1.1\"\ngit tag v0.1.1\ngit push origin main --tags\n```\n\n## License\n\nMIT - do whatever you want with it.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A simple tokenizer for Mon text",
    "version": "0.1.3",
    "project_urls": {
        "Changelog": "https://github.com/Code-Yay-Mal/mon_tokenizer/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/Code-Yay-Mal/mon_tokenizer#readme",
        "Homepage": "https://github.com/Code-Yay-Mal/mon_tokenizer",
        "Issues": "https://github.com/Code-Yay-Mal/mon_tokenizer/issues",
        "Repository": "https://github.com/Code-Yay-Mal/mon_tokenizer"
    },
    "split_keywords": [
        "mon",
        " tokenizer",
        " nlp",
        " myanmar",
        " text-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9a000a58dc0bf102aee94f6cd1032f10e7c11c7fe3185d91959589eb119ec805",
                "md5": "7f4e5e02777ddcd95f6ac3e5c603751f",
                "sha256": "d20ab80eebcf0f30f588b1cb041d79ccd484617d04156fa074ef66fb16548408"
            },
            "downloads": -1,
            "filename": "mon_tokenizer-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7f4e5e02777ddcd95f6ac3e5c603751f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 20829,
            "upload_time": "2025-08-23T15:45:37",
            "upload_time_iso_8601": "2025-08-23T15:45:37.034264Z",
            "url": "https://files.pythonhosted.org/packages/9a/00/0a58dc0bf102aee94f6cd1032f10e7c11c7fe3185d91959589eb119ec805/mon_tokenizer-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "12d732bb8692578436bb1fd4e0e7c645f3e9f9fb964ccad0b58eb151e1590028",
                "md5": "1fba68860cbaa96b9896d160efd67f61",
                "sha256": "e9e92fa96e9e479b4f538d680ff78878cd9a141e00eb87d1210d94a50b1ff5d6"
            },
            "downloads": -1,
            "filename": "mon_tokenizer-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "1fba68860cbaa96b9896d160efd67f61",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 20323,
            "upload_time": "2025-08-23T15:45:38",
            "upload_time_iso_8601": "2025-08-23T15:45:38.232896Z",
            "url": "https://files.pythonhosted.org/packages/12/d7/32bb8692578436bb1fd4e0e7c645f3e9f9fb964ccad0b58eb151e1590028/mon_tokenizer-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-23 15:45:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Code-Yay-Mal",
    "github_project": "mon_tokenizer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "mon-tokenizer"
}
        