| Name | tamil-utils |
| Version | 0.4.0 |
| download | |
| home_page | None |
| Summary | Tiny Tamil text utilities: normalize, tokenize, stopwords, graphemes, n-grams, syllables, Tamil collation; dataset preprocessor; optional spaCy tokenizer hook. |
| upload_time | 2025-09-17 14:24:34 |
| maintainer | None |
| docs_url | None |
| author | Arulnidhi Karunanidhi |
| requires_python | >=3.9 |
| license | MIT |
| keywords | tamil, indic, nlp, unicode, rag, spacy, tokenization |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# tamil-utils
Tiny **Tamil-first** text utilities that make Unicode correctness & tokenization *boringly reliable*.
[PyPI](https://pypi.org/project/tamil-utils/)
[CI](https://github.com/arulnidhii/tamil-utils/actions)
---
## Features
* **Core:** `normalize`, `tokens`, `remove_stopwords`, `graphemes`, `sents`, Tamil⇄ASCII **numerals**, **syllables** (approx), **Tamil collation** (ISO-15919 key)
* **Counts:** `ngrams`, `word_counts` (uni/bi/tri-grams, optional stopwords)
* **Pipelines:** JSONL **preprocessor** (CLI + Python) for RAG/ML corpora
* **Integrations (optional):**
  * **spaCy** tokenizer hook to mirror `tamil_utils.tokens`
  * **Hugging Face Datasets** export helper
> Docs: **[https://arulnidhii.github.io/tamil-utils/](https://arulnidhii.github.io/tamil-utils/)**
---
## Install
```bash
pip install tamil-utils
# optional extras
pip install "tamil-utils[spacy]" # spaCy hook
pip install datasets # HF datasets helper
```
---
## Quick start
```python
from tamil_utils import (
    normalize, tokens, remove_stopwords, graphemes, sents,
    to_arabic_numerals, syllables, sort_tamil, word_counts
)
s = "இது ஒரு சோதனை 👩🏽‍💻 ௨௦௨௫"
print(tokens(s)) # ['இது','ஒரு','சோதனை','👩🏽‍💻','௨௦௨௫']
print(remove_stopwords(tokens(s), preset="ta")) # ['சோதனை','👩🏽‍💻','௨௦௨௫']
print(graphemes("👩🏽‍💻")) # ['👩🏽‍💻']
print(sents("இது ஒன்று. இது இரண்டு? சரி!")) # ['இது ஒன்று.', 'இது இரண்டு?', 'சரி!']
print(to_arabic_numerals("௨௦௨௫")) # "2025"
print(syllables("தமிழ்")) # approx syllable-ish groups
print(sort_tamil(["இலங்கை","ஆதி","அடி"])) # ['அடி','ஆதி','இலங்கை']
print(word_counts("தமிழ் NLP தமிழ் NLP", n=2, top=3))
```
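The **Counts** feature also lists `ngrams`, which the snippet above doesn't exercise. A minimal sketch, assuming `ngrams(seq, n)` takes a token sequence plus an n-gram size and returns the n-grams (the exact signature isn't shown in this README, so treat it as illustrative):

```python
from tamil_utils import tokens, ngrams

# Assumed call shape: ngrams(sequence, n) -> iterable of n-gram tuples
bigrams = list(ngrams(tokens("தமிழ் NLP தமிழ் NLP"), 2))
print(bigrams)  # e.g. [('தமிழ்', 'NLP'), ('NLP', 'தமிழ்'), ('தமிழ்', 'NLP')]
```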
---
## CLI
```bash
# JSONL preprocessor (one record per line)
python -m tamil_utils.cli preprocess --numerals ar --rmstop < input.txt > out.jsonl
# Word/n-gram counts
python -m tamil_utils.cli freq -n 2 --top 5 "தமிழ் NLP தமிழ் NLP"
# Tamil collation sort (ISO-15919 key)
python -m tamil_utils.cli sort "இலங்கை" "ஆதி" "அடி"
```
### Windows PowerShell
When piping Tamil text, prefer UTF-8 files or run with `python -X utf8`.
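One way to keep the whole pipe in UTF-8 under PowerShell (a sketch, not taken from the package docs; adjust paths to your setup):

```powershell
# Put both the console and Python in UTF-8 mode before piping Tamil text
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
Get-Content .\input.txt |
  python -X utf8 -m tamil_utils.cli preprocess --numerals ar --rmstop |
  Set-Content -Encoding utf8 .\out.jsonl
```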
---
## spaCy tokenizer (optional)
```python
import spacy
from tamil_utils.spacy_hook import install_tamil_tokenizer
nlp = spacy.blank("xx")
install_tamil_tokenizer(nlp)
[t.text for t in nlp("இது ஒரு சோதனை 2025")]
# ['இது','ஒரு','சோதனை','2025']
```
---
## Hugging Face Datasets (optional)
```python
from tamil_utils.hf_export import to_hf_dataset # requires: pip install datasets
records = [{"text": "இது ஒரு சோதனை 2025",
"tokens": ["இது","ஒரு","சோதனை","2025"]}]
ds = to_hf_dataset(records)
print(ds)
```
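If `to_hf_dataset` returns a regular `datasets.Dataset` (as the `print(ds)` output suggests — an assumption, not confirmed above), the standard Datasets persistence methods apply. For example:

```python
from tamil_utils.hf_export import to_hf_dataset  # requires: pip install datasets

ds = to_hf_dataset([{"text": "இது ஒரு சோதனை 2025",
                     "tokens": ["இது", "ஒரு", "சோதனை", "2025"]}])
ds.save_to_disk("tamil_corpus")    # Arrow files on disk
ds.to_json("tamil_corpus.jsonl")   # JSON Lines export
```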
## What’s new in v0.4

* **Corpus utilities:** `normalize_punct`, `dedup_lines`, `filter_by_length`, `window_sents`
* **New CLI subcommands:** `corpus-dedup`, `corpus-filter`, `corpus-windows` (see the sketch below)
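The invocations below are hypothetical: they assume the new `corpus-*` subcommands follow the same stdin-to-stdout pattern as `preprocess` above; their actual options are not documented here, so check the CLI's `--help` output.

```bash
# Hypothetical usage, assuming the corpus-* subcommands read stdin and write stdout
# like `preprocess`; no options shown, since they are not documented in this README
python -m tamil_utils.cli corpus-dedup   < corpus.txt   > deduped.txt
python -m tamil_utils.cli corpus-filter  < deduped.txt  > filtered.txt
python -m tamil_utils.cli corpus-windows < filtered.txt > windows.jsonl
```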
## License
MIT © Arulnidhi Karunanidhi
Raw data
{
"_id": null,
"home_page": null,
"name": "tamil-utils",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "tamil, indic, nlp, unicode, rag, spacy, tokenization",
"author": "Arulnidhi Karunanidhi",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/c7/28/7a87fe56a628d1e9077294d1d10a8a2e9d802acc89c1bbf806a8c3ac2dae/tamil_utils-0.4.0.tar.gz",
"platform": null,
"description": "# tamil-utils\r\n\r\nTiny **Tamil-first** text utilities that make Unicode correctness & tokenization *boringly reliable*.\r\n\r\n[](https://pypi.org/project/tamil-utils/)\r\n[](https://github.com/arulnidhii/tamil-utils/actions)\r\n\r\n---\r\n\r\n## Features\r\n\r\n* **Core:** `normalize`, `tokens`, `remove_stopwords`, `graphemes`, `sents`, Tamil\u21c4ASCII **numerals**, **syllables** (approx), **Tamil collation** (ISO-15919 key)\r\n* **Counts:** `ngrams`, `word_counts` (uni/bi/tri-grams, optional stopwords)\r\n* **Pipelines:** JSONL **preprocessor** (CLI + Python) for RAG/ML corpora\r\n* **Integrations (optional):**\r\n\r\n * **spaCy** tokenizer hook to mirror `tamil_utils.tokens`\r\n * **Hugging Face Datasets** export helper\r\n\r\n> Docs: **[https://arulnidhii.github.io/tamil-utils/](https://arulnidhii.github.io/tamil-utils/)**\r\n\r\n---\r\n\r\n## Install\r\n\r\n```bash\r\npip install tamil-utils\r\n\r\n# optional extras\r\npip install \"tamil-utils[spacy]\" # spaCy hook\r\npip install datasets # HF datasets helper\r\n```\r\n\r\n---\r\n\r\n## Quick start\r\n\r\n```python\r\nfrom tamil_utils import (\r\n normalize, tokens, remove_stopwords, graphemes, sents,\r\n to_arabic_numerals, syllables, sort_tamil, word_counts\r\n)\r\n\r\ns = \"\u0b87\u0ba4\u0bc1 \u0b92\u0bb0\u0bc1 \u0b9a\u0bcb\u0ba4\u0ba9\u0bc8 \ud83d\udc69\ud83c\udffd\u200d\ud83d\udcbb \u0be8\u0be6\u0be8\u0beb\"\r\n\r\nprint(tokens(s)) # ['\u0b87\u0ba4\u0bc1','\u0b92\u0bb0\u0bc1','\u0b9a\u0bcb\u0ba4\u0ba9\u0bc8','\ud83d\udc69\ud83c\udffd\u200d\ud83d\udcbb','\u0be8\u0be6\u0be8\u0beb']\r\nprint(remove_stopwords(tokens(s), preset=\"ta\")) # ['\u0b9a\u0bcb\u0ba4\u0ba9\u0bc8','\ud83d\udc69\ud83c\udffd\u200d\ud83d\udcbb','\u0be8\u0be6\u0be8\u0beb']\r\nprint(graphemes(\"\ud83d\udc69\ud83c\udffd\u200d\ud83d\udcbb\")) # ['\ud83d\udc69\ud83c\udffd\u200d\ud83d\udcbb']\r\nprint(sents(\"\u0b87\u0ba4\u0bc1 \u0b92\u0ba9\u0bcd\u0bb1\u0bc1. \u0b87\u0ba4\u0bc1 \u0b87\u0bb0\u0ba3\u0bcd\u0b9f\u0bc1? 
\u0b9a\u0bb0\u0bbf!\")) # ['\u0b87\u0ba4\u0bc1 \u0b92\u0ba9\u0bcd\u0bb1\u0bc1.', '\u0b87\u0ba4\u0bc1 \u0b87\u0bb0\u0ba3\u0bcd\u0b9f\u0bc1?', '\u0b9a\u0bb0\u0bbf!']\r\nprint(to_arabic_numerals(\"\u0be8\u0be6\u0be8\u0beb\")) # \"2025\"\r\nprint(syllables(\"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd\")) # approx syllable-ish groups\r\nprint(sort_tamil([\"\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8\",\"\u0b86\u0ba4\u0bbf\",\"\u0b85\u0b9f\u0bbf\"])) # ['\u0b85\u0b9f\u0bbf','\u0b86\u0ba4\u0bbf','\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8']\r\nprint(word_counts(\"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd NLP \u0ba4\u0bae\u0bbf\u0bb4\u0bcd NLP\", n=2, top=3))\r\n```\r\n\r\n---\r\n\r\n## CLI\r\n\r\n```bash\r\n# JSONL preprocessor (one record per line)\r\npython -m tamil_utils.cli preprocess --numerals ar --rmstop < input.txt > out.jsonl\r\n\r\n# Word/n-gram counts\r\npython -m tamil_utils.cli freq -n 2 --top 5 \"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd NLP \u0ba4\u0bae\u0bbf\u0bb4\u0bcd NLP\"\r\n\r\n# Tamil collation sort (ISO-15919 key)\r\npython -m tamil_utils.cli sort \"\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8\" \"\u0b86\u0ba4\u0bbf\" \"\u0b85\u0b9f\u0bbf\"\r\n```\r\n\r\n### Windows PowerShell\r\n\r\nWhen piping Tamil text, prefer UTF-8 files or run with `python -X utf8`.\r\n\r\n---\r\n\r\n## spaCy tokenizer (optional)\r\n\r\n```python\r\nimport spacy\r\nfrom tamil_utils.spacy_hook import install_tamil_tokenizer\r\n\r\nnlp = spacy.blank(\"xx\")\r\ninstall_tamil_tokenizer(nlp)\r\n[t.text for t in nlp(\"\u0b87\u0ba4\u0bc1 \u0b92\u0bb0\u0bc1 \u0b9a\u0bcb\u0ba4\u0ba9\u0bc8 2025\")]\r\n# ['\u0b87\u0ba4\u0bc1','\u0b92\u0bb0\u0bc1','\u0b9a\u0bcb\u0ba4\u0ba9\u0bc8','2025']\r\n```\r\n\r\n---\r\n\r\n## Hugging Face Datasets (optional)\r\n\r\n```python\r\nfrom tamil_utils.hf_export import to_hf_dataset # requires: pip install datasets\r\n\r\nrecords = [{\"text\": \"\u0b87\u0ba4\u0bc1 \u0b92\u0bb0\u0bc1 \u0b9a\u0bcb\u0ba4\u0ba9\u0bc8 2025\",\r\n \"tokens\": [\"\u0b87\u0ba4\u0bc1\",\"\u0b92\u0bb0\u0bc1\",\"\u0b9a\u0bcb\u0ba4\u0ba9\u0bc8\",\"2025\"]}]\r\nds = to_hf_dataset(records)\r\nprint(ds)\r\n```\r\n## What\u2019s new in v0.4\u201d bullet:\r\n\r\nCorpus utilities: normalize_punct, dedup_lines, filter_by_length, window_sents\r\n\r\nNew CLI: corpus-dedup, corpus-filter, corpus-windows\r\n\r\n## License\r\n\r\nMIT \u00a9 Arulnidhi Karunanidhi\r\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Tiny Tamil text utilities: normalize, tokenize, stopwords, graphemes, n-grams, syllables, Tamil collation; dataset preprocessor; optional spaCy tokenizer hook.",
"version": "0.4.0",
"project_urls": {
"Documentation": "https://arulnidhii.github.io/tamil-utils/",
"Homepage": "https://github.com/arulnidhii/tamil-utils",
"Issues": "https://github.com/arulnidhii/tamil-utils/issues",
"Source": "https://github.com/arulnidhii/tamil-utils"
},
"split_keywords": [
"tamil",
" indic",
" nlp",
" unicode",
" rag",
" spacy",
" tokenization"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "ae7f4bc051d1a9ef5a76068f9a4d36b99135b35d28122da9702bca8d49fb72fd",
"md5": "ccfd88ff9cac47ee8d2d7c457aa9a4c6",
"sha256": "be63955f9cda539c5e735243ccf9c547a9afa0fff74713de2dfdb60e4c03bcf1"
},
"downloads": -1,
"filename": "tamil_utils-0.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ccfd88ff9cac47ee8d2d7c457aa9a4c6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 17110,
"upload_time": "2025-09-17T14:24:32",
"upload_time_iso_8601": "2025-09-17T14:24:32.810372Z",
"url": "https://files.pythonhosted.org/packages/ae/7f/4bc051d1a9ef5a76068f9a4d36b99135b35d28122da9702bca8d49fb72fd/tamil_utils-0.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "c7287a87fe56a628d1e9077294d1d10a8a2e9d802acc89c1bbf806a8c3ac2dae",
"md5": "b93de327bf8b286b5c7b642f0b2db28f",
"sha256": "870ef74bdde86b8dd28db36c0efb9f241b9635a1e0ce170cb8bc8aea1a2f0804"
},
"downloads": -1,
"filename": "tamil_utils-0.4.0.tar.gz",
"has_sig": false,
"md5_digest": "b93de327bf8b286b5c7b642f0b2db28f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 19773,
"upload_time": "2025-09-17T14:24:34",
"upload_time_iso_8601": "2025-09-17T14:24:34.321519Z",
"url": "https://files.pythonhosted.org/packages/c7/28/7a87fe56a628d1e9077294d1d10a8a2e9d802acc89c1bbf806a8c3ac2dae/tamil_utils-0.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-17 14:24:34",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "arulnidhii",
"github_project": "tamil-utils",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "tamil-utils"
}