tamil-utils


Name: tamil-utils
Version: 0.4.0
Summary: Tiny Tamil text utilities: normalize, tokenize, stopwords, graphemes, n-grams, syllables, Tamil collation; dataset preprocessor; optional spaCy tokenizer hook.
Author: Arulnidhi Karunanidhi
Requires-Python: >=3.9
License: MIT
Keywords: tamil, indic, nlp, unicode, rag, spacy, tokenization
Upload time: 2025-09-17 14:24:34
# tamil-utils

Tiny **Tamil-first** text utilities that make Unicode correctness & tokenization *boringly reliable*.

[![PyPI](https://img.shields.io/pypi/v/tamil-utils)](https://pypi.org/project/tamil-utils/)
[![CI](https://github.com/arulnidhii/tamil-utils/actions/workflows/ci.yml/badge.svg)](https://github.com/arulnidhii/tamil-utils/actions)

---

## Features

* **Core:** `normalize`, `tokens`, `remove_stopwords`, `graphemes`, `sents`, Tamil⇄ASCII **numerals**, **syllables** (approx), **Tamil collation** (ISO-15919 key)
* **Counts:** `ngrams`, `word_counts` (uni/bi/tri-grams, optional stopwords)
* **Pipelines:** JSONL **preprocessor** (CLI + Python) for RAG/ML corpora
* **Integrations (optional):**

  * **spaCy** tokenizer hook to mirror `tamil_utils.tokens`
  * **Hugging Face Datasets** export helper

> Docs: **[https://arulnidhii.github.io/tamil-utils/](https://arulnidhii.github.io/tamil-utils/)**

---

## Install

```bash
pip install tamil-utils

# optional extras
pip install "tamil-utils[spacy]"   # spaCy hook
pip install datasets               # HF datasets helper
```

---

## Quick start

```python
from tamil_utils import (
    normalize, tokens, remove_stopwords, graphemes, sents,
    to_arabic_numerals, syllables, sort_tamil, word_counts
)

s = "இது ஒரு சோதனை 👩🏽‍💻 ௨௦௨௫"

print(tokens(s))                                # ['இது','ஒரு','சோதனை','👩🏽‍💻','௨௦௨௫']
print(remove_stopwords(tokens(s), preset="ta")) # ['சோதனை','👩🏽‍💻','௨௦௨௫']
print(graphemes("👩🏽‍💻"))                       # ['👩🏽‍💻']
print(sents("இது ஒன்று. இது இரண்டு? சரி!"))      # ['இது ஒன்று.', 'இது இரண்டு?', 'சரி!']
print(to_arabic_numerals("௨௦௨௫"))                 # "2025"
print(syllables("தமிழ்"))                         # approx syllable-ish groups
print(sort_tamil(["இலங்கை","ஆதி","அடி"]))         # ['அடி','ஆதி','இலங்கை']
print(word_counts("தமிழ் NLP தமிழ் NLP", n=2, top=3))
```
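Tamil digits occupy a contiguous Unicode range (U+0BE6 through U+0BEF), so the numeral conversion shown above can be sketched with a plain translation table. This is a stand-alone re-implementation for illustration, not the library's actual code:

```python
# Tamil digits ௦..௯ map directly onto ASCII 0..9 via a translation table.
TAMIL_TO_ARABIC = {0x0BE6 + i: str(i) for i in range(10)}

def to_arabic(s: str) -> str:
    """Replace Tamil digits with ASCII digits; all other characters pass through."""
    return s.translate(TAMIL_TO_ARABIC)

print(to_arabic("௨௦௨௫"))       # 2025
print(to_arabic("இது 2025"))   # unchanged: no Tamil digits present
```

`str.translate` with a code-point keyed dict keeps the conversion a single pass with no regex needed.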

---

## CLI

```bash
# JSONL preprocessor (one record per line)
python -m tamil_utils.cli preprocess --numerals ar --rmstop < input.txt > out.jsonl

# Word/n-gram counts
python -m tamil_utils.cli freq -n 2 --top 5 "தமிழ் NLP தமிழ் NLP"

# Tamil collation sort (ISO-15919 key)
python -m tamil_utils.cli sort "இலங்கை" "ஆதி" "அடி"
```

### Windows PowerShell

When piping Tamil text, prefer UTF-8 files or run with `python -X utf8`.
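For example, you can confirm UTF-8 mode is active before running the CLI (the preprocess invocation below is illustrative):

```bash
# Prints 1 when -X utf8 takes effect
python -X utf8 -c "import sys; print(sys.flags.utf8_mode)"

# Then pipe through the preprocessor under UTF-8 mode:
# python -X utf8 -m tamil_utils.cli preprocess --numerals ar < input.txt > out.jsonl
```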

---

## spaCy tokenizer (optional)

```python
import spacy
from tamil_utils.spacy_hook import install_tamil_tokenizer

nlp = spacy.blank("xx")
install_tamil_tokenizer(nlp)
[t.text for t in nlp("இது ஒரு சோதனை 2025")]
# ['இது','ஒரு','சோதனை','2025']
```

---

## Hugging Face Datasets (optional)

```python
from tamil_utils.hf_export import to_hf_dataset  # requires: pip install datasets

records = [{"text": "இது ஒரு சோதனை 2025",
            "tokens": ["இது","ஒரு","சோதனை","2025"]}]
ds = to_hf_dataset(records)
print(ds)
```
---

## What’s new in v0.4

* **Corpus utilities:** `normalize_punct`, `dedup_lines`, `filter_by_length`, `window_sents`
* **New CLI:** `corpus-dedup`, `corpus-filter`, `corpus-windows`
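The corpus helpers are not documented here yet; as a rough sketch of what utilities with these names typically do (names mirror the release notes, signatures are guesses, not the actual API):

```python
def dedup_lines(lines):
    """Yield each distinct line once, preserving first-seen order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

def filter_by_length(lines, min_len=1, max_len=10_000):
    """Keep lines whose character length falls within [min_len, max_len]."""
    return [line for line in lines if min_len <= len(line) <= max_len]

def window_sents(sents, size=3, stride=1):
    """Slide a fixed-size window over a sentence list (e.g. for RAG chunking)."""
    return [sents[i:i + size]
            for i in range(0, max(len(sents) - size + 1, 1), stride)]

lines = ["அடி", "அடி", "ஆதி", ""]
print(list(dedup_lines(lines)))            # duplicates collapsed
print(filter_by_length(lines, min_len=1))  # empty line dropped
```

Overlapping windows (stride < size) are a common default for retrieval chunking, since they keep sentence context on both sides of each chunk boundary.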

## License

MIT © Arulnidhi Karunanidhi

            
