sinlib


Name: sinlib
Version: 0.1.10
home_page: None
Summary: Sinhala NLP Toolkit
upload_time: 2025-08-10 04:54:58
maintainer: None
docs_url: None
author: None
requires_python: <=3.12,>=3.9.7
license: MIT License Copyright (c) [2024] [Ransaka Ravihara] Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords: nlp, sinhala, python
VCS:
bugtrack_url:
requirements: numpy, torch, tqdm, huggingface_hub
Travis-CI: No Travis.
coveralls test coverage: No coveralls.
            
# SINLIB

<div align="center">

![Sinlib Logo](welcome.png)

[![PyPI version](https://badge.fury.io/py/sinlib.svg)](https://badge.fury.io/py/sinlib)
[![Python Versions](https://img.shields.io/pypi/pyversions/sinlib.svg)](https://pypi.org/project/sinlib/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A Python library for Sinhala text processing and analysis
</div>

> **Note:** The `Romanizer` and `Transliterator` modules are temporarily unavailable due to a potential bug. We are working to resolve this and will restore them in a future update.

## Overview

Sinlib is a Python library for processing and analyzing Sinhala text. It provides tools for tokenization, preprocessing, spell checking (beta), and romanization to support natural language processing tasks for the Sinhala language. Romanization is temporarily unavailable, as noted above.

## Features

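- **Tokenizer**: Tokenization for Sinhala text
- **Preprocessor**: Text preprocessing utilities, including Sinhala character ratio analysis
- **Spell Checker (beta)**: Typo detection and spelling suggestions for Sinhala words
- **Romanizer**: Convert Sinhala text to Roman characters (temporarily unavailable)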

## Installation

Install the latest stable version from PyPI:

```bash
pip install sinlib
```

## Usage Examples

### Tokenizer

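Train a tokenizer on a corpus, then split Sinhala text into tokens. In the sample output below, each token is a base letter together with any vowel signs attached to it: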

```python
from sinlib import Tokenizer

# Sample Sinhala text
corpus = """මේ අතර, පෙබරවාරි මාසයේ පළමු දින 08 තුළ පමණක් විදෙස් සංචාරකයන් 60,122 දෙනෙකු මෙරටට පැමිණ තිබේ.
ඒ අනුව මේ වසරේ ගත වූ කාලය තුළ සංචාරකයන් 268‍,375 දෙනෙකු දිවයිනට පැමිණ ඇති බව සංචාරක සංවර්ධන අධිකාරිය සඳහන් කරයි.
ඉන් වැඩි ම සංචාරකයන් පිරිසක් ඉන්දියාවෙන් පැමිණ ඇති අතර, එම සංඛ්‍යාව 42,768කි.
ඊට අමතර ව රුසියාවෙන් සංචාරකයන් 39,914ක්, බ්‍රිතාන්‍යයෙන් 22,278ක් සහ ජර්මනියෙන් සංචාරකයන් 18,016 දෙනෙකු පැමිණ ඇති බව වාර්තා වේ."""

# Initialize and train the tokenizer
tokenizer = Tokenizer()
tokenizer.train([corpus])

# Encode text into tokens
encoding = tokenizer("මේ අතර, පෙබරවාරි මාසයේ පළමු")

# Map each token ID back to its token string
# (named token_id to avoid shadowing the built-in id())
tokens = [tokenizer.token_id_to_token_map[token_id] for token_id in encoding]
print(tokens)
# Output: ['මේ', ' ', 'අ', 'ත', 'ර', ',', ' ', 'පෙ', 'බ', 'ර', 'වා', 'රි', ' ', 'මා', 'ස', 'යේ', ' ', 'ප', 'ළ', 'මු']
```
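
To make the token granularity in the sample output more concrete, here is a minimal standalone sketch that groups each base character with the combining marks that follow it. It is not Sinlib's implementation: the function name `split_clusters` is hypothetical, and the rule (attach Unicode categories `Mn` and `Mc` to the preceding character) is an assumption chosen for illustration. For the sample sentence above it happens to produce the same token boundaries, but it does not handle conjuncts joined with a zero-width joiner and involves no training step.

```python
import unicodedata


def split_clusters(text):
    """Group each character with the combining marks that follow it.

    Hypothetical helper for illustration only; not part of Sinlib.
    """
    clusters = []
    for ch in text:
        # Sinhala vowel signs and the virama have category Mn or Mc.
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(split_clusters("මේ අතර, පෙබරවාරි මාසයේ පළමු"))
```

A trained tokenizer additionally assigns an integer ID to each token, which is why the library example maps IDs back to strings through `token_id_to_token_map`.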

### Preprocessor

Analyze Sinhala character ratio in text:

```python
from sinlib.preprocessing import get_sinhala_character_ratio

# Sample sentences with varying Sinhala content
sentences = [
    'මෙය සිංහල වාක්‍යක්',                                  # Full Sinhala
    'මෙය සිංහල වාක්‍යක් සමග english character කීපයක්',     # Mixed Sinhala and English
    'This is a complete English sentence'                   # Full English
]

# Calculate Sinhala character ratio for each sentence
ratios = get_sinhala_character_ratio(sentences)
print(ratios)
# Output: [0.9, 0.46875, 0.0]
```

### Spell Checker (beta)

Detect typos and get spelling suggestions for Sinhala words using n-gram models:

```python
from sinlib.spellcheck import TypoDetector

# Initialize the typo detector
typo_detector = TypoDetector()

# Check the spelling of a word
result = typo_detector("අඩිරාජයාගේ")
print(result)
# Output: ['අධිරාජයාගේ', 'අධිරාජ්\u200dයයාගේ', 'අධිරාජයා']
# The result is the word itself if it is spelled correctly,
# or a list of suggestions if it is a potential typo.
```
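
As a rough illustration of how n-grams can drive spelling suggestions, the sketch below ranks candidate words by the overlap of their character bigrams with the input word (Jaccard similarity). Everything in it is an assumption made for illustration: the function names `char_ngrams` and `suggest`, the tiny vocabulary, and the similarity measure are hypothetical, and the README does not say whether `TypoDetector` uses character-level or word-level n-grams or how it scores candidates.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def suggest(word, vocabulary, n=2, top_k=3):
    """Return the word if it is in the vocabulary, else ranked suggestions.

    Hypothetical helper for illustration only; not part of Sinlib.
    """
    if word in vocabulary:
        return word
    target = char_ngrams(word, n)
    scored = []
    for candidate in vocabulary:
        grams = char_ngrams(candidate, n)
        # Jaccard similarity: shared n-grams over all distinct n-grams.
        score = len(target & grams) / len(target | grams)
        scored.append((score, candidate))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidate for score, candidate in scored[:top_k] if score > 0]


# Hypothetical vocabulary; a real detector would load a much larger word list.
vocabulary = ["අධිරාජයාගේ", "අධිරාජයා", "රජයේ"]
print(suggest("අඩිරාජයාගේ", vocabulary))
```

Like the library example, the sketch returns the input unchanged when it is a known word and a list of candidates otherwise.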

### Romanizer

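Convert Sinhala text to Roman characters. Per the note at the top of this README, this module is temporarily unavailable, so the example below may not run until it is restored: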

```python
from sinlib import Romanizer

# Sample texts with Sinhala content
texts = [
    "hello, මේ මාසයේ ගත වූ දින 15ක කාලය තුළ කොළඹ නගරය ආශ්‍රිත ව",
    "මෑතකාලීන ව රට මුහුණ දුන් අභියෝගාත්මකම ආර්ථික කාරණාව ණය ප්‍රතිව්‍යුගතකරණය බව"
]

# Initialize the romanizer
romanizer = Romanizer(char_mapper_fp=None, tokenizer_vocab_path=None)

# Romanize the texts
romanized_texts = romanizer(texts)
print(romanized_texts)
# Output:
# ['hello, me masaye gatha wu dina 15ka kalaya thula kolaba nagaraya ashritha wa',
#  'methakaleena wa rata muhuna dun abhiyogathmakama arthika karanawa naya prathiwyugathakaranaya bawa']
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgements

- Thanks to all contributors who have helped with the development of Sinlib
- Special thanks to the Sinhala NLP community for their support and feedback

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "sinlib",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<=3.12,>=3.9.7",
    "maintainer_email": null,
    "keywords": "NLP, Sinhala, python",
    "author": null,
    "author_email": "Ransaka <ransaka.ravihara@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/88/ca/23c65f7128e3227d27a3c34198b59e8a196f8e92e02937ad4a005bb2e181/sinlib-0.1.10.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License\n        \n        Copyright (c) [2024] [Ransaka Ravihara]\n        \n        Permission is hereby granted, free of charge, to any person obtaining a copy\n        of this software and associated documentation files (the \"Software\"), to deal\n        in the Software without restriction, including without limitation the rights\n        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n        copies of the Software, and to permit persons to whom the Software is\n        furnished to do so, subject to the following conditions:\n        \n        The above copyright notice and this permission notice shall be included in all\n        copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n        SOFTWARE.",
    "summary": "Sinhala NLP Toolkit",
    "version": "0.1.10",
    "project_urls": {
        "Code": "https://github.com/Ransaka/sinlib",
        "Docs": "https://github.com/Ransaka/sinlib"
    },
    "split_keywords": [
        "nlp",
        " sinhala",
        " python"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "c84bf23d5d3ebc7434bdc0812d3970c32818579f8d16e8bbbba969aab21019bc",
                "md5": "668f3326dfa4e6c0bbb61b32e463cb43",
                "sha256": "26910a576dc2dff33837835b42778a1863d4b483ea7e347bfeb34e468cfa4f3b"
            },
            "downloads": -1,
            "filename": "sinlib-0.1.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "668f3326dfa4e6c0bbb61b32e463cb43",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<=3.12,>=3.9.7",
            "size": 4221234,
            "upload_time": "2025-08-10T04:54:57",
            "upload_time_iso_8601": "2025-08-10T04:54:57.141580Z",
            "url": "https://files.pythonhosted.org/packages/c8/4b/f23d5d3ebc7434bdc0812d3970c32818579f8d16e8bbbba969aab21019bc/sinlib-0.1.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "88ca23c65f7128e3227d27a3c34198b59e8a196f8e92e02937ad4a005bb2e181",
                "md5": "37ec2a9134767282c72de2329306a988",
                "sha256": "2bdd241486d773f0e8046f5d0207a99a23a29e6109bcff1b21b05c2e142fd67f"
            },
            "downloads": -1,
            "filename": "sinlib-0.1.10.tar.gz",
            "has_sig": false,
            "md5_digest": "37ec2a9134767282c72de2329306a988",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<=3.12,>=3.9.7",
            "size": 4367153,
            "upload_time": "2025-08-10T04:54:58",
            "upload_time_iso_8601": "2025-08-10T04:54:58.485015Z",
            "url": "https://files.pythonhosted.org/packages/88/ca/23c65f7128e3227d27a3c34198b59e8a196f8e92e02937ad4a005bb2e181/sinlib-0.1.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-10 04:54:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Ransaka",
    "github_project": "sinlib",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.24.0"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.64.1"
                ]
            ]
        },
        {
            "name": "huggingface_hub",
            "specs": [
                [
                    "==",
                    "0.26.2"
                ]
            ]
        }
    ],
    "lcname": "sinlib"
}
        