Name | sinlib |
Version | 0.1.10 |
home_page | None |
Summary | Sinhala NLP Toolkit |
upload_time | 2025-08-10 04:54:58 |
maintainer | None |
docs_url | None |
author | None |
requires_python | <=3.12,>=3.9.7 |
license | MIT License
Copyright (c) [2024] [Ransaka Ravihara]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE. |
keywords |
nlp
sinhala
python
|
VCS | https://github.com/Ransaka/sinlib |
bugtrack_url | None |
requirements |
numpy>=1.24.0
torch>=2.0.0
tqdm>=4.64.1
huggingface_hub==0.26.2
|
Travis-CI |
No Travis.
|
coveralls test coverage |
No coveralls.
|
# SINLIB
<div align="center">

A Python library for Sinhala text processing and analysis
</div>
> **Note:** The `Romanizer` and `Transliterator` modules are temporarily unavailable due to a potential bug. We are working to resolve this and will restore them in a future update.
## Overview
Sinlib is a Python library for processing and analyzing Sinhala text. It provides tools for tokenization, preprocessing, spell checking (beta), and romanization to support natural language processing tasks for the Sinhala language.
## Features
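- **Tokenizer**: Tokenization for Sinhala text
- **Preprocessor**: Text preprocessing utilities, including Sinhala character ratio analysis
- **Spell Checker (beta)**: Typo detection and spelling suggestions based on n-gram models
- **Romanizer**: Convert Sinhala text to Roman characters (temporarily unavailable; see the note above)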
## Installation
Install the latest stable version from PyPI:
```bash
pip install sinlib
```
## Usage Examples
### Tokenizer
Split Sinhala text into meaningful tokens:
```python
from sinlib import Tokenizer
# Sample Sinhala text
corpus = """මේ අතර, පෙබරවාරි මාසයේ පළමු දින 08 තුළ පමණක් විදෙස් සංචාරකයන් 60,122 දෙනෙකු මෙරටට පැමිණ තිබේ.
ඒ අනුව මේ වසරේ ගත වූ කාලය තුළ සංචාරකයන් 268,375 දෙනෙකු දිවයිනට පැමිණ ඇති බව සංචාරක සංවර්ධන අධිකාරිය සඳහන් කරයි.
ඉන් වැඩි ම සංචාරකයන් පිරිසක් ඉන්දියාවෙන් පැමිණ ඇති අතර, එම සංඛ්‍යාව 42,768කි.
ඊට අමතර ව රුසියාවෙන් සංචාරකයන් 39,914ක්, බ්‍රිතාන්‍යයෙන් 22,278ක් සහ ජර්මනියෙන් සංචාරකයන් 18,016 දෙනෙකු පැමිණ ඇති බව වාර්තා වේ."""
# Initialize and train the tokenizer
tokenizer = Tokenizer()
tokenizer.train([corpus])
# Encode text into tokens
encoding = tokenizer("මේ අතර, පෙබරවාරි මාසයේ පළමු")
# Map each token ID back to its token string
tokens = [tokenizer.token_id_to_token_map[token_id] for token_id in encoding]
print(tokens)
# Output: ['මේ', ' ', 'අ', 'ත', 'ර', ',', ' ', 'පෙ', 'බ', 'ර', 'වා', 'රි', ' ', 'මා', 'ස', 'යේ', ' ', 'ප', 'ළ', 'මු']
```
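The sample output above suggests that each token is a base letter together with any vowel signs attached to it. As a rough illustration of that idea, the sketch below groups a string into such units using only the standard library. It is not sinlib's implementation: the function name `split_into_clusters` and the grouping rule are assumptions made for this example.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, used in Sinhala conjunct letters


def split_into_clusters(text):
    """Group each base character with the combining marks that follow it.

    Illustrative only: an assumed rule, not sinlib's tokenizer.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = bool(clusters) and clusters[-1].endswith(ZWJ)
        if clusters and (is_mark or ch == ZWJ or joins_previous):
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


print(split_into_clusters("මේ අතර"))
```

Unlike a trained tokenizer, this rule needs no vocabulary, so it cannot map clusters to IDs; it only shows how vowel signs can be kept with their base letters.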
### Preprocessor
Analyze Sinhala character ratio in text:
```python
from sinlib.preprocessing import get_sinhala_character_ratio
# Sample sentences with varying Sinhala content
sentences = [
'මෙය සිංහල වාක්‍යක්', # Full Sinhala
'මෙය සිංහල වාක්‍යක් සමග english character කීපයක්', # Mixed Sinhala and English
'This is a complete English sentence' # Full English
]
# Calculate Sinhala character ratio for each sentence
ratios = get_sinhala_character_ratio(sentences)
print(ratios)
# Output: [0.9, 0.46875, 0.0]
```
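To make the idea of a Sinhala character ratio concrete, here is a minimal sketch that counts how many non-whitespace characters fall in the Sinhala Unicode block (U+0D80 to U+0DFF). The helper name `sinhala_ratio` and the choice to ignore whitespace are assumptions for illustration; `get_sinhala_character_ratio` may count characters differently, so the values are not expected to match the output above.

```python
def sinhala_ratio(text):
    """Return the share of non-whitespace characters in the Sinhala block.

    Illustrative only: an assumed definition, not sinlib's implementation.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0  # avoid dividing by zero on empty or blank input
    sinhala = sum(1 for ch in visible if "\u0d80" <= ch <= "\u0dff")
    return sinhala / len(visible)


print(sinhala_ratio("This is a complete English sentence"))  # 0.0
```

A ratio like this can serve as a simple filter, for example to keep only sentences that are mostly Sinhala before training a tokenizer.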
### Spell Checker (beta)
Detect typos and get spelling suggestions for Sinhala words using n-gram models:
```python
from sinlib.spellcheck import TypoDetector
# Initialize the typo detector
typo_detector = TypoDetector()
# Check spelling of a word
result = typo_detector("අඩිරාජයාගේ")
print(result)
# Output: ['අධිරාජයාගේ', 'අධිරාජ්\u200dයයාගේ', 'අධිරාජයා']
# The result is either the word itself if it is correct, or a list of suggestions if it is a potential typo.
```
```
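The description above mentions n-gram models without showing how they can drive suggestions. One common approach is to rank known words by how many character n-grams they share with the input. The sketch below illustrates that general idea only; it is not `TypoDetector`'s implementation, and `char_ngrams`, `suggest`, and the tiny vocabulary are hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def suggest(word, vocabulary, n=2, top_k=3):
    """Return the word if it is known, else up to top_k similar known words.

    Similarity is the Jaccard overlap of character n-gram sets (an assumed
    scoring rule chosen for illustration).
    """
    if word in vocabulary:
        return word
    target = char_ngrams(word, n)
    scored = []
    for candidate in vocabulary:
        grams = char_ngrams(candidate, n)
        score = len(target & grams) / len(target | grams)
        scored.append((score, candidate))
    # Highest score first; ties broken alphabetically for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidate for score, candidate in scored[:top_k] if score > 0]


# Hypothetical vocabulary built from the words shown in the example above
vocabulary = ["අධිරාජයාගේ", "අධිරාජයා", "සිංහල"]
print(suggest("අඩිරාජයාගේ", vocabulary))
```

Returning either the word or a list mirrors the behaviour described for the spell checker; a real system would also need a much larger vocabulary and a better-founded scoring model.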
### Romanizer
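Convert Sinhala text to Roman characters. This module is temporarily unavailable, as noted at the top of this README; the example shows the intended usage: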
```python
from sinlib import Romanizer
# Sample texts with Sinhala content
texts = [
"hello, මේ මාසයේ ගත වූ දින 15ක කාලය තුළ කොළඹ නගරය ආශ්‍රිත ව",
"මෑතකාලීන ව රට මුහුණ දුන් අභියෝගාත්මකම ආර්ථික කාරණාව ණය ප්‍රතිව්‍යුගතකරණය බව"
]
# Initialize the romanizer
romanizer = Romanizer(char_mapper_fp=None, tokenizer_vocab_path=None)
# Romanize the texts
romanized_texts = romanizer(texts)
print(romanized_texts)
# Output:
# ['hello, me masaye gatha wu dina 15ka kalaya thula kolaba nagaraya ashritha wa',
# 'methakaleena wa rata muhuna dun abhiyogathmakama arthika karanawa naya prathiwyugathakaranaya bawa']
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgements
- Thanks to all contributors who have helped with the development of Sinlib
- Special thanks to the Sinhala NLP community for their support and feedback
Raw data
{
"_id": null,
"home_page": null,
"name": "sinlib",
"maintainer": null,
"docs_url": null,
"requires_python": "<=3.12,>=3.9.7",
"maintainer_email": null,
"keywords": "NLP, Sinhala, python",
"author": null,
"author_email": "Ransaka <ransaka.ravihara@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/88/ca/23c65f7128e3227d27a3c34198b59e8a196f8e92e02937ad4a005bb2e181/sinlib-0.1.10.tar.gz",
"platform": null,
"description": "\n# SINLIB\n\n<div align=\"center\">\n\n\n\n[](https://badge.fury.io/py/sinlib)\n[](https://pypi.org/project/sinlib/)\n[](https://opensource.org/licenses/MIT)\n\nA Python library for Sinhala text processing and analysis\n</div>\n\n> **Note:** The `Romanizer` and `Transliterator` modules are temporarily unavailable due to a potential bug. We are working to resolve this and will restore them in a future update.\n\n## Overview\n\nSinlib is a specialized Python library designed for processing and analyzing Sinhala text. It provides tools for tokenization, preprocessing, and romanization to facilitate natural language processing tasks for the Sinhala language.\n\n## Features\n\n- **Tokenizer**: Tokenization for Sinhala text\n- **Preprocessor**: Text preprocessing utilities including Sinhala character ratio analysis\n- **Romanizer**: Convert Sinhala text to Roman characters\n\n## Installation\n\nInstall the latest stable version from PyPI:\n\n```bash\npip install sinlib\n```\n\n## Usage Examples\n\n### Tokenizer\n\nSplit Sinhala text into meaningful tokens:\n\n```python\nfrom sinlib import Tokenizer\n\n# Sample Sinhala text\ncorpus = \"\"\"\u0db8\u0dda \u0d85\u0dad\u0dbb, \u0db4\u0dd9\u0db6\u0dbb\u0dc0\u0dcf\u0dbb\u0dd2 \u0db8\u0dcf\u0dc3\u0dba\u0dda \u0db4\u0dc5\u0db8\u0dd4 \u0daf\u0dd2\u0db1 08 \u0dad\u0dd4\u0dc5 \u0db4\u0db8\u0dab\u0d9a\u0dca \u0dc0\u0dd2\u0daf\u0dd9\u0dc3\u0dca \u0dc3\u0d82\u0da0\u0dcf\u0dbb\u0d9a\u0dba\u0db1\u0dca 60,122 \u0daf\u0dd9\u0db1\u0dd9\u0d9a\u0dd4 \u0db8\u0dd9\u0dbb\u0da7\u0da7 \u0db4\u0dd0\u0db8\u0dd2\u0dab \u0dad\u0dd2\u0db6\u0dda.\n\u0d92 \u0d85\u0db1\u0dd4\u0dc0 \u0db8\u0dda \u0dc0\u0dc3\u0dbb\u0dda \u0d9c\u0dad \u0dc0\u0dd6 \u0d9a\u0dcf\u0dbd\u0dba \u0dad\u0dd4\u0dc5 \u0dc3\u0d82\u0da0\u0dcf\u0dbb\u0d9a\u0dba\u0db1\u0dca 268\u200d,375 \u0daf\u0dd9\u0db1\u0dd9\u0d9a\u0dd4 \u0daf\u0dd2\u0dc0\u0dba\u0dd2\u0db1\u0da7 \u0db4\u0dd0\u0db8\u0dd2\u0dab \u0d87\u0dad\u0dd2 \u0db6\u0dc0 \u0dc3\u0d82\u0da0\u0dcf\u0dbb\u0d9a 
\u0dc3\u0d82\u0dc0\u0dbb\u0dca\u0db0\u0db1 \u0d85\u0db0\u0dd2\u0d9a\u0dcf\u0dbb\u0dd2\u0dba \u0dc3\u0db3\u0dc4\u0db1\u0dca \u0d9a\u0dbb\u0dba\u0dd2.\n\u0d89\u0db1\u0dca \u0dc0\u0dd0\u0da9\u0dd2 \u0db8 \u0dc3\u0d82\u0da0\u0dcf\u0dbb\u0d9a\u0dba\u0db1\u0dca \u0db4\u0dd2\u0dbb\u0dd2\u0dc3\u0d9a\u0dca \u0d89\u0db1\u0dca\u0daf\u0dd2\u0dba\u0dcf\u0dc0\u0dd9\u0db1\u0dca \u0db4\u0dd0\u0db8\u0dd2\u0dab \u0d87\u0dad\u0dd2 \u0d85\u0dad\u0dbb, \u0d91\u0db8 \u0dc3\u0d82\u0d9b\u0dca\u200d\u0dba\u0dcf\u0dc0 42,768\u0d9a\u0dd2.\n\u0d8a\u0da7 \u0d85\u0db8\u0dad\u0dbb \u0dc0 \u0dbb\u0dd4\u0dc3\u0dd2\u0dba\u0dcf\u0dc0\u0dd9\u0db1\u0dca \u0dc3\u0d82\u0da0\u0dcf\u0dbb\u0d9a\u0dba\u0db1\u0dca 39,914\u0d9a\u0dca, \u0db6\u0dca\u200d\u0dbb\u0dd2\u0dad\u0dcf\u0db1\u0dca\u200d\u0dba\u0dba\u0dd9\u0db1\u0dca 22,278\u0d9a\u0dca \u0dc3\u0dc4 \u0da2\u0dbb\u0dca\u0db8\u0db1\u0dd2\u0dba\u0dd9\u0db1\u0dca \u0dc3\u0d82\u0da0\u0dcf\u0dbb\u0d9a\u0dba\u0db1\u0dca 18,016 \u0daf\u0dd9\u0db1\u0dd9\u0d9a\u0dd4 \u0db4\u0dd0\u0db8\u0dd2\u0dab \u0d87\u0dad\u0dd2 \u0db6\u0dc0 \u0dc0\u0dcf\u0dbb\u0dca\u0dad\u0dcf \u0dc0\u0dda.\"\"\"\n\n# Initialize and train the tokenizer\ntokenizer = Tokenizer()\ntokenizer.train([corpus])\n\n# Encode text into tokens\nencoding = tokenizer(\"\u0db8\u0dda \u0d85\u0dad\u0dbb, \u0db4\u0dd9\u0db6\u0dbb\u0dc0\u0dcf\u0dbb\u0dd2 \u0db8\u0dcf\u0dc3\u0dba\u0dda \u0db4\u0dc5\u0db8\u0dd4\")\n\n# List tokens\ntokens = [tokenizer.token_id_to_token_map[id] for id in encoding]\nprint(tokens)\n# Output: ['\u0db8\u0dda', ' ', '\u0d85', '\u0dad', '\u0dbb', ',', ' ', '\u0db4\u0dd9', '\u0db6', '\u0dbb', '\u0dc0\u0dcf', '\u0dbb\u0dd2', ' ', '\u0db8\u0dcf', '\u0dc3', '\u0dba\u0dda', ' ', '\u0db4', '\u0dc5', '\u0db8\u0dd4']\n```\n\n### Preprocessor\n\nAnalyze Sinhala character ratio in text:\n\n```python\nfrom sinlib.preprocessing import get_sinhala_character_ratio\n\n# Sample sentences with varying Sinhala content\nsentences = [\n '\u0db8\u0dd9\u0dba \u0dc3\u0dd2\u0d82\u0dc4\u0dbd 
\u0dc0\u0dcf\u0d9a\u0dca\u200d\u0dba\u0d9a\u0dca', # Full Sinhala\n '\u0db8\u0dd9\u0dba \u0dc3\u0dd2\u0d82\u0dc4\u0dbd \u0dc0\u0dcf\u0d9a\u0dca\u200d\u0dba\u0d9a\u0dca \u0dc3\u0db8\u0d9c english character \u0d9a\u0dd3\u0db4\u0dba\u0d9a\u0dca', # Mixed Sinhala and English\n 'This is a complete English sentence' # Full English\n]\n\n# Calculate Sinhala character ratio for each sentence\nratios = get_sinhala_character_ratio(sentences)\nprint(ratios)\n# Output: [0.9, 0.46875, 0.0]\n```\n\n### Spell Checker (beta)\n\nDetect typos and get spelling suggestions for Sinhala words using n gram models:\n```python \nfrom sinlib.spellcheck import TypoDetector\n\n# Initialize the typo detector\ntypo_detector = TypoDetector()\n\n# Check spelling of a word\nresult = typo_detector(\"\u0d85\u0da9\u0dd2\u0dbb\u0dcf\u0da2\u0dba\u0dcf\u0d9c\u0dda\")\nprint(result) # ['\u0d85\u0db0\u0dd2\u0dbb\u0dcf\u0da2\u0dba\u0dcf\u0d9c\u0dda', '\u0d85\u0db0\u0dd2\u0dbb\u0dcf\u0da2\u0dca\\u200d\u0dba\u0dba\u0dcf\u0d9c\u0dda', '\u0d85\u0db0\u0dd2\u0dbb\u0dcf\u0da2\u0dba\u0dcf']\n# Output: Either the word itself if correct, or a list of suggestions if it's a potential typo\n```\n\n### Romanizer\n\nConvert Sinhala text to Roman characters:\n\n```python\nfrom sinlib import Romanizer\n\n# Sample texts with Sinhala content\ntexts = [\n \"hello, \u0db8\u0dda \u0db8\u0dcf\u0dc3\u0dba\u0dda \u0d9c\u0dad \u0dc0\u0dd6 \u0daf\u0dd2\u0db1 15\u0d9a \u0d9a\u0dcf\u0dbd\u0dba \u0dad\u0dd4\u0dc5 \u0d9a\u0ddc\u0dc5\u0db9 \u0db1\u0d9c\u0dbb\u0dba \u0d86\u0dc1\u0dca\u200d\u0dbb\u0dd2\u0dad \u0dc0\",\n \"\u0db8\u0dd1\u0dad\u0d9a\u0dcf\u0dbd\u0dd3\u0db1 \u0dc0 \u0dbb\u0da7 \u0db8\u0dd4\u0dc4\u0dd4\u0dab \u0daf\u0dd4\u0db1\u0dca \u0d85\u0db7\u0dd2\u0dba\u0ddd\u0d9c\u0dcf\u0dad\u0dca\u0db8\u0d9a\u0db8 \u0d86\u0dbb\u0dca\u0dae\u0dd2\u0d9a \u0d9a\u0dcf\u0dbb\u0dab\u0dcf\u0dc0 \u0dab\u0dba \u0db4\u0dca\u200d\u0dbb\u0dad\u0dd2\u0dc0\u0dca\u200d\u0dba\u0dd4\u0d9c\u0dad\u0d9a\u0dbb\u0dab\u0dba \u0db6\u0dc0\"\n]\n\n# Initialize the 
romanizer\nromanizer = Romanizer(char_mapper_fp=None, tokenizer_vocab_path=None)\n\n# Romanize the texts\nromanized_texts = romanizer(texts)\nprint(romanized_texts)\n# Output:\n# ['hello, me masaye gatha wu dina 15ka kalaya thula kolaba nagaraya ashritha wa',\n# 'methakaleena wa rata muhuna dun abhiyogathmakama arthika karanawa naya prathiwyugathakaranaya bawa']\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add some amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. Open a Pull Request\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Acknowledgements\n\n- Thanks to all contributors who have helped with the development of Sinlib\n- Special thanks to the Sinhala NLP community for their support and feedback\n",
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) [2024] [Ransaka Ravihara]\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.",
"summary": "Sinhala NLP Toolkit",
"version": "0.1.10",
"project_urls": {
"Code": "https://github.com/Ransaka/sinlib",
"Docs": "https://github.com/Ransaka/sinlib"
},
"split_keywords": [
"nlp",
" sinhala",
" python"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "c84bf23d5d3ebc7434bdc0812d3970c32818579f8d16e8bbbba969aab21019bc",
"md5": "668f3326dfa4e6c0bbb61b32e463cb43",
"sha256": "26910a576dc2dff33837835b42778a1863d4b483ea7e347bfeb34e468cfa4f3b"
},
"downloads": -1,
"filename": "sinlib-0.1.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "668f3326dfa4e6c0bbb61b32e463cb43",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<=3.12,>=3.9.7",
"size": 4221234,
"upload_time": "2025-08-10T04:54:57",
"upload_time_iso_8601": "2025-08-10T04:54:57.141580Z",
"url": "https://files.pythonhosted.org/packages/c8/4b/f23d5d3ebc7434bdc0812d3970c32818579f8d16e8bbbba969aab21019bc/sinlib-0.1.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "88ca23c65f7128e3227d27a3c34198b59e8a196f8e92e02937ad4a005bb2e181",
"md5": "37ec2a9134767282c72de2329306a988",
"sha256": "2bdd241486d773f0e8046f5d0207a99a23a29e6109bcff1b21b05c2e142fd67f"
},
"downloads": -1,
"filename": "sinlib-0.1.10.tar.gz",
"has_sig": false,
"md5_digest": "37ec2a9134767282c72de2329306a988",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<=3.12,>=3.9.7",
"size": 4367153,
"upload_time": "2025-08-10T04:54:58",
"upload_time_iso_8601": "2025-08-10T04:54:58.485015Z",
"url": "https://files.pythonhosted.org/packages/88/ca/23c65f7128e3227d27a3c34198b59e8a196f8e92e02937ad4a005bb2e181/sinlib-0.1.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-10 04:54:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Ransaka",
"github_project": "sinlib",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "numpy",
"specs": [
[
">=",
"1.24.0"
]
]
},
{
"name": "torch",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "tqdm",
"specs": [
[
">=",
"4.64.1"
]
]
},
{
"name": "huggingface_hub",
"specs": [
[
"==",
"0.26.2"
]
]
}
],
"lcname": "sinlib"
}
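The `sha256` values in the raw data can be used to check that a downloaded archive is intact. The following is a minimal sketch using only the standard library; the helper names `sha256_hexdigest` and `matches_digest` and the local file path in the comment are assumptions for illustration, while the digest is the sdist value listed above.

```python
import hashlib
import io

# sha256 of sinlib-0.1.10.tar.gz, copied from the raw data above
EXPECTED_SHA256 = "2bdd241486d773f0e8046f5d0207a99a23a29e6109bcff1b21b05c2e142fd67f"


def sha256_hexdigest(stream, chunk_size=65536):
    """Hash a binary stream in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_digest(stream, expected_hex):
    """Return True if the stream's sha256 equals the expected hex digest."""
    return sha256_hexdigest(stream) == expected_hex.lower()


# Self-contained demo on in-memory bytes; for a real download, open the
# file in binary mode instead (hypothetical path):
#   with open("sinlib-0.1.10.tar.gz", "rb") as fh:
#       print(matches_digest(fh, EXPECTED_SHA256))
print(matches_digest(io.BytesIO(b"not the archive"), EXPECTED_SHA256))  # False
```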