Amharic Segmenter and Tokenizer
-------------------------------
This is a simple script that splits an Amharic document into
sentences and tokens. If you find an issue, please report it on the
GitHub `Issues <https://github.com/uhh-lt/amharicprocessor/issues>`__ page.

The segmenter is part of the ``Semantic Models for Amharic`` project.
|image0|
Usage
-------
Install the segmenter: ``pip install amseg``
Tokenization and Segmentation
-------------------------------
Use the following code for sentence segmentation and word tokenization:

::

    from amseg.amharicSegmenter import AmharicSegmenter

    # Punctuation lists passed to the segmenter (left empty in this example)
    sent_punct = []
    word_punct = []
    segmenter = AmharicSegmenter(sent_punct, word_punct)
    words = segmenter.amharic_tokenizer("እአበበ በሶ በላ።")
    sentences = segmenter.tokenize_sentence("እአበበ በሶ በላ። ከበደ ጆንያ፤ ተሸከመ፡!ለምን?")

Outputs

    words = ['እአበበ', 'በሶ', 'በላ', '።']

    sentences = ['እአበበ በሶ በላ።', 'ከበደ ጆንያ፤ ተሸከመ፡!', 'ለምን?']
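To make the expected behaviour concrete, here is a minimal standard-library sketch of punctuation-based splitting. It is not amseg's implementation: the punctuation sets and the function names ``split_sentences_sketch`` and ``tokenize_words_sketch`` are assumptions chosen only so that the example input above yields the same lists.

```python
import re

# Assumed punctuation sets (not amseg's built-in lists), written as escapes
# so the code points are unambiguous.
SENT_END = "\u1362!?"                # Ethiopic full stop, '!' and '?'
WORD_PUNCT = "\u1361\u1362\u1364!?"  # Ethiopic wordspace, full stop, semicolon, '!' and '?'

# A sentence is a run of non-terminators followed by terminators (or end of text).
_SENT_RE = re.compile(r"[^{0}]+(?:[{0}]+|$)".format(re.escape(SENT_END)))
# A token is either a run of non-space, non-punctuation characters
# or a single punctuation mark.
_WORD_RE = re.compile(r"[^\s{0}]+|[{0}]".format(re.escape(WORD_PUNCT)))


def split_sentences_sketch(text):
    """Split text after each run of sentence-final punctuation."""
    pieces = (match.group().strip() for match in _SENT_RE.finditer(text))
    return [piece for piece in pieces if piece]


def tokenize_words_sketch(text):
    """Split text on whitespace and emit punctuation marks as separate tokens."""
    return _WORD_RE.findall(text)


print(tokenize_words_sketch("እአበበ በሶ በላ።"))
print(split_sentences_sketch("እአበበ በሶ በላ። ከበደ ጆንያ፤ ተሸከመ፡!ለምን?"))
```

Note that the Ethiopic semicolon is treated as word-level punctuation only, which is why the second sentence in the example is not split at that mark.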
Romanization and Normalization
-------------------------------
The following code shows how to normalize and romanize a given Amharic text:

::

    from amseg.amharicNormalizer import AmharicNormalizer as normalizer
    from amseg.amharicRomanizer import AmharicRomanizer as romanizer

    normalized = normalizer.normalize('ሑለት ሦስት')
    romanized = romanizer.romanize('ሑለት ሦስት')

Outputs

    normalized = 'ሁለት ሶስት'

    romanized = 'ḥulat śosət'
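The normalization output above replaces variant characters with a canonical character that has the same sound. A minimal sketch of that idea, assuming a plain character-to-character table, could look like the following. The table holds only the two substitutions visible in the example, and ``normalize_sketch`` is a hypothetical name; the mapping amseg actually uses is not shown here.

```python
# Only the two substitutions visible in the documented example are listed;
# this table is an illustration, not the one shipped with amseg.
HOMOPHONE_TABLE = str.maketrans({
    "\u1211": "\u1201",  # ሑ -> ሁ
    "\u1226": "\u1236",  # ሦ -> ሶ
})


def normalize_sketch(text):
    """Replace each variant character with its assumed canonical form."""
    return text.translate(HOMOPHONE_TABLE)


print(normalize_sketch("ሑለት ሦስት"))
```

Characters that are not in the table pass through unchanged, so the function is safe to apply to mixed-script text.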
Transliteration to Amharic Fidel
---------------------------------
The following code shows how to transliterate a given Latin-script text into Amharic Fidel script:

::

    from amseg.amharicTranslitrator import AmharicTranslitrator as transliterator

    transliterated = transliterator.transliterate('misa belah')

Outputs

    transliterated = 'ሚሳ በላህ'
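One common way to map Latin text to a syllabic script is greedy longest-match lookup in a syllable table. The sketch below illustrates that technique under stated assumptions: the table contains only the five entries needed for the example above, and ``SYLLABLES`` and ``transliterate_sketch`` are hypothetical names. It should not be read as a description of how amseg works internally.

```python
# Hypothetical syllable table covering only the documented example.
SYLLABLES = {
    "mi": "\u121a",  # ሚ
    "sa": "\u1233",  # ሳ
    "be": "\u1260",  # በ
    "la": "\u120b",  # ላ
    "h": "\u1205",   # ህ
}

# Try longer keys first so "la" is preferred over a shorter key at the same position.
_KEYS = sorted(SYLLABLES, key=len, reverse=True)


def transliterate_sketch(text):
    """Greedy longest-match transliteration; unmapped characters pass through."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(SYLLABLES[key])
                i += len(key)
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate_sketch("misa belah"))
```

A real table would need an entry for every consonant and vowel combination, plus rules for ambiguous Latin spellings, which this sketch does not attempt.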
Publications
------------
To cite the Amharic segmenter/tokenizer tool, use the following
`paper <https://www.mdpi.com/1999-5903/13/11/275>`__
::

    @Article{fi13110275,
        AUTHOR = {Yimam, Seid Muhie and Ayele, Abinew Ali and Venkatesh, Gopalakrishnan and Gashaw, Ibrahim and Biemann, Chris},
        TITLE = {Introducing Various Semantic Models for Amharic: Experimentation and Evaluation with Multiple Tasks and Datasets},
        JOURNAL = {Future Internet},
        VOLUME = {13},
        YEAR = {2021},
        NUMBER = {11},
        ARTICLE-NUMBER = {275},
        URL = {https://www.mdpi.com/1999-5903/13/11/275},
        ISSN = {1999-5903},
        DOI = {10.3390/fi13110275}
    }

.. |image0| image:: https://github.com/uhh-lt/amharicmodels/raw/master/logo.png
   :target: https://github.com/uhh-lt/amharicmodels/
Raw data
{
"_id": null,
"home_page": "https://github.com/uhh-lt/amharicprocessor",
"name": "amseg",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "Amharic,Amharic sentence splitter,Amharic document normalizer,Amharic Latin to Fidel Transliteration",
"author": "Seid Muhie Yimam",
"author_email": "seid.muhie.yimam@uni-hamburg.de",
"download_url": "https://files.pythonhosted.org/packages/b1/66/4f9442687a9a0f7c3d7a7e0013e60b8d400ad758903690f6535f31490f75/amseg-2.3.tar.gz",
"platform": null,
"description": "Amharic Segmenter and tokenizer\n-------------------------------\n\nThis is a simple script that split an Amharic document into different\nsentences and tokenes. If you find an issue, please let us know in the\nGitHub `Issues <https://github.com/uhh-lt/amharicprocessor/issues>`__\n\nThe Segmenter is part of the ``Semantic Models for Amharic`` Project\n|image0|\n\nUsage \n-------\nInstall the segmenter: ``pip install amseg``\n\nTokenization and Segmentation\n-------------------------------\nUse the following code for sentence segmentation and word tokenization\n\n::\n\n from amseg.amharicSegmenter import AmharicSegmenter\n sent_punct = [] \n word_punct = [] \n segmenter = AmharicSegmenter(sent_punct,word_punct) \n words = segmenter.amharic_tokenizer(\"\u12a5\u12a0\u1260\u1260 \u1260\u1236 \u1260\u120b\u1362\") \n sentences = segmenter.tokenize_sentence(\"\u12a5\u12a0\u1260\u1260 \u1260\u1236 \u1260\u120b\u1362 \u12a8\u1260\u12f0 \u1306\u1295\u12eb\u1364 \u1270\u1238\u12a8\u1218\u1361!\u1208\u121d\u1295?\")\n\nOutputs\n\n words = ['\u12a5\u12a0\u1260\u1260', '\u1260\u1236', '\u1260\u120b', '\u1362']\n\n sentences = ['\u12a5\u12a0\u1260\u1260 \u1260\u1236 \u1260\u120b\u1362', '\u12a8\u1260\u12f0 \u1306\u1295\u12eb\u1364 \u1270\u1238\u12a8\u1218\u1361!', '\u1208\u121d\u1295?']\n\nRomanization and Normalization\n-------------------------------\nThe following code show cases how to normalize and romanize a given Amharic text\n\n::\n\n from amseg.amharicNormalizer import AmharicNormalizer as normalizer\n from amseg.amharicRomanizer import AmharicRomanizer as romanizer\n normalized = normalizer.normalize('\u1211\u1208\u1275 \u1226\u1235\u1275')\n romanized = romanizer.romanize('\u1211\u1208\u1275 \u1226\u1235\u1275')\n\nOutputs \n > normalized = '\u1201\u1208\u1275 \u1236\u1235\u1275' \n > romanized = '\u1e25ulat \u015bos\u0259t'\n\nTransliteration to Amharic Fidel\n---------------------------------\nThe following code show cases how to transliterate a 
given latin script text to Amahric Fidel script text\n\n::\n\n from amseg.amharicTranslitrator import AmharicTranslitrator as transliterator\n transliterated = transliterator.transliterate('misa belah')\n\nOutputs \n > transliterated = '\u121a\u1233 \u1260\u120b\u1205' \n\nPublications\n------------\n\nTo cite the Amharic segmenter/tokenizer tool, use the following\n`paper <https://www.mdpi.com/1999-5903/13/11/275>`__\n\n::\n\n @Article{fi13110275,\n AUTHOR = {Yimam, Seid Muhie and Ayele, Abinew Ali and Venkatesh, Gopalakrishnan and Gashaw, Ibrahim and Biemann, Chris},\n TITLE = {Introducing Various Semantic Models for Amharic: Experimentation and Evaluation with Multiple Tasks and Datasets},\n JOURNAL = {Future Internet},\n VOLUME = {13},\n YEAR = {2021},\n NUMBER = {11},\n ARTICLE-NUMBER = {275},\n URL = {https://www.mdpi.com/1999-5903/13/11/275},\n ISSN = {1999-5903},\n DOI = {10.3390/fi13110275}\n }\n\n.. |image0| image:: https://github.com/uhh-lt/amharicmodels/raw/master/logo.png\n :target: https://github.com/uhh-lt/amharicmodels/\n\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "This is an Amharic document segmentation and normalization tool",
"version": "2.3",
"project_urls": {
"Download": "https://github.com/uhh-lt/amharicprocessor/archive/rehistoryfs/tags/v_23.tar.gz",
"Homepage": "https://github.com/uhh-lt/amharicprocessor"
},
"split_keywords": [
"amharic",
"amharic sentence splitter",
"amharic document normalizer",
"amharic latin to fidel transliteration"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b1664f9442687a9a0f7c3d7a7e0013e60b8d400ad758903690f6535f31490f75",
"md5": "e638ffb03e91e5fe48bb634b6e0af4f8",
"sha256": "1d51cce1ca9b00b365b33fad5089e2660547be258750efbfb2e9658c53cff599"
},
"downloads": -1,
"filename": "amseg-2.3.tar.gz",
"has_sig": false,
"md5_digest": "e638ffb03e91e5fe48bb634b6e0af4f8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 11542,
"upload_time": "2023-05-03T19:29:00",
"upload_time_iso_8601": "2023-05-03T19:29:00.849769Z",
"url": "https://files.pythonhosted.org/packages/b1/66/4f9442687a9a0f7c3d7a7e0013e60b8d400ad758903690f6535f31490f75/amseg-2.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-03 19:29:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "uhh-lt",
"github_project": "amharicprocessor",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "amseg"
}