# AsoSoft Library (Python)
AsoSoft Library offers basic natural language processing (NLP) algorithms for the Kurdish Language (ckb: Central branch of Kurdish).
AsoSoft Library is originally written in C# and this library is the Python port.
- **Grapheme-to-Phoneme (G2P) converter and Transliteration**: converts Kurdish text into a syllabified phoneme string. Also transliterates Kurdish text from Arabic script into Latin script and vice versa.
- **Normalizer**: normalizes Kurdish text and punctuation marks, unifies numerals, replaces HTML entities, extracts and replaces URLs and emails, and more.
- **Numeral Converter**: converts any type of number into Kurdish words.
- **Sort**: sorts a list in the correct Kurdish alphabetical order.
- **Poem Meter Classifier**: classifies the meter of the input Kurdish poem.
## How to use?
- **Python version**: Python 3.11+
- **Install package using pip**: [pip install asosoft](https://pypi.org/project/asosoft/)
- **Import package in your py file**:
```python
import asosoft
```
## Development
AsoSoft Library is developed and maintained by Aso Mahmudi.
The original AsoSoft Library is written in C# (.NET 6); this package is its Python port.
## Grapheme-to-Phoneme (G2P) converter and Transliteration
This function is based on the study "[Automated Grapheme-to-Phoneme Conversion for Central Kurdish based on Optimality Theory](https://www.sciencedirect.com/science/article/abs/pii/S0885230821000292)".
### Kurdish G2P converter
Converts Central Kurdish text in standard Arabic script into **syllabified phonemic** Latin script (i.e. graphemes to phonemes).
General format:
```python
asosoft.KurdishG2P(text, convertNumbersToWord = False, backMergeConjunction = True, singleOutputPerWord = True)
```
An example:
```python
>>> print(asosoft.KurdishG2P("شەو و ڕۆژ بووین بە گرفت. درێژیی دیوارەکەی گرتن"))
ˈşeˈwû ˈřoj ˈbûyn ˈbe ˈgiˈrift. ˈdiˈrêˈjîy ˈdîˈwaˈreˈkey ˈgirˈtin
```
### Transliteration
Arabic script into Hawar Latin script (حغڕڵ→ḧẍřł):
```python
>>> print(asosoft.Ar2La("گیرۆدەی خاڵی ڕەشتە؛ گوێت لە نەغمەی تویوورە؟"))
gîrodey xałî řeşte; gwêt le neẍmey tuyûre?
```
Arabic script into the Latin script suggested by Dr. Feryad Fazil Omar:
```python
>>> print(asosoft.Ar2LaF("گیرۆدەی خاڵی ڕەشتە؛ گوێت لە نەغمەی تویوورە؟"))
gîrodey xaḻî ṟeşte; gwêt le nex̱mey tuyûre?
```
Arabic script into simplified (حغڕڵ→hxrl) Hawar Latin script:
```python
>>> print(asosoft.Ar2LaSimple("گیرۆدەی خاڵی ڕەشتە؛ گوێت لە نەغمەی تویوورە؟"))
gîrodey xalî reşte; gwêt le nexmey tuyûre?
```
Latin script (Hawar) into Arabic script:
```python
>>> print(asosoft.La2Ar("Gelî keç û xortên kurdan, hûn hemû bi xêr biçin"))
گەلی کەچ و خۆرتێن کوردان، هوون هەموو ب خێر بچن
```
Arabic script into IPA:
```python
>>> print(asosoft.Phonemes2IPA(asosoft.KurdishG2P("شەو و ڕۆژ بووین بە گرفت. درێژیی دیوارەکە گرتن")))
ʃa·wu ro̞ʒ bujn ba gɪ·ɾɪft. dɪ·ɾɛ·ʒij di·wä·ɾa·ka gɪɾ·tɪn
```
## Kurdish Text Normalizer
Several functions needed for Central Kurdish text normalization:
### Normalize Kurdish
Two character replacement lists are provided as the resources of the library:
- Deep Unicode Corrections:
  - replacing deprecated Arabic Presentation Forms (FB50–FDFF and FE70–FEFF) with the corresponding standard characters
  - replacing different types of dashes and spaces
  - removing Unicode control characters
- Additional Unicode Corrections:
  - replacing special Arabic math signs with the corresponding Latin characters
  - replacing similar but different letters with standard characters (e.g. ڪ,ے,ٶ with ک,ی,ؤ)
The normalization task in this function:
- for all Arabic scripts (including Kurdish, Arabic, and Persian):
  - character-based replacement:
    - the replacement lists mentioned above
    - Private Use Area characters (U+E000 to U+F8FF) are replaced with the White Square character
  - standardizing and removing duplicated or unnecessary zero-width characters
  - removing unnecessary Tatweels (U+0640)
- only for Central Kurdish:
  - standardizing Kurdish characters: ە, هـ, ی, and ک
  - correcting mis-converted characters from non-Unicode fonts
  - replacing word-initial ر with ڕ
The simple form, using the default options:
```python
>>> print(asosoft.Normalize("دەقے شیَعري خـــۆش. رهنگهكاني خاك"))
دەقی شێعری خۆش. ڕەنگەکانی خاک
```
or the complete form, with every option spelled out:
```python
>>> asosoft.Normalize(text, isOnlyKurdish=True, changeInitialR=True, deepUnicodeCorrectios=True, additionalUnicodeCorrections=True, usersReplaceList=usersReplaceList)
```
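A few of the character-level steps listed above can be sketched with the standard `re` module. This is only an illustration, not the library's implementation: the function name `sketch_normalize` is hypothetical, it removes every Tatweel rather than only the unnecessary ones, and it covers three steps out of the full list.

```python
import re

TATWEEL = "\u0640"        # Arabic Tatweel (kashida)
WHITE_SQUARE = "\u25a1"   # placeholder for Private Use Area characters


def sketch_normalize(text: str) -> str:
    """Illustrative subset of the normalization steps (hypothetical helper)."""
    # Replace Private Use Area characters (U+E000 to U+F8FF) with White Square.
    text = re.sub("[\ue000-\uf8ff]", WHITE_SQUARE, text)
    # Remove Tatweels. Simplification: this drops all of them.
    text = text.replace(TATWEEL, "")
    # Replace word-initial REH (U+0631) with Kurdish REH WITH SMALL V BELOW (U+0695).
    text = re.sub(r"\b\u0631", "\u0695", text)
    return text


print(sketch_normalize("خـــۆش"))
```

The `\b` anchor relies on Python treating Arabic-script letters as word characters, so a ر in the middle of a word is left alone.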
### AliK to Unicode
`AliK2Unicode` converts Kurdish text written in AliK fonts (developed by Abas Majid in 1997) into Unicode standard. Ali-K fonts: *Alwand, Azzam, Hasan, Jiddah, kanaqen, Khalid, Sahifa, Sahifa Bold, Samik, Sayid, Sharif, Shrif Bold, Sulaimania, Traditional*
```python
>>> print(asosoft.AliK2Unicode("ئاشناكردنى خويَندكار بة طوَرِانكاريية كوَمةلاَيةتييةكان"))
ئاشناکردنی خوێندکار بە گۆڕانکارییە کۆمەڵایەتییەکان
```
### AliWeb to Unicode
`AliWeb2Unicode` converts Kurdish text written in AliWeb fonts into Unicode standard. Ali-Web fonts: *Malper, Malper Bold, Samik, Traditional, Traditional Bold*
```python
>>> print(asosoft.AliWeb2Unicode("هةر جةرةيانصکي مصذووُيي کة أوو دةدا"))
ھەر جەرەیانێکی مێژوویی کە ڕوو دەدا
```
### Dylan to Unicode
`Dylan2Unicode` converts Kurdish text written in Dylan fonts (developed by Dylan Saleh at [KurdSoft](https://web.archive.org/web/20020528231610/http://www.kurdsoft.com/) in 2001) into Unicode standard.
```python
>>> print(asosoft.Dylan2Unicode("لثكؤلثنةران بؤيان دةركةوتووة كة دةتوانث بؤ لةش بةكةصك بث"))
لێکۆلێنەران بۆیان دەرکەوتووە کە دەتوانێ بۆ لەش بەکەڵک بێ
```
### Zarnegar to Unicode
`Zarnegar2Unicode` converts Kurdish text written in the Zarnegar word processor (developed by [SinaSoft](http://www.sinasoft.com/fa/zarnegar.html), with an RDF converter by [NoorSoft](https://www.noorsoft.org/fa/software/view/6561)) into Unicode standard.
```python
>>> print(asosoft.Zarnegar2Unicode("بلٌيٌين و بگهرٍيٌين بوٌ ههلاٌلٌهى سىٌيهمى فهلسهفه"))
بڵێین و بگەڕێین بۆ هەڵاڵەی سێیەمی فەلسەفە
```
### NormalizePunctuations
`NormalizePunctuations` corrects the spacing before and after punctuation marks. Its second argument is `seprateAllPunctuations`; the example below passes `False`.
```python
>>> print(asosoft.NormalizePunctuations("دەقی«کوردی » و ڕێنووس ،((خاڵبەندی )) چۆنە ؟", False))
دەقی «کوردی» و ڕێنووس، «خاڵبەندی» چۆنە؟
```
### Trim Line
`TrimLine` trims leading and trailing white space (including zero-width spaces) from a line.
```python
>>> print(asosoft.TrimLine(" دەق\u200c "))
دەق
```
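The same idea can be sketched with `str.strip` and an explicit character set. The helper name and the exact set of zero-width characters below are assumptions for illustration; the set the library actually strips is not listed here.

```python
# Zero-width space, non-joiner, joiner, and byte-order mark (assumed set).
ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"


def sketch_trim_line(line: str) -> str:
    """Strip ordinary and zero-width white space from both ends (hypothetical helper)."""
    return line.strip(" \t\r\n" + ZERO_WIDTH)


print(sketch_trim_line(" دەق\u200c "))
```

Because `str.strip` removes any mix of the listed characters, interleaved spaces and zero-width characters at either end are handled in one call.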
### Replace Html Entities
`ReplaceHtmlEntity` replaces HTML entities with single Unicode characters (e.g. `&eacute;` with "é"). It is useful for web-crawled corpora.
```python
>>> print(asosoft.ReplaceHtmlEntity("ئێوە &quot;دەق&quot; لە زمانی &lt;کوردی&gt; دەنووسن"))
ئێوە "دەق" لە زمانی <کوردی> دەنووسن
```
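For comparison, Python's standard library performs the same kind of decoding with `html.unescape`. Whether `ReplaceHtmlEntity` uses it internally is not stated here, so treat this only as a way to see what entity replacement does.

```python
import html

# Named entities such as &quot; and &lt; are decoded to the characters they stand for.
raw = "ئێوە &quot;دەق&quot; لە زمانی &lt;کوردی&gt; دەنووسن"
decoded = html.unescape(raw)
print(decoded)
```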
### Replace URLs and emails
`ReplaceUrlEmail` replaces URLs and emails with a certain word. It improves language models.
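A minimal sketch of the idea, using regular expressions. The patterns, the helper name, and the placeholder words `URL` and `EMAIL` are all assumptions for illustration; the library's own patterns and replacement word may differ.

```python
import re

# Deliberately simple patterns; real-world URL and email matching is messier.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def sketch_replace_url_email(text: str, url_token: str = "URL",
                             email_token: str = "EMAIL") -> str:
    """Replace emails first, then URLs, with fixed placeholder words."""
    text = EMAIL_RE.sub(email_token, text)
    return URL_RE.sub(url_token, text)


print(sketch_replace_url_email("mail a.b@example.com or see https://example.org/x"))
```

Collapsing every address into one token keeps a language model's vocabulary from filling up with strings that each occur only once.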
### Unify Numerals
`UnifyNumerals` unifies numeral characters into desired numeral type from `en` (0123456789) or `ar` (٠١٢٣٤٥٦٧٨٩)
```python
>>> print(asosoft.UnifyNumerals("ژمارەکانی ٤٥٦ و ۴۵۶ و 456", "en"))
ژمارەکانی 456 و 456 و 456
```
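Digit unification is a one-to-one character mapping, so it can be sketched with `str.translate`. The helper below is hypothetical and only mirrors the `en`/`ar` choice described above; it is not the library's code.

```python
EN = "0123456789"
AR = "".join(chr(0x0660 + i) for i in range(10))  # Arabic-Indic digits
FA = "".join(chr(0x06F0 + i) for i in range(10))  # Extended Arabic-Indic digits


def sketch_unify_numerals(text: str, target: str = "en") -> str:
    """Map every digit variant to the requested set (hypothetical helper)."""
    dest = EN if target == "en" else AR
    table = {}
    for source in (EN, AR, FA):
        for src_digit, dst_digit in zip(source, dest):
            table[ord(src_digit)] = dst_digit
    return text.translate(table)


print(sketch_unify_numerals("٤٥٦ و ۴۵۶ و 456", "en"))
```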
### Separate Digits from words
`SeperateDigits` adds a space between joined numerals and words (e.g. replacing "12کەس" with "12 کەس"). It improves language models.
```python
>>> print(asosoft.SeperateDigits("لە ساڵی1950دا1000دۆلاریان بە 5کەس دا"))
لە ساڵی 1950 دا 1000 دۆلاریان بە 5 کەس دا
```
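The behavior shown above can be approximated with two zero-width regular-expression substitutions, one for each direction of the digit/letter boundary. This is a sketch under that assumption, with a hypothetical helper name, and not the library's implementation.

```python
import re

# [^\W\d_] matches a word character that is neither a digit nor an underscore,
# i.e. a letter in any script.
DIGIT_THEN_LETTER = re.compile(r"(?<=\d)(?=[^\W\d_])")
LETTER_THEN_DIGIT = re.compile(r"(?<=[^\W\d_])(?=\d)")


def sketch_separate_digits(text: str) -> str:
    """Insert a space wherever a digit touches a letter (hypothetical helper)."""
    text = DIGIT_THEN_LETTER.sub(" ", text)
    return LETTER_THEN_DIGIT.sub(" ", text)


print(sketch_separate_digits("لە ساڵی1950دا1000دۆلاریان بە 5کەس دا"))
```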
### Word to Word Replacement
`Word2WordReplacement` applies a "string to string" replacement dictionary to the text. It replaces only whole words that match an entry, never part of a longer word.
```python
>>> replacements = {"مال": "ماڵ", "سلاو": "سڵاو"}
>>> print(asosoft.Word2WordReplacement("مال، نووری مالیکی", replacements))
ماڵ، نووری مالیکی
```
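Whole-word matching of this kind can be sketched with `\b` word boundaries. The helper below is hypothetical and shows only the general technique; how the library itself detects word boundaries is not described here.

```python
import re


def sketch_word_replace(text: str, replacements: dict) -> str:
    """Replace whole words only, using a {old: new} dictionary (hypothetical helper)."""
    if not replacements:
        return text
    # Build one alternation of all keys, anchored by word boundaries on both sides.
    alternatives = "|".join(re.escape(word) for word in replacements)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub(lambda match: replacements[match.group(0)], text)


print(sketch_word_replace("مال، نووری مالیکی", {"مال": "ماڵ", "سلاو": "سڵاو"}))
```

One caveat of this sketch: Python's `\b` treats the zero-width non-joiner (U+200C) as a non-word character, so a word containing it is seen as two words.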
### Character to Character Replacement
`Char2CharReplacment` applies a "char to char" replacement dictionary to the text. It is used as the final step needed for some non-Unicode systems.
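A character-to-character dictionary maps naturally onto `str.maketrans` and `str.translate`. The helper name and the two-entry mapping below (Arabic kaf and yeh to their Kurdish forms) are illustrative assumptions, not the library's data.

```python
def sketch_char_replace(text: str, mapping: dict) -> str:
    """Apply a {single_char: replacement} dictionary in one pass (hypothetical helper)."""
    return text.translate(str.maketrans(mapping))


# Illustrative mapping: Arabic KAF and YEH to the forms used in Kurdish text.
char_map = {"\u0643": "\u06a9", "\u064a": "\u06cc"}
print(sketch_char_replace("\u0643\u0627\u0646\u064a", char_map))
```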
## Kurdish Numeral converter
It converts numerals into Central Kurdish words. It is useful in text-to-speech tools.
- integers (1100 => )
- floats (10.11)
- negatives (-10.11)
- percent (100% or %100)
- currency marks ($100, £100, and €100)
```python
>>> print(asosoft.Number2Word("لە ساڵی 1999دا بڕی 40% لە پارەکەیان واتە $102.1 یان وەرگرت"))
لە ساڵی هەزار و نۆسەد و نەوەد و نۆدا بڕی چل لە سەد لە پارەکەیان واتە سەد و دوو پۆینت یەک دۆلاریان وەرگرت
```
## Kurdish Sort
Sorts a list of strings in the correct order of the Kurdish alphabet ("ئءاآأإبپتثجچحخدڎذرڕزژسشصضطظعغفڤقكکگلڵمنوۆۊۉهھەیێ").
```python
>>> myList = ["یەک", "ڕەنگ", "ئەو", "ئاو", "ڤەژین", "فڵان"]
>>> print(asosoft.KurdishSort(myList))
['ئاو', 'ئەو', 'ڕەنگ', 'فڵان', 'ڤەژین', 'یەک']
```
or using your custom order:
```python
>>> inputList = ["یەک", "ڕەنگ", "ئەو", "ئاو", "ڤەژین", "فڵان"]
>>> inputOrder = list("ئءاآأإبپتثجچحخدڎڊذرڕزژسشصضطظعغفڤقكکگڴلڵمنوۆۊۉۋهھەیێ")  # one entry per character
>>> print(asosoft.CustomSort(inputList, inputOrder))
```
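Sorting by a custom alphabet comes down to ranking each character by its position in the order string. The sketch below shows that technique with a hypothetical helper; it is not the library's implementation, and its handling of characters missing from the order (they sort after all listed characters) is an assumption.

```python
def sketch_custom_sort(words, order):
    """Sort words by the position of each character in `order` (hypothetical helper)."""
    rank = {}
    for index, char in enumerate(order):
        rank.setdefault(char, index)  # keep the first position of a repeated character

    def sort_key(word):
        # Unlisted characters sort after every listed one, by code point.
        return [rank.get(char, len(order) + ord(char)) for char in word]

    return sorted(words, key=sort_key)


KURDISH_ORDER = "ئءاآأإبپتثجچحخدڎذرڕزژسشصضطظعغفڤقكکگلڵمنوۆۊۉهھەیێ"
sorted_words = sketch_custom_sort(["یەک", "ڕەنگ", "ئەو", "ئاو", "ڤەژین", "فڵان"], KURDISH_ORDER)
print(sorted_words)
```

Plain `sorted()` would compare code points instead, which does not follow the Kurdish alphabet (for example, ڕ has a higher code point than ف but comes earlier in the alphabet).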
## Poem Meter Classifier
It classifies the meter of the input Kurdish poem typed in Arabic script. The lines of the poem should be separated by the newline character ('\n').
You can find Kurdish poems at https://books.vejin.net/.
```python
>>> poem = "گەرچی تووشی ڕەنجەڕۆیی و حەسرەت و دەردم ئەمن\nقەت لەدەس ئەم چەرخە سپڵە نابەزم مەردم ئەمن\nئاشقی چاوی کەژاڵ و گەردنی پڕ \nخاڵ نیم\nئاشقی کێو و تەلان و بەندەن و بەردم ئەمن"
>>> classified = asosoft.ClassifyKurdishPoem(poem)
>>> print("Poem Type= " + classified.overalMeterType)
>>> print("Poem Meter= " + classified.overalPattern)
```