cpia


Name: cpia
Version: 2024.7.8
Summary: Contemporary Persian word analyzer
Upload time: 2024-07-07 22:24:55
Author: Davood Heidarpour
Keywords: persian, farsi, inflection, analyzer, informal, formal, lemmatizer, converter
Home page, maintainer, docs URL, required Python version, license: none recorded
Requirements: No requirements were recorded.
CI and coverage: no Travis CI, no Coveralls

Contemporary Persian Inflectional Analyzer
==========================================
[![PyPI version](https://img.shields.io/badge/pypi-v2024.7.8-blue)](https://pypi.org/project/cpia/)
[![calver YYYY.MM.DD](https://img.shields.io/badge/calver-YYYY.MM.DD-22bfda.svg)](http://calver.org/)

Analyze Informal and Formal words of contemporary Persian.

Install
-------
    pip install cpia

Usage
-----
```python
>>> from cpia import FarsiAnalyzer, Converter
>>> farsi = FarsiAnalyzer()

>>> farsi.inflect("کتاب‌هایشان")
['اسمعا=کتاب+جها+وشخصی۶+رسمی']

>>> farsi.inflect("بشینین")
['التزامی=نشین+ش۵', 'امری=نشین+ش۵']

>>> farsi.generate("امری=گو+مفرد+رسمی")
['بگو']

>>> print(farsi.generate('ف.ح.ا=خور+ش۱+ومفعولی۲')[0])
می‌‌خورمت

>>> farsi.lemmatize(farsi.inflect("میچرخوندمش")[0])
{'lemma': 'چرخوند',
 'pos': 'ف.م.ا',
 'register': 'غیررسمی',
 'long_pos': 'فعل ماضی استمراری'}

>>> converter = Converter(farsi)
>>> print(converter.convert("میچرخوندمش", "formal")[0])
می‌چرخاندم

```
To see what the abbreviations used in inflection rules mean, call `show_help()`:
```python
>>> farsi.show_help()
🔹  ف.م.ب 👈 فعل ماضی بعید*
🔹  ف.م.ال 👈 فعل ماضی التزامی*
🔹  ف.م.ا.ب 👈 فعل ماضی ابعد*
🔹  ف.آ 👈 فعل مستقبل (آینده)*
🔹  اسمعام 👈 اسم عام
          ...
```
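
The rule strings shown above appear to follow the pattern `POS=lemma+tag+tag...`. The sketch below splits such a string into its parts without needing cpia itself. `ParsedRule` and `parse_rule` are hypothetical helper names, not part of the cpia API, and the pattern is inferred only from the examples in this README.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedRule:
    pos: str                                   # abbreviation before '=' (e.g. 'امری')
    lemma: str                                 # first '+'-separated item after '='
    tags: list = field(default_factory=list)   # remaining '+'-separated items


def parse_rule(rule: str) -> ParsedRule:
    """Split a rule string of the assumed form 'POS=lemma+tag1+tag2'."""
    pos, sep, rest = rule.partition("=")
    if not sep or not pos or not rest:
        raise ValueError(f"not a rule string: {rule!r}")
    lemma, *tags = rest.split("+")
    return ParsedRule(pos=pos, lemma=lemma, tags=tags)


if __name__ == "__main__":
    # Rule string taken from the usage example above.
    print(parse_rule("امری=گو+مفرد+رسمی"))
```

For real work, `farsi.lemmatize()` is the documented way to get the lemma, part of speech, and register; this helper is only meant to make the rule layout easier to see.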
Besides the `standard` fst for inflection and the `generation` fst for generating words from rules, cpia has secondary fsts. The main fst is enough for almost all tasks. The secondary fsts can be used for noisy informal out-of-vocabulary (OOV) words, but they usually produce many useless inflections, so they are only useful for special cases. Use them only if you know what you want.
To use another fst, pass its name as the argument to the `FarsiAnalyzer` constructor:
```python
>>> farsi = FarsiAnalyzer("homophone")
```
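
One way to keep the secondary fsts for special cases only is to consult them when the main fst finds nothing. The following is a minimal sketch of that fallback. `inflect_with_fallback` is a hypothetical helper, and it assumes `inflect()` returns an empty list for a word it cannot analyze, which this README does not confirm.

```python
def inflect_with_fallback(word, analyzers):
    """Return (name, rules) from the first analyzer that yields any rule.

    `analyzers` is an ordered iterable of (name, analyzer) pairs. Each
    analyzer only needs an `inflect(word)` method that returns a list.
    """
    for name, analyzer in analyzers:
        rules = analyzer.inflect(word)
        if rules:  # assumption: an empty list means "no analysis found"
            return name, rules
    return None, []


# With cpia installed, the pairs could be built like this (not run here):
#   from cpia import FarsiAnalyzer
#   analyzers = [
#       ("standard", FarsiAnalyzer()),
#       ("homophone", FarsiAnalyzer("homophone")),
#   ]
#   name, rules = inflect_with_fallback("مسؤول", analyzers)
```

Putting the standard fst first keeps the noisier secondary output out of the result unless it is actually needed.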

Fsts
----

| Name                  |           word          |                                                         output |
|----------------------|:-----------------------:|---------------------------------------------------------------:|
| **standard**             |           **برم**           | **<اسمعام=بره+وشخصی۱><br><اسمعام=بر+هم><br><اسمعام=بر+وشخصی۱+رسمی><br><اسمعام=بر+وربطی۱+رسمی><br><اسمعام=برم+رسمی><br><حضاف=بر+وشخصی۱+رسمی><br><التزامی=ر+ش۱><br><امری=رم+مفرد+رسمی>** |

Secondary Fsts
--------------

| Name                  |           word          |                                                         output |
|----------------------|:-----------------------:|---------------------------------------------------------------:|
| **homophone**            | **مسؤول<br>مسئول<br>مسیول** |                                       **<اسمعام=مسئول+رسمی>** |
| **phone_change (avaee)**                |          **شیطون**          |                                            **<اسمعام=شیطان>** |
| **expressive**           | **چرااااااا** |                                         **<اسمعام=چرا+رسمی>** |
| **splitter**             |         **چهاربعدی**        |               **<شماره=چهار+رسمی><br><صفت=بعدی+رسمی>** |

<p align="center">
  <img src="https://github.com/lingwndr/cpia/blob/master/icon.png?raw=true" alt="تحلیلگر تصریفی فارسی معاصر" width="150"/>
</p>

Evaluation
----------

The analyzer is not aware of context, so its output should provide all possible inflections for all possible contexts. The evaluation dataset is in the `eval` folder. For the 1,786 unique words extracted from the dataset, the analyzer produced 3,704 inflection rules. The shortcomings below are counted by number of occurrences.

| register | OOV | OO-Rules | homophone / Ezafeh Const. | stuck-together words | phone changing | spelling error |
|:--------:|:---:|:--------:|:-------------------------:|:--------------:|:--------------:|:--------------:|
| informal |  40 |     4    |             3             |        3       |        5       |        8       |
|  formal  |  83 |     6    |             0             |        1       |        0       |       17       |

The **`recall`** metric for each fst is shown below:

| register / FST          | standard   | homophone | phone_change | expressive | splitter |
|-------------------------|------------|-----------|--------------|------------|----------|
| informal                | **96.33%** | 96.42%    | 97.3%        | 97.3%      | 97.48%   |
| formal                  | **95.1%**  | 95.1%     | 95.1%        | 95.1%      | 95.1%    |
| combined (Contemporary) | **95.56%** | 95.64%    | 96%          | 96%        | 96.08%   |
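
This README does not spell out how recall was computed. As a rough sketch, assuming recall means the share of gold-standard analyses that the analyzer recovers, it could be calculated as below. The function name and the toy data are hypothetical placeholders, not values from the `eval` folder.

```python
def recall(gold, predicted):
    """Share of gold analyses that also appear in the predicted analyses.

    Both arguments map a word to a set of rule strings.
    """
    total = sum(len(rules) for rules in gold.values())
    if total == 0:
        raise ValueError("gold standard contains no analyses")
    found = sum(
        len(rules & predicted.get(word, set()))
        for word, rules in gold.items()
    )
    return found / total


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    gold = {"w1": {"rule_a", "rule_b"}, "w2": {"rule_c"}}
    predicted = {"w1": {"rule_a", "rule_x"}}
    print(f"{recall(gold, predicted):.2%}")
```

Extra predictions (such as `rule_x` here) do not lower recall, which fits the stated goal of returning every possible inflection regardless of context.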

OOVs and OO-Rules
-----------------
[OOs/extra.txt](https://github.com/lingwndr/cpia/blob/master/app/OOs/extra.txt) lists words and inflections that are not yet included in the Fsts. You can contribute to this list directly; it is used periodically to update the Fsts. Use the following tab-separated format, and write the inflection in the rule structure used by this analyzer. The third column (the context in which the word appears) is optional.

`فونت[TAB]اسمعام=فونت+رسمی[TAB]فونت قشنگی استفاده کردن`
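
Before submitting entries, it may help to check that each line matches this layout. The sketch below is a hypothetical checker, not a tool shipped with cpia. It assumes `[TAB]` stands for a literal tab character and that a valid inflection contains `=`.

```python
def parse_extra_line(line):
    """Parse one 'word<TAB>rule[<TAB>context]' line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-separated fields, got {len(fields)}")
    word = fields[0].strip()
    rule = fields[1].strip()
    # The third column is optional; treat a blank one as missing.
    context = fields[2].strip() if len(fields) == 3 else ""
    if not word:
        raise ValueError("empty word column")
    if "=" not in rule:
        raise ValueError(f"inflection lacks '=': {rule!r}")
    return {"word": word, "rule": rule, "context": context or None}


if __name__ == "__main__":
    # The sample entry from this section, with real tab characters.
    sample = "فونت\tاسمعام=فونت+رسمی\tفونت قشنگی استفاده کردن"
    print(parse_extra_line(sample))
```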

Persian word structure; informal and formal
--------------------------------------------
The structure of words, especially informal words, is explained in full detail in the `Contemporary Persian Inflectional Analyzer` paper: [`docs/informal-analyzer.pdf`](https://github.com/lingwndr/cpia/blob/master/docs/informal-analyzer.pdf), or [on the journal website](https://jipm.irandoc.ac.ir/article-1-4337-en.html).
### Citation
```bibtex
@article{Heidarpour2021, 
  title = {Contemporary Persian Inflectional Analyzer}, 
  author = {Heidarpour, Davood and S.Sebt, Elham and Bi Jen Khan, Mahmoud and Salehi, Mostafa and Veisi, Hadi },  
  volume = {36}, 
  number = {4},  
  URL = {http://jipm.irandoc.ac.ir/article-1-4337-en.html},  
  eprint = {http://jipm.irandoc.ac.ir/article-1-4337-en.pdf},  
  journal = {Iranian Journal of Information Processing and Management},   
  doi = {10.52547/jipm.36.4.945},  
  year = {2021}  
}
```
Fst word rule structure; informal and formal
--------------------------------------------
All the lexicon, morphotactic, and morphophonemic rules are in the `lexc` folder. These files are compiled into Fsts with [Foma](https://fomafst.github.io/).
How the word rules were developed to build the Fsts is explained in the thesis: [`docs/thesis.pdf`](https://github.com/lingwndr/cpia/blob/master/docs/thesis.pdf)

### Citation
```bibtex
@mastersthesis{Heidarpour2018,
  title = {An inflectional analyzer for contemporary Persian},
  author = {Heidarpour, Davood and Salehi, Mostafa and Bi Jen Khan, Mahmoud and Veisi, Hadi},
  year = {2018}
} 
```
Secondary Fsts
--------------
These Fsts are designed to cover out-of-vocabulary informal/noisy words and are explained in the `Covering Out-of-Vocabulary Words of Informal Persian` paper: [`docs/informal-oov.pdf`](https://github.com/lingwndr/cpia/blob/master/docs/informal-oov.pdf)
### Citation
```bibtex
@incollection{Heidarpour2019, 
  title = {Covering Out-of-Vocabulary Words of Informal Persian}, 
  author = {Heidarpour, Davood and Salehi, Mostafa and Bi Jen Khan, Mahmoud and Veisi, Hadi and Ranjbar, Vahid},  
  booktitle = {5th National Conference on Computational Linguistics},
  URL = {https://neveeseh.com},  
  year = {2019}  
}
```


License
-------
Licensed under GNU General Public License Version 3 (GPLv3)

            
