# EduSenti: Education Review Sentiment in Albanian
[![PyPI][pypi-badge]][pypi-link]
[![Python 3.10][python3100-badge]][python3100-link]
[![Python 3.11][python311-badge]][python311-link]
Pretraining and sentiment corpora of student-to-instructor reviews, with
sentiment analysis in Albanian. This repository contains the code base for the paper
[RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian]. To
reproduce the results, see the paper [reproduction repository]. If you use our
model or API, please [cite](#citation) our paper.
<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc -->
## Table of Contents
- [Obtaining](#obtaining)
- [Usage](#usage)
- [API](#api)
- [Models](#models)
- [Differences from the Paper Repository](#differences-from-the-paper-repository)
- [Documentation](#documentation)
- [Changelog](#changelog)
- [Citation](#citation)
- [License](#license)
<!-- markdown-toc end -->
## Obtaining
The library can be installed with pip from the [pypi] repository:
```bash
pip3 install zensols.edusenti
```
The [models](#models) are downloaded on the first use of the command-line or
API.
## Usage
Command line:
```bash
$ edusenti predict sq.txt
(+): <Per shkak të gjendjes së krijuar si pasojë e pandemisë edhe ne sikur [...]>
(-): <Fillimisht isha e shqetësuar se si do ti mbanim kuizet, si do të [...]>
(+): <Kjo gjendje ka vazhduar edhe në kohën e provimeve>
...
```
Use the `csv` action to write all predictions to a comma-delimited file (see
`edusenti --help`).
## API
```python
>>> from zensols.edusenti import (
...     ApplicationFactory, Application, SentimentFeatureDocument
... )
>>> app: Application = ApplicationFactory.get_application()
>>> doc: SentimentFeatureDocument
>>> for doc in app.predict(['Kjo gjendje ka vazhduar edhe në kohën e provimeve']):
...     print(f'sentence: {doc.text}')
...     print(f'prediction: {doc.pred}')
...     print(f'logits: {doc.softmax_logit}')
...
sentence: Kjo gjendje ka vazhduar edhe në kohën e provimeve
prediction: +
logits: {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}
```
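The `softmax_logit` value printed above maps each label to a score. The following is a minimal sketch of how such a mapping might be post-processed, assuming it behaves like a plain `dict` as in the output shown. The helper `top_label` and its `min_score` cutoff are hypothetical and not part of the library:

```python
from typing import Dict, Optional, Tuple


def top_label(scores: Dict[str, float],
              min_score: float = 0.5) -> Tuple[Optional[str], float]:
    """Return the highest scoring label and its score.

    The label is ``None`` when the best score is below ``min_score``
    (a hypothetical cutoff for flagging low-confidence predictions).
    """
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return (label if score >= min_score else None), score


# scores copied from the example output above
scores = {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}
print(top_label(scores))
print(top_label(scores, min_score=0.9))
```

A cutoff like this could be used to route uncertain reviews to manual inspection; the right value depends on the application.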
## Models
The [models] are downloaded the first time the API is used. To change the
model on the command line (`xlm-roberta-base` is the default), use
`--override esi_default.model_namel=xlm-roberta-large`. You can also create a
`~/.edusentirc` file with the following:
```ini
[esi_default]
model_namel = xlm-roberta-large
```
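If the file needs to be generated from a script, the same content can be produced with Python's standard `configparser` module. This is only a sketch: the section and option names are copied verbatim from the snippet above, and writing the result to `~/.edusentirc` is left as a commented-out step so that the example has no side effects:

```python
import configparser
import io

# section and option names are copied verbatim from the README snippet
config = configparser.ConfigParser()
config['esi_default'] = {'model_namel': 'xlm-roberta-large'}

# render the configuration to a string instead of touching the home directory
buf = io.StringIO()
config.write(buf)
rc_text = buf.getvalue()
print(rc_text)

# to install it, write rc_text to ~/.edusentirc, for example:
# pathlib.Path('~/.edusentirc').expanduser().write_text(rc_text)
```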
The performance of the models on the test set, after training and validation, is shown below.
| Model | F1 | Precision | Recall |
|:--------------------|-----:|----------:|-------:|
| `xlm-roberta-base` | 78.1 | 80.7 | 79.7 |
| `xlm-roberta-large` | 83.5 | 84.9 | 84.7 |
However, the distributed models were trained on the training and test sets
combined. The validation metrics of those trained models are available on the
command line with `edusenti info`.
## Differences from the Paper Repository
The paper [reproduction repository] has quite a few differences, mostly around
reproducibility. However, this repository is designed to be a package used for
research that applies the model. To reproduce the results of the paper, please
refer to the [reproduction repository]. To use the best-performing model
(XLM-RoBERTa Large) from that paper, use this repository.
The primary difference is that this repository performs significantly better in
Albanian: F1 climbed from 71.9 to 83.5 (see [models](#models)).
However, this repository has no English sentiment model since it was only used
for comparing methods.
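To put the two F1 figures in perspective, the gain can be expressed both in absolute points and relative to the paper's score. The snippet below is plain arithmetic on the numbers quoted above; the variable names are illustrative only:

```python
paper_f1 = 71.9    # F1 quoted above for the paper repository
package_f1 = 83.5  # F1 quoted above for this repository

absolute_gain = package_f1 - paper_f1
relative_gain = absolute_gain / paper_f1 * 100

print(f'absolute gain: {absolute_gain:.1f} points')
print(f'relative gain: {relative_gain:.1f}%')
```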
Changes include:
* Python was upgraded from 3.9.9 to 3.11.6
* PyTorch was upgraded from 1.12.1 to 2.1.1
* HuggingFace transformers was upgraded from 4.19 to 4.35
* [zensols.deepnlp] was upgraded from 1.8 to 1.13
* The dataset was re-split and stratified
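For readers unfamiliar with the last item, a stratified split keeps the share of each label roughly the same in every subset. The sketch below illustrates the general idea in pure Python; it is not the code used to build the distributed dataset, and the function name, the 80/20 ratio and the toy labels are assumptions made for illustration:

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_ratio: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split example indices so that each label keeps about the same
    share in the train and test sets."""
    # group the example indices by their label
    by_label: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    # sample the test portion separately within each label group
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_ratio)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# toy labels using the three classes from the API example: +, - and n
labels = ['+'] * 10 + ['-'] * 5 + ['n'] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sorted(labels[i] for i in test_idx))
```

Sampling within each label group, rather than over the whole dataset at once, is what prevents a rare class from being under-represented in the test set by chance.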
## Documentation
See the [full documentation](https://plandes.github.io/edusenti/index.html).
The [API reference](https://plandes.github.io/edusenti/api.html) is also
available.
## Changelog
An extensive changelog is available [here](CHANGELOG.md).
## Citation
If you use this project in your research, please use the following BibTeX entry:
```bibtex
@inproceedings{nuci-etal-2024-roberta-low,
title = "{R}o{BERT}a Low Resource Fine Tuning for Sentiment Analysis in {A}lbanian",
author = "Nuci, Krenare Pireva and
Landes, Paul and
Di Eugenio, Barbara",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italy",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1233",
pages = "14146--14151"
}
```
## License
[MIT License](LICENSE.md)
Copyright (c) 2023 - 2024 Paul Landes and Krenare Pireva Nuci
<!-- links -->
[pypi]: https://pypi.org/project/zensols.edusenti/
[pypi-link]: https://pypi.python.org/pypi/zensols.edusenti
[pypi-badge]: https://img.shields.io/pypi/v/zensols.edusenti.svg
[python3100-badge]: https://img.shields.io/badge/python-3.10-blue.svg
[python3100-link]: https://www.python.org/downloads/release/python-3100
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110
[RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian]: https://example.com
[reproduction repository]: https://github.com/uic-nlp-lab/edusenti
[models]: https://zenodo.org/records/10795173
[zensols.deepnlp]: https://github.com/plandes/deepnlp
Raw data
{
"_id": null,
"home_page": "https://github.com/plandes/edusenti",
"name": "zensols.edusenti",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "tooling",
"author": "Paul Landes",
"author_email": "landes@mailc.net",
"download_url": "https://github.com/plandes/edusenti/releases/download/v0.0.1/zensols.edusenti-0.0.1-py3-none-any.whl",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Pretraining and sentiment student to instructor review sentiment corpora and analysis.",
"version": "0.0.1",
"project_urls": {
"Download": "https://github.com/plandes/edusenti/releases/download/v0.0.1/zensols.edusenti-0.0.1-py3-none-any.whl",
"Homepage": "https://github.com/plandes/edusenti"
},
"split_keywords": [
"tooling"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "292db27277033ad7dbe1c004cd4650d1d7eb3164669dc680b8083ab623cb85a0",
"md5": "03edc904c1f2f2f2ec26e5672cef1958",
"sha256": "889b9ac688834b3fc61b52e2aad15ccad55e856280eb4ac9573fdca0c8b11c82"
},
"downloads": -1,
"filename": "zensols.edusenti-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "03edc904c1f2f2f2ec26e5672cef1958",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 8588,
"upload_time": "2024-05-19T22:04:34",
"upload_time_iso_8601": "2024-05-19T22:04:34.034152Z",
"url": "https://files.pythonhosted.org/packages/29/2d/b27277033ad7dbe1c004cd4650d1d7eb3164669dc680b8083ab623cb85a0/zensols.edusenti-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-19 22:04:34",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "plandes",
"github_project": "edusenti",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "zensols.edusenti"
}