zensols.edusenti

Name	zensols.edusenti
Version	0.0.1
home_page	https://github.com/plandes/edusenti
Summary	Pretraining and sentiment student to instructor review sentiment corpora and analysis.
upload_time	2024-05-19 22:04:34
maintainer	None
docs_url	None
author	Paul Landes
requires_python	None
license	None
keywords	tooling
requirements	No requirements were recorded.

# EduSenti: Education Review Sentiment in Albanian

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.10][python3100-badge]][python3100-link]
[![Python 3.11][python311-badge]][python311-link]

Pretraining and sentiment corpora of student-to-instructor reviews, and their
analysis, in Albanian.  This repository contains the code base for the paper
[RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian].  To
reproduce the results, see the paper [reproduction repository].  If you use our
model or API, please [cite](#citation) our paper.

<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc -->
## Table of Contents

- [Obtaining](#obtaining)
- [Usage](#usage)
- [API](#api)
- [Models](#models)
- [Differences from the Paper Repository](#differences-from-the-paper-repository)
- [Documentation](#documentation)
- [Changelog](#changelog)
- [Citation](#citation)
- [License](#license)

<!-- markdown-toc end -->



## Obtaining

The library can be installed with pip from the [pypi] repository:
```bash
pip3 install zensols.edusenti
```

The [models](#models) are downloaded on the first use of the command-line or
API.


## Usage

Command line:
```bash
$ edusenti predict sq.txt
(+): <Per shkak të gjendjes së krijuar si pasojë e pandemisë edhe ne sikur [...]>
(-): <Fillimisht isha e shqetësuar se si do ti mbanim kuizet, si do të [...]>
(+): <Kjo gjendje ka vazhduar edhe në kohën e provimeve>
...
```

Use the `csv` action to write all predictions to a comma-delimited file (see
`edusenti --help`).


## API

```python
>>> from zensols.edusenti import (
...     ApplicationFactory, Application, SentimentFeatureDocument
... )
>>> # create the application (the models are downloaded on first use)
>>> app: Application = ApplicationFactory.get_application()
>>> doc: SentimentFeatureDocument
>>> # predict the sentiment of each sentence in the list
>>> for doc in app.predict(['Kjo gjendje ka vazhduar edhe në kohën e provimeve']):
...     print(f'sentence: {doc.text}')
...     print(f'prediction: {doc.pred}')
...     print(f'logits: {doc.softmax_logit}')
...
sentence: Kjo gjendje ka vazhduar edhe në kohën e provimeve
prediction: +
logits: {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}
```
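
The `softmax_logit` value printed above maps each label to a score.  The
following is a minimal sketch of picking the top label and its margin over the
runner-up from such a mapping, using the numbers from the example output.  The
`top_label` helper is hypothetical and not part of the package.

```python
from typing import Dict, Tuple


def top_label(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the highest scoring label and its margin over the runner-up."""
    # sort (label, score) pairs from highest to lowest score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    return best, best_score - second_score


# scores copied from the example output above
scores = {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}
label, margin = top_label(scores)
print(f'label: {label}, margin: {margin:.3f}')
```

A small margin could be used to flag predictions that deserve a manual look.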


## Models

The [models] are downloaded the first time the API is used.  To change the
model (by default `xlm-roberta-base` is used) on the command-line, use
`--override esi_default.model_namel=xlm-roberta-large`.  You can also create a
`~/.edusentirc` file with the following:

```ini
[esi_default]
model_namel = xlm-roberta-large
```
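
Since the override file is plain INI, its content can be checked with the
standard library before running the tool.  This is only a sketch: it parses the
same content from a string with `configparser`, and it makes no claim about how
the package itself loads its configuration.

```python
import configparser

# same content as the ~/.edusentirc example above
RC_TEXT = """
[esi_default]
model_namel = xlm-roberta-large
"""

parser = configparser.ConfigParser()
parser.read_string(RC_TEXT)
# look up the option in the section used by the override
model_name = parser['esi_default']['model_namel']
print(model_name)
```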

The performance of the models on the test set, after training and validation,
is given below.

| Model               |   F1 | Precision | Recall |
|:--------------------|-----:|----------:|-------:|
| `xlm-roberta-base`  | 78.1 |      80.7 |   79.7 |
| `xlm-roberta-large` | 83.5 |      84.9 |   84.7 |

However, the distributed models were trained on the training and test sets
combined.  The validation metrics of those trained models are available on the
command line with `edusenti info`.


## Differences from the Paper Repository

The paper [reproduction repository] differs from this repository in quite a few
ways, mostly around reproducibility.  This repository, in contrast, is designed
as a package for research that applies the model.  To reproduce the results of
the paper, please refer to the [reproduction repository].  To use the best
performing model (XLM-RoBERTa Large) from that paper, use this repository.

The primary difference is that this repository has significantly better
performance in Albanian: F1 climbed from 71.9 to 83.5 (see [models](#models)).
However, this repository has no English sentiment model since it was only used
for comparing methods.

Changes include:

* Python was upgraded from 3.9.9 to 3.11.6
* PyTorch was upgraded from 1.12.1 to 2.1.1
* HuggingFace transformers was upgraded from 4.19 to 4.35
* [zensols.deepnlp] was upgraded from 1.8 to 1.13
* The dataset was re-split and stratified
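
The last item refers to stratification, which keeps the label proportions
roughly the same in every split.  A minimal sketch of the idea in plain Python
follows.  The `stratified_split` function, the 80/20 ratio, the seed and the toy
data are illustrative assumptions, not the procedure used to build the corpus
splits.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def stratified_split(data: Sequence[Example], test_ratio: float = 0.2,
                     seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Split ``data`` so each label keeps roughly the same share in both parts."""
    # group the examples by their label
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for example in data:
        by_label[example[1]].append(example)
    rng = random.Random(seed)
    train: List[Example] = []
    test: List[Example] = []
    # split every label group separately with the same ratio
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# toy data: 10 positive, 5 negative and 5 neutral reviews
data = ([(f'pos {i}', '+') for i in range(10)] +
        [(f'neg {i}', '-') for i in range(5)] +
        [(f'neu {i}', 'n') for i in range(5)])
train, test = stratified_split(data)
print(len(train), len(test))
```

Splitting each label group on its own is what distinguishes this from a plain
random split, where a rare label could end up missing from the test set.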


## Documentation

See the [full documentation](https://plandes.github.io/edusenti/index.html).
The [API reference](https://plandes.github.io/edusenti/api.html) is also
available.


## Changelog

An extensive changelog is available [here](CHANGELOG.md).


## Citation

If you use this project in your research, please use the following BibTeX entry:

```bibtex
@inproceedings{nuci-etal-2024-roberta-low,
    title = "{R}o{BERT}a Low Resource Fine Tuning for Sentiment Analysis in {A}lbanian",
    author = "Nuci, Krenare Pireva  and
      Landes, Paul  and
      Di Eugenio, Barbara",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1233",
    pages = "14146--14151"
}
```


## License

[MIT License](LICENSE.md)

Copyright (c) 2023 - 2024 Paul Landes and Krenare Pireva Nuci


<!-- links -->
[pypi]: https://pypi.org/project/zensols.edusenti/
[pypi-link]: https://pypi.python.org/pypi/zensols.edusenti
[pypi-badge]: https://img.shields.io/pypi/v/zensols.edusenti.svg
[python3100-badge]: https://img.shields.io/badge/python-3.10-blue.svg
[python3100-link]: https://www.python.org/downloads/release/python-3100
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110

[RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian]: https://aclanthology.org/2024.lrec-main.1233
[reproduction repository]: https://github.com/uic-nlp-lab/edusenti
[models]: https://zenodo.org/records/10795173
[zensols.deepnlp]: https://github.com/plandes/deepnlp
