turkish-lm-tuner

Name	turkish-lm-tuner
Version	0.1.4
home_page	None
Summary	Implementation of the Turkish LM Tuner
upload_time	2024-11-17 13:28:24
maintainer	None
docs_url	None
author	None
requires_python	>=3.9
license	Apache-2.0
keywords	nlp turkish language models finetuning

<h1 align="center">🦖 Turkish LM Tuner</h1>

<br>

[![Paper](https://img.shields.io/badge/DOI-10.18653/v1/2024.findings--acl.600-blue)](https://aclanthology.org/2024.findings-acl.600/)
[![Code license](https://img.shields.io/badge/Code%20License-MIT-green.svg)](https://github.com/boun-tabi-LMG/blob/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/turkish-lm-tuner)](https://pypi.org/project/turkish-lm-tuner/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/turkish-lm-tuner)](https://pypi.org/project/turkish-lm-tuner/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/turkish-lm-tuner)](https://pypi.org/project/turkish-lm-tuner/)
[![GitHub Repo stars](https://img.shields.io/github/stars/boun-tabi-LMG/turkish-lm-tuner)](https://github.com/boun-tabi-LMG/turkish-lm-tuner/stargazers)

## Overview

Turkish LM Tuner is a library for fine-tuning Turkish language models on a range of NLP tasks. It is built on top of the [Hugging Face Transformers](https://github.com/huggingface/transformers) library and supports fine-tuning for both conditional generation and sequence classification. The library is designed to be modular and extensible, so new tasks and models are easy to add, and it provides data loaders for several Turkish NLP datasets.

## Installation

You can install `turkish-lm-tuner` from PyPI:

```bash
pip install turkish-lm-tuner
```

Alternatively, you can install the library directly from the GitHub repository:

```bash
pip install git+https://github.com/boun-tabi-LMG/turkish-lm-tuner.git
```
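To confirm that the installation succeeded, a quick check along the following lines may help. This is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of `turkish-lm-tuner`.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The package metadata declares requires_python >= 3.9.
python_ok = sys.version_info >= (3, 9)
print("Python >= 3.9:", python_ok)
print("turkish-lm-tuner version:", installed_version("turkish-lm-tuner"))
```

If the second line prints `None`, the package is not visible to the interpreter you are running, which usually points to a different virtual environment.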

## Model Support

Any encoder or conditional generation model that is compatible with the Hugging Face Transformers library can be used with Turkish LM Tuner. The following models have been tested and are supported:

- [TURNA](https://arxiv.org/abs/2401.14373)
- [mT5](https://aclanthology.org/2021.naacl-main.41/)
- [mBART](https://aclanthology.org/2020.tacl-1.47/)
- [BERTurk](https://github.com/stefan-it/turkish-bert)

## Task and Dataset Support

| Task | Datasets |
| --- | --- |
| Text Classification | [Product Reviews](https://huggingface.co/datasets/turkish_product_reviews), [TTC4900](https://dx.doi.org/10.5505/pajes.2018.15931), [Tweet Sentiment](https://ieeexplore.ieee.org/document/8554037) |
| Natural Language Inference | [NLI_TR](https://aclanthology.org/2020.emnlp-main.662/), [SNLI_TR](https://aclanthology.org/2020.emnlp-main.662/), [MultiNLI_TR](https://aclanthology.org/2020.emnlp-main.662/) |
| Semantic Textual Similarity | [STSb_TR](https://aclanthology.org/2021.gem-1.3/) |
| Named Entity Recognition | [WikiANN](https://aclanthology.org/P19-1015/), [Milliyet NER](https://doi.org/10.1017/S135132490200284X) |
| Part-of-Speech Tagging | [BOUN](https://universaldependencies.org/treebanks/tr_boun/index.html), [IMST](https://universaldependencies.org/treebanks/tr_imst/index.html) |
| Text Summarization | [TR News](https://doi.org/10.1007/s10579-021-09568-y), [MLSUM](https://aclanthology.org/2020.emnlp-main.647/), [Combined TR News and MLSUM](https://doi.org/10.1017/S1351324922000195) |
| Title Generation | [TR News](https://doi.org/10.1007/s10579-021-09568-y), [MLSUM](https://aclanthology.org/2020.emnlp-main.647/), [Combined TR News and MLSUM](https://doi.org/10.1017/S1351324922000195) |
| Paraphrase Generation | [OpenSubtitles](https://aclanthology.org/2022.icnlsp-1.14/), [Tatoeba](https://aclanthology.org/2022.icnlsp-1.14/), [TED Talks](https://aclanthology.org/2022.icnlsp-1.14/) |


## Usage
The tutorials in the [documentation](docs/) can help you get started with `turkish-lm-tuner`.

## Examples

### Fine-tune and evaluate a conditional generation model

```python
from turkish_lm_tuner import DatasetProcessor, TrainerForConditionalGeneration

# Dataset, task, and model configuration
dataset_name = "tr_news"
task = "summarization"
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
model_save_path = "turna_summarization_tr_news"
max_input_length = 764
max_target_length = 128

# Load the tokenizer and set up preprocessing for the chosen task
dataset_processor = DatasetProcessor(
    dataset_name=dataset_name, task=task, task_format=task_format, task_mode='',
    tokenizer_name=model_name, max_input_length=max_input_length, max_target_length=max_target_length
)

# Preprocess each split
train_dataset = dataset_processor.load_and_preprocess_data(split='train')
eval_dataset = dataset_processor.load_and_preprocess_data(split='validation')
test_dataset = dataset_processor.load_and_preprocess_data(split='test')

training_params = {
    'num_train_epochs': 10,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'output_dir': './',
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'predict_with_generate': True
}
optimizer_params = {
    'optimizer_type': 'adafactor',
    'scheduler': False,
}

model_trainer = TrainerForConditionalGeneration(
    model_name=model_name, task=task,
    optimizer_params=optimizer_params,
    training_params=training_params,
    model_save_path=model_save_path,
    max_input_length=max_input_length,
    max_target_length=max_target_length,
    postprocess_fn=dataset_processor.dataset.postprocess_data
)

# Train on the training split, then evaluate on the validation and test splits
trainer, model = model_trainer.train_and_evaluate(train_dataset, eval_dataset, test_dataset)

# Save the fine-tuned model together with its tokenizer
model.save_pretrained(model_save_path)
dataset_processor.tokenizer.save_pretrained(model_save_path)
```
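When choosing `num_train_epochs` and the batch sizes in `training_params`, it can help to estimate how many optimizer steps a run will take. The sketch below does that arithmetic in plain Python, assuming one optimizer step per effective batch. The function name, the `num_devices` and `gradient_accumulation_steps` arguments, and the dataset size of 10,000 examples are illustrative assumptions, not values taken from the library or from TR News, and the exact count reported by the trainer may differ slightly.

```python
import math


def estimate_optimizer_steps(
    num_examples: int,
    per_device_train_batch_size: int,
    num_train_epochs: int,
    num_devices: int = 1,
    gradient_accumulation_steps: int = 1,
) -> int:
    """Approximate the total number of optimizer steps for a training run."""
    effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_train_epochs


# Hypothetical dataset size; batch size and epoch count match the example above.
total_steps = estimate_optimizer_steps(
    num_examples=10_000, per_device_train_batch_size=4, num_train_epochs=10
)
print(total_steps)  # 25000
```

With `save_strategy` set to `'epoch'`, a run like this would write ten checkpoints, which is worth keeping in mind when `output_dir` points to a disk with limited space.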

### Evaluate a conditional generation model with custom generation config

```python
from turkish_lm_tuner import DatasetProcessor, EvaluatorForConditionalGeneration

# Dataset, task, and model configuration
dataset_name = "tr_news"
task = "summarization"
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
task_mode = ''
max_input_length = 764
max_target_length = 128

dataset_processor = DatasetProcessor(
    dataset_name, task, task_format, task_mode,
    model_name, max_input_length, max_target_length
)

test_dataset = dataset_processor.load_and_preprocess_data(split="test")

test_params = {
    'per_device_eval_batch_size': 4,
    'output_dir': './',
    'predict_with_generate': True
}

# Directory of the fine-tuned model (here, the path used in the fine-tuning example)
model_path = "turna_summarization_tr_news"

# Decoding settings passed to generation during evaluation
generation_params = {
    'num_beams': 4,
    'length_penalty': 2.0,
    'no_repeat_ngram_size': 3,
    'early_stopping': True,
    'max_length': 128,
    'min_length': 30,
}

evaluator = EvaluatorForConditionalGeneration(
    model_path, model_name, task, max_input_length, max_target_length, test_params,
    generation_params, dataset_processor.dataset.postprocess_data
)
results = evaluator.evaluate_model(test_dataset)
print(results)
```
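To make the `length_penalty` entry in `generation_params` less opaque: in Hugging Face beam search, a finished hypothesis is scored as its summed log-probability divided by `length ** length_penalty`, so positive values favor longer outputs. The sketch below reproduces that scoring rule in plain Python. The function name and the two candidate hypotheses are made up for illustration and do not come from the library or from any model output.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: summed log-probability / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up finished hypotheses: (summed log-probability, length in tokens).
short_hyp = (-6.0, 10)
long_hyp = (-12.0, 40)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.4f}, long={long_score:.4f}, winner={winner}")
```

With no penalty the shorter hypothesis ranks higher because its raw log-probability is larger, while at `length_penalty=2.0` (the value used above) the longer one is preferred. For summarization this trades brevity for coverage, so it is worth tuning together with `min_length` and `max_length`.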

## Reference

If you use this repository, please cite the following related [paper](https://aclanthology.org/2024.findings-acl.600/):

```bibtex
@inproceedings{uludogan-etal-2024-turna,
    title = "{TURNA}: A {T}urkish Encoder-Decoder Language Model for Enhanced Understanding and Generation",
    author = {Uludo{\u{g}}an, G{\"o}k{\c{c}}e  and
      Balal, Zeynep  and
      Akkurt, Furkan  and
      Turker, Meliksah  and
      Gungor, Onur  and
      {\"U}sk{\"u}darl{\i}, Susan},
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.600",
    doi = "10.18653/v1/2024.findings-acl.600",
    pages = "10103--10117",
}
```

## License

Note that all datasets belong to their respective owners. If you use the datasets provided by this library, please cite the original source.

This code base is licensed under the MIT license. See [LICENSE](license.md) for details.
