zeldarose

- Name: zeldarose
- Version: 0.9.0 (PyPI)
- Summary: Train transformer-based models
- Upload time: 2024-04-17 15:36:46
- Requires Python: >=3.8
- License: MIT
- Keywords: nlp, transformers, language-model
Zelda Rose
==========

[![Latest PyPI version](https://img.shields.io/pypi/v/zeldarose.svg)](https://pypi.org/project/zeldarose)
[![Build Status](https://github.com/LoicGrobol/zeldarose/actions/workflows/ci.yml/badge.svg)](https://github.com/LoicGrobol/zeldarose/actions?query=workflow%3ACI)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Documentation Status](https://readthedocs.org/projects/zeldarose/badge/?version=latest)](https://zeldarose.readthedocs.io/en/latest/?badge=latest)

A straightforward trainer for transformer-based models.

## Installation

Simply install with pipx:

```bash
pipx install zeldarose
```

## Train MLM models

Here is a short example of training first a tokenizer, then a transformer MLM model:

```bash
TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer  --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```

The `.txt` files are meant to be raw text files, with one sample (e.g. sentence) per line.
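For instance, a tiny training file could be created like this (the file name and contents are purely illustrative):

```bash
# One raw sample (here: one sentence) per line, no markup and no pre-tokenization
mkdir -p local
cat > local/train.txt <<'EOF'
Le petit chat dort sur le canapé.
Les modèles de langue apprennent des représentations contextuelles.
Chaque ligne de ce fichier est un exemple d'entraînement.
EOF
```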

There are other parameters (see `zeldarose transformer --help` for a comprehensive list); the one
you are probably most interested in is `--config`, which gives the path to a training config (for
which we have examples in [`examples/`](examples)).

The parameters `--pretrained-model`, `--tokenizer` and `--model-config` are all fed directly to
[Huggingface's `transformers`](https://huggingface.co/transformers) and can be [pretrained
model](https://huggingface.co/transformers/pretrained_models.html) names or local paths.
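Putting the two previous points together, a full invocation could look like the sketch below. The config file name is hypothetical (check [`examples/`](examples) for the files actually shipped); `--pretrained-model` is given a Hugging Face Hub name here, while `--tokenizer` is given a local path, to illustrate that both forms are accepted.

```bash
# Hypothetical config path: replace with an actual file from examples/
zeldarose transformer \
    --config examples/mlm.toml \
    --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```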


## Distributed training

This is somewhat tricky; you have several options:

- If you are running on a SLURM cluster, use `--strategy ddp` and invoke via `srun` (a sketch of a
  job script is given after this list).
  - You might want to preprocess your data first, outside of the main compute allocation. The
    `--profile` option can be abused for that purpose, since it won't run a full training, but
    will run any data preprocessing you ask for. It might also be beneficial at this step to load a
    placeholder model such as
    [RoBERTa-minuscule](https://huggingface.co/lgrobol/roberta-minuscule/tree/main) to avoid running
    out of memory, since the only thing that matters for this preprocessing is the tokenizer.
- Otherwise, you have two options:

  - Run with `--strategy ddp_spawn`, which uses `multiprocessing.spawn` to start the worker
    processes (tested, but possibly slower and more limited, see the `pytorch-lightning` docs)
  - Run with `--strategy ddp` and start with `torch.distributed.launch` with `--use_env` and
    `--no_python` (untested)
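For the SLURM case, here is a hedged sketch of what a job script could look like. The `#SBATCH`
resource directives, paths and task counts are placeholders to adapt to your cluster (and its exact
SLURM syntax); the preprocessing trick mentioned above (one run with `--profile` and a tiny
placeholder model on a login or CPU node) would happen before submitting it.

```bash
#!/bin/bash
#SBATCH --job-name=zeldarose-mlm
#SBATCH --nodes=2             # placeholder values: usually one task per GPU
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

# srun starts one zeldarose process per task; with --strategy ddp,
# pytorch-lightning picks up the SLURM environment to set up distributed training.
srun zeldarose transformer \
    --strategy ddp \
    --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```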

## Other hints

- Data management relies on 🤗 datasets and uses its cache management system. To run in a clean
  environment, you might have to check the cache directory pointed to by the `HF_DATASETS_CACHE`
  environment variable (e.g. by overriding it as in the snippet below).
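A minimal sketch (the path is a placeholder; adjust it to your setup):

```bash
# Redirect the 🤗 datasets cache to a dedicated, disposable location before running zeldarose;
# wiping this directory then gives you a clean slate.
export HF_DATASETS_CACHE="$HOME/scratch/hf_datasets_cache"
```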

## Inspirations

- <https://github.com/shoarora/lmtuners>
- <https://github.com/huggingface/transformers/blob/243e687be6cd701722cce050005a2181e78a08a8/examples/run_language_modeling.py>

## Citation

```bibtex
@inproceedings{grobol:hal-04262806,
    TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
    AUTHOR = {Grobol, Lo{\"i}c},
    URL = {https://hal.science/hal-04262806},
    BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
    ADDRESS = {Singapore, Indonesia},
    YEAR = {2023},
    MONTH = Dec,
    PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
    HAL_ID = {hal-04262806},
    HAL_VERSION = {v1},
}
```

            
