| Field | Value |
| --- | --- |
| Name | zeldarose |
| Version | 0.11.0 |
| Summary | Train transformer-based models |
| home_page | None |
| upload_time | 2024-06-12 12:41:38 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | MIT |
| keywords | nlp, transformers, language-model |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
Zelda Rose
==========
[![Latest PyPI version](https://img.shields.io/pypi/v/zeldarose.svg)](https://pypi.org/project/zeldarose)
[![Build Status](https://github.com/LoicGrobol/zeldarose/actions/workflows/ci.yml/badge.svg)](https://github.com/LoicGrobol/zeldarose/actions?query=workflow%3ACI)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Documentation Status](https://readthedocs.org/projects/zeldarose/badge/?version=latest)](https://zeldarose.readthedocs.io/en/latest/?badge=latest)
A straightforward trainer for transformer-based models.
## Installation
Simply install with pipx:
```bash
pipx install zeldarose
```
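
The README recommends pipx; since the package is published on PyPI, installing into a virtual environment with plain `pip` should also work (the environment path below is just an example):

```bash
# Alternative: install into a dedicated virtual environment (path is an example)
python -m venv .venv
source .venv/bin/activate
pip install zeldarose
```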
## Train MLM models
Here is a short example of training first a tokenizer, then a transformer MLM model:
```bash
TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```
The `.txt` files are meant to be raw text files, with one sample (e.g. a sentence) per line.
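
For instance, a minimal input file could be created like this (the path and sentences are purely illustrative):

```bash
# One raw-text sample per line; path and contents are just an example
cat > local/my-corpus.txt <<'EOF'
The quick brown fox jumps over the lazy dog.
Zelda Rose trains transformer models from plain text files.
Each line of the file is one training sample.
EOF
```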
There are other parameters (see `zeldarose transformer --help` for a comprehensive list); the one
you are most likely interested in is `--config`, which gives the path to a training config (for which
we have [`examples/`](examples)).
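
As a sketch, a run with an explicit training config could look like the following (the config
filename is hypothetical; substitute one of the files actually shipped in [`examples/`](examples)):

```bash
# Hypothetical config path: replace with a real file from examples/
zeldarose transformer \
    --config examples/mlm.toml \
    --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```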
The parameters `--pretrained-model`, `--tokenizer` and `--model-config` are all fed directly to
[Huggingface's `transformers`](https://huggingface.co/transformers) and can be either [pretrained
model](https://huggingface.co/transformers/pretrained_models.html) names or local paths.
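
For instance, to start from a model configuration rather than a pretrained checkpoint, one would
presumably pass `--model-config` instead of `--pretrained-model`; a sketch under that assumption
(`roberta-base` here is only an example of a `transformers`-compatible name):

```bash
# Sketch: use a model config (hub name or local path) instead of a pretrained checkpoint
zeldarose transformer \
    --model-config roberta-base \
    --tokenizer local/tokenizer \
    --out-dir local/muppet-from-scratch \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```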
## Distributed training
This is somewhat tricky; you have several options:

- If you are running on a SLURM cluster, use `--strategy ddp` and invoke via `srun` (a job script
  sketch is given after this list).
  - You might want to preprocess your data first, outside of the main compute allocation. The
    `--profile` option can be abused for that purpose, since it won't run a full training, but
    will run any data preprocessing you ask for. It might also be beneficial at this step to load a
    placeholder model such as
    [RoBERTa-minuscule](https://huggingface.co/lgrobol/roberta-minuscule/tree/main) to avoid running
    out of memory, since the only thing that matters for this preprocessing is the tokenizer.
- Otherwise, you have two options:
  - Run with `--strategy ddp_spawn`, which uses `multiprocessing.spawn` to start the process
    swarm (tested, but possibly slower and more limited, see the `pytorch-lightning` doc).
  - Run with `--strategy ddp` and start with `torch.distributed.launch` with `--use_env` and
    `--no_python` (untested).
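
A minimal sketch of the SLURM route (node, task and GPU counts are placeholders to adapt to your
cluster; the rest of the command is the MLM example from above):

```bash
#!/bin/bash
#SBATCH --job-name=zeldarose-mlm
#SBATCH --nodes=2               # placeholder: number of nodes
#SBATCH --ntasks-per-node=4     # placeholder: typically one task per GPU
#SBATCH --gpus-per-node=4       # placeholder: GPUs per node

# srun starts one zeldarose process per task; --strategy ddp makes them cooperate
srun zeldarose transformer \
    --strategy ddp \
    --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```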
## Other hints
- Data management relies on 🤗 datasets and uses its cache management system. To run in a clean
  environment, you might have to check the cache directory pointed to by the `HF_DATASETS_CACHE`
  environment variable (an example follows).
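
For example (the scratch path is only an illustration), the cache can be redirected to a disposable
directory before launching a run:

```bash
# Redirect the 🤗 datasets cache to a throwaway directory (path is an example)
export HF_DATASETS_CACHE=/scratch/$USER/hf-datasets-cache
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```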
## Inspirations
- <https://github.com/shoarora/lmtuners>
- <https://github.com/huggingface/transformers/blob/243e687be6cd701722cce050005a2181e78a08a8/examples/run_language_modeling.py>
## Citation
```bibtex
@inproceedings{grobol:hal-04262806,
TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
AUTHOR = {Grobol, Lo{\"i}c},
URL = {https://hal.science/hal-04262806},
BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
ADDRESS = {Singapore, Indonesia},
YEAR = {2023},
MONTH = Dec,
PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
HAL_ID = {hal-04262806},
HAL_VERSION = {v1},
}
```
Raw data

```json
{
"_id": null,
"home_page": null,
"name": "zeldarose",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "nlp, transformers, language-model",
"author": null,
"author_email": "Lo\u00efc Grobol <loic.grobol@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/e0/40/4eeb83d31fe035c05170dae241e3d09b1fd9dc3989b295d7afa45d7126e5/zeldarose-0.11.0.tar.gz",
"platform": null,
"description": "Zelda Rose\n==========\n\n[![Latest PyPI version](https://img.shields.io/pypi/v/zeldarose.svg)](https://pypi.org/project/zeldarose)\n[![Build Status](https://github.com/LoicGrobol/zeldarose/actions/workflows/ci.yml/badge.svg)](https://github.com/LoicGrobol/zeldarose/actions?query=workflow%3ACI)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Documentation Status](https://readthedocs.org/projects/zeldarose/badge/?version=latest)](https://zeldarose.readthedocs.io/en/latest/?badge=latest)\n\nA straightforward trainer for transformer-based models.\n\n## Installation\n\nSimply install with pipx\n\n```bash\npipx install zeldarose\n```\n\n## Train MLM models\n\nHere is a short example of training first a tokenizer, then a transformer MLM model:\n\n```bash\nTOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name \"my-muppet\" tests/fixtures/raw.txt\nzeldarose \ntransformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt\n```\n\nThe `.txt` files are meant to be raw text files, with one sample (e.g. sentence) per line.\n\nThere are other parameters (see `zeldarose transformer --help` for a comprehensive list), the one\nyou are probably mostly interested in is `--config`, giving the path to a training config (for which\nwe have [`examples/`](examples)).\n\nThe parameters `--pretrained-models`, `--tokenizer` and `--model-config` are all fed directly to\n[Huggingface's `transformers`](https://huggingface.co/transformers) and can be [pretrained\nmodels](https://huggingface.co/transformers/pretrained_models.html) names or local path.\n\n\n## Distributed training\n\nThis is somewhat tricky, you have several options\n\n- If you are running in a SLURM cluster use `--strategy ddp` and invoke via `srun`\n - You might want to preprocess your data first outside of the main compute allocation. The\n `--profile` option might be abused for that purpose, since it won't run a full training, but\n will run any data preprocessing you ask for. It might also be beneficial at this step to load a\n placeholder model such as\n [RoBERTa-minuscule](https://huggingface.co/lgrobol/roberta-minuscule/tree/main) to avoid runnin\n out of memory, since the only thing that matter for this preprocessing is the tokenizer.\n- Otherwise you have two options\n\n - Run with `--strategy ddp_spawn`, which uses `multiprocessing.spawn` to start the process\n swarm (tested, but possibly slower and more limited, see `pytorch-lightning` doc)\n - Run with `--strategy ddp` and start with `torch.distributed.launch` with `--use_env` and\n `--no_python` (untested)\n\n## Other hints\n\n- Data management relies on \ud83e\udd17 datasets and use their cache management system. 
To run in a clear\n environment, you might have to check the cache directory pointed to by the`HF_DATASETS_CACHE`\n environment variable.\n\n## Inspirations\n\n- <https://github.com/shoarora/lmtuners>\n- <https://github.com/huggingface/transformers/blob/243e687be6cd701722cce050005a2181e78a08a8/examples/run_language_modeling.py>\n\n## Citation\n\n```bibtex\n@inproceedings{grobol:hal-04262806,\n TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},\n AUTHOR = {Grobol, Lo{\\\"i}c},\n URL = {https://hal.science/hal-04262806},\n BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},\n ADDRESS = {Singapore, Indonesia},\n YEAR = {2023},\n MONTH = Dec,\n PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},\n HAL_ID = {hal-04262806},\n HAL_VERSION = {v1},\n}\n```\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Train transformer-based models",
"version": "0.11.0",
"project_urls": {
"Bug Tracker": "https://github.com/loicgrobol/zeldarose/issues",
"Changes": "https://github.com/loicgrobol/zeldarose/blob/main/CHANGELOG.md",
"Documentation": "https://zeldarose.readthedocs.io",
"Source Code": "https://github.com/loicgrobol/zeldarose"
},
"split_keywords": [
"nlp",
" transformers",
" language-model"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ca1dfad01abba6e8232e70b1d93b2fd5d9c07c43a7dc5d713f97ce64d584b1d5",
"md5": "232989071c458895c893c6fe3611a077",
"sha256": "3f5da331686a795766ac0f2f89a7bad459ca75cd1d9e0e8ba34d72bc56e88935"
},
"downloads": -1,
"filename": "zeldarose-0.11.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "232989071c458895c893c6fe3611a077",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 39639,
"upload_time": "2024-06-12T12:41:37",
"upload_time_iso_8601": "2024-06-12T12:41:37.416643Z",
"url": "https://files.pythonhosted.org/packages/ca/1d/fad01abba6e8232e70b1d93b2fd5d9c07c43a7dc5d713f97ce64d584b1d5/zeldarose-0.11.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e0404eeb83d31fe035c05170dae241e3d09b1fd9dc3989b295d7afa45d7126e5",
"md5": "53fa84c0bcb5772a8ce308a295a2c3c0",
"sha256": "c71b4744dfa16592bac461021607e75ac51efb371f9354f30f45a9c157dd194c"
},
"downloads": -1,
"filename": "zeldarose-0.11.0.tar.gz",
"has_sig": false,
"md5_digest": "53fa84c0bcb5772a8ce308a295a2c3c0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 31810,
"upload_time": "2024-06-12T12:41:38",
"upload_time_iso_8601": "2024-06-12T12:41:38.552372Z",
"url": "https://files.pythonhosted.org/packages/e0/40/4eeb83d31fe035c05170dae241e3d09b1fd9dc3989b295d7afa45d7126e5/zeldarose-0.11.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-12 12:41:38",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "loicgrobol",
"github_project": "zeldarose",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "zeldarose"
}
```