# transformer-smaller-training-vocab
[![PyPI version](https://badge.fury.io/py/transformer-smaller-training-vocab.svg)](https://badge.fury.io/py/transformer-smaller-training-vocab)
[![GitHub Issues](https://img.shields.io/github/issues/helpmefindaname/transformer-smaller-training-vocab.svg)](https://github.com/helpmefindaname/transformer-smaller-training-vocab/issues)
[![License: MIT](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT)
**Docs** are available [here](https://helpmefindaname.github.io/transformer-smaller-training-vocab)
## Motivation
Have you ever trained a transformer model and noticed that most tokens in the vocab are not used?
Logically, the embeddings of those unused tokens won't change during training, yet they still take up memory and compute resources on your GPU.
One could assume that the embeddings are only a small part of the model and therefore not relevant, but for models like [xlm-roberta-large](https://huggingface.co/xlm-roberta-large), 45.72% of the parameters are word embeddings.
Besides that, the gradient is computed for the whole embedding weight, leading to gradient updates that are mostly zeros and that consume a lot of memory, especially with stateful optimizers such as Adam.
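You can check how large this fraction is for any model yourself. A minimal sketch (the model name is just an example; the exact percentage depends on how parameters are counted):

```python
# Rough check of how many parameters sit in the input embedding matrix.
from transformers import AutoModel

model = AutoModel.from_pretrained("xlm-roberta-large")
embedding_params = model.get_input_embeddings().weight.numel()
total_params = sum(p.numel() for p in model.parameters())
print(f"word embeddings: {embedding_params / total_params:.2%} of all parameters")
```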
To reduce these inconveniences, this package provides a simple and easy-to-use way to
* gather usage statistics of the vocabulary (a conceptual sketch of this step follows below),
* temporarily reduce the vocabulary to only the tokens that will actually be used during training, and
* fit the tokens back in after training is finished, so the full model can be saved.
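The first step can be pictured as follows. This is only a conceptual sketch of gathering usage statistics, not the package's actual implementation:

```python
# Conceptual sketch: collect the set of token ids that occur in the training texts.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
texts = ["The quick brown fox", "jumps over the lazy dog"]

used_ids = set(tokenizer.all_special_ids)  # special tokens must always be kept
for input_ids in tokenizer(texts)["input_ids"]:
    used_ids.update(input_ids)

print(f"{len(used_ids)} of {tokenizer.vocab_size} tokens are actually used")
```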
### Limitations
This library works out of the box with any [FastTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizerFast).
However, if you want to use a `slow` tokenizer, things get trickier, as huggingface-transformers currently provides no interface for overwriting the vocabulary.
Slow tokenizers therefore require a custom implementation; currently the following tokenizers are supported:
* XLMRobertaTokenizer
* RobertaTokenizer
* BertTokenizer
If you want to use a tokenizer that is not on the list, please [create an issue](https://github.com/helpmefindaname/transformer-smaller-training-vocab/issues) for it.
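If you are not sure which kind of tokenizer you are using, the `is_fast` attribute of a loaded tokenizer tells you. A minimal check (the model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tokenizer.is_fast)  # True: no limitation applies; False: check the list above
```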
## Quick Start
### Requirements and Installation
The project requires Transformers 4.1.0+, PyTorch 1.8+ and Python 3.8+.
In your favorite virtual environment, simply run:
```
pip install transformer-smaller-training-vocab
```
### Example Usage
To enable the more efficient training, it is enough to make the following changes to an arbitrary training script:
```diff
+ from transformer_smaller_training_vocab import get_texts_from_dataset, reduce_train_vocab

  model = ...
  tokenizer = ...
  raw_datasets = ...
  ...

+ with reduce_train_vocab(model=model, tokenizer=tokenizer, texts=get_texts_from_dataset(raw_datasets, key="text")):
      def preprocess_function(examples):
          result = tokenizer(examples["text"], padding=padding, max_length=max_seq_length, truncation=True)
          result["label"] = [(label_to_id[l] if l != -1 else -1) for l in examples["label"]]
          return result

      raw_datasets = raw_datasets.map(
          preprocess_function,
          batched=True,
      )

      trainer = Trainer(
          model=model,
          train_dataset=raw_datasets["train"],
          eval_dataset=raw_datasets["validation"],
          tokenizer=tokenizer,
          ...
      )

      trainer.train()

+ trainer.save_model()  # save the model at the end so it contains the full vocab again
```
Done! The model will now be trained using only the necessary parts of the token embeddings.
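For reference, here is the same pattern as a standalone sketch without the `Trainer` API. The GLUE/cola dataset, model choice, and output path are assumptions for illustration, and the actual training loop is omitted:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from transformer_smaller_training_vocab import get_texts_from_dataset, reduce_train_vocab

model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
raw_datasets = load_dataset("glue", "cola")  # its text column is called "sentence"

with reduce_train_vocab(model=model, tokenizer=tokenizer, texts=get_texts_from_dataset(raw_datasets, key="sentence")):
    # Inside this block the embedding matrix only contains the tokens that occur
    # in the provided texts; tokenize the data and train the model here as usual.
    ...

# After the context manager exits, the full vocabulary is fitted back in,
# so the saved model again contains the original embedding matrix.
model.save_pretrained("my-finetuned-model")
tokenizer.save_pretrained("my-finetuned-model")
```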
## Impact
Here is a table documenting how much impact this technique has on training:
| **Model** | **Dataset** | **Vocab reduction** | **Model size reduction** |
|-----------|-------------|---------------------|--------------------------|
| [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) | CONLL 03 (en) | 93.13% | 42.58% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | CONLL 03 (en) | 93.13% | 64.31% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | CONLL 03 (en) | 43.64% | 8.97% |
| [bert-base-uncased](https://huggingface.co/bert-base-uncased) | CONLL 03 (en) | 47.62% | 10.19% |
| [bert-large-uncased](https://huggingface.co/bert-large-uncased) | CONLL 03 (en) | 47.62% | 4.44% |
| [roberta-base](https://huggingface.co/roberta-base) | CONLL 03 (en) | 58.39% | 18.08% |
| [roberta-large](https://huggingface.co/roberta-large) | CONLL 03 (en) | 58.39% | 8.45% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | cola | 77.67% | 15.97% |
| [roberta-base](https://huggingface.co/roberta-base) | cola | 86.08% | 26.66% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | cola | 97.79% | 67.52% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | mnli | 10.94% | 2.25% |
| [roberta-base](https://huggingface.co/roberta-base) | mnli | 14.78% | 4.58% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | mnli | 88.83% | 61.34% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | mrpc | 49.93% | 10.27% |
| [roberta-base](https://huggingface.co/roberta-base) | mrpc | 64.02% | 19.83% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | mrpc | 94.88% | 65.52% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | qnli | 8.62% | 1.77% |
| [roberta-base](https://huggingface.co/roberta-base) | qnli | 17.64% | 5.46% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | qnli | 87.57% | 60.47% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | qqp | 7.69% | 1.58% |
| [roberta-base](https://huggingface.co/roberta-base) | qqp | 5.91% | 1.83% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | qqp | 85.40% | 58.98% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | rte | 34.68% | 7.13% |
| [roberta-base](https://huggingface.co/roberta-base) | rte | 50.49% | 15.64% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | rte | 93.10% | 64.29% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | sst2 | 62.39% | 12.83% |
| [roberta-base](https://huggingface.co/roberta-base) | sst2 | 68.60% | 21.25% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | sst2 | 96.25% | 66.47% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | stsb | 51.35% | 10.56% |
| [roberta-base](https://huggingface.co/roberta-base) | stsb | 64.37% | 19.93% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | stsb | 94.88% | 65.52% |
| [bert-base-cased](https://huggingface.co/bert-base-cased) | wnli | 93.66% | 19.26% |
| [roberta-base](https://huggingface.co/roberta-base) | wnli | 96.03% | 29.74% |
| [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | wnli | 99.25% | 68.54% |
Note that while the reduced embeddings imply slightly less computation, those gains are negligible, as the gradient computation for the transformer-layer parameters dominates.
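The model size reduction in the table is essentially the fraction of parameters freed by dropping unused embedding rows. A back-of-envelope sketch, using approximate figures for xlm-roberta-base (vocab size ~250k, hidden size 768, ~278M parameters):

```python
def expected_size_reduction(total_params: int, vocab_size: int, hidden_dim: int, vocab_reduction: float) -> float:
    """Fraction of parameters removed when dropping unused embedding rows."""
    removed_params = vocab_reduction * vocab_size * hidden_dim
    return removed_params / total_params

# Approximate figures for xlm-roberta-base with 93.13% vocab reduction (CONLL 03):
print(f"{expected_size_reduction(278_000_000, 250_002, 768, 0.9313):.2%}")  # close to the ~64% reported above
```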