vocabtrimmer

Name	vocabtrimmer
Version	0.0.2
home_page	https://github.com/asahi417/lm-vocab-trmmer
Summary	Trimming vocabulary of pre-trained multilingual language models to language localization.
upload_time	2023-05-21 18:55:17
maintainer
docs_url	None
author	Asahi Ushio
requires_python	>=3.6
license	MIT License
keywords	language model t5 gpt3 bertnlp multilingual efficient-model
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

[![license](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://github.com/asahi417/lm-vocab-trimming/blob/master/LICENSE)
[![PyPI version](https://badge.fury.io/py/vocabtrimmer.svg)](https://badge.fury.io/py/vocabtrimmer)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/vocabtrimmer.svg)](https://pypi.python.org/pypi/vocabtrimmer/)
[![PyPI status](https://img.shields.io/pypi/status/vocabtrimmer.svg)](https://pypi.python.org/pypi/vocabtrimmer/)

# Vocabulary Trimming

<p align="center">
  <img src="https://raw.githubusercontent.com/asahi417/lm-vocab-trimming/master/assets/overview.png" width="400">
  <br><em> Figure 1: An illustration of vocabulary trimming to Korean and French. </em>
</p>


***Vocabulary Trimming (VT)*** is a model compression technique that reduces a multilingual LM vocabulary to a 
target language by deleting irrelevant tokens from its vocabulary (see Figure 1).
This repository contains a Python library, `vocabtrimmer`, that removes tokens irrelevant to the target language from a multilingual LM vocabulary.

<p align="center">
  <img src="https://raw.githubusercontent.com/asahi417/lm-vocab-trimming/master/assets/pie.png" width="400">
  <br><em> Figure 2: The ratio of the embedding matrix to the total number of model parameters for each multilingual LM, and for the embedding matrix after VT with the top-60k vocabulary. </em>
</p>

The motivation behind VT is that a multilingual LM has a huge vocabulary to cover all languages, which results in a large model size (see Figure 2).
However, in practice we don't need most of that vocabulary when we fine-tune the multilingual LM on a monolingual task. Hence,
we can delete the unused tokens to reduce the model size.

In theory, VT can compress any existing multilingual LM to build monolingual LMs in any language covered by the multilingual LM.
In our experiments, we show that VT can retain the original performance of the multilingual LM while being smaller than it
(in general, around 50% of the original vocabulary size is enough).
The evaluation is performed over four NLP tasks (two generative and two classification tasks) with four widely used multilingual
LMs in seven languages. Finally, we show that this methodology can keep the best of both the monolingual and multilingual
worlds: the trimmed models are as small as monolingual models without needing to be retrained specifically, and VT can even
limit potentially harmful social biases. Please check the experimental results as well as the technical details in our paper,
["TBA"](paper-link). To reproduce the results in our paper, please check [here](https://github.com/asahi417/lm-vocab-trimmer/tree/main/experiments).


## Get Started 🚀

Let's install `vocabtrimmer` via pip first.
```shell
pip install vocabtrimmer
```

## Vocabulary Trimming with `vocabtrimmer`
<p align="center">
  <img src="https://raw.githubusercontent.com/asahi417/lm-vocab-trimming/master/assets/vt_type.png" width="400">
  <br><em> Figure 3: Comparisons of Pre-FT vs Post-FT in an example of fine-tuning on a task in French. </em>
</p>

By default, VT relies on [mC4](https://huggingface.co/datasets/vocabtrimmer/mc4_validation) to find the set of language-specific
tokens and the frequency of each token.
In practice, VT is applied to a multilingual LM either before fine-tuning (pre-FT VT) or after fine-tuning (post-FT VT).
Both should work well in general, but post-FT VT is more robust, and it is the better fit if you already have a fine-tuned model, since no additional training is required.
Otherwise, pre-FT VT is an option, as it can reduce the time needed to fine-tune the model.
See the comparison of pre/post-FT VT in our [paper](paper-link).
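
The token-selection step described above (collect the tokens seen in a target-language corpus, rank them by frequency, keep the top ones) can be illustrated with a small sketch. This is not the implementation used by `vocabtrimmer`; the function names `select_tokens` and `build_id_map`, the toy corpus, and the choice to always keep special tokens are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def select_tokens(
    tokenized_corpus: Iterable[List[int]],
    special_token_ids: Iterable[int] = (),
    target_vocab_size: Optional[int] = None,
) -> List[int]:
    """Return the token ids to keep, sorted in ascending order.

    Hypothetical helper: special tokens are always kept, then the most
    frequent corpus tokens are added until the target size is reached.
    With no target size, every token seen in the corpus is kept.
    """
    freq = Counter(tid for sentence in tokenized_corpus for tid in sentence)
    keep = set(special_token_ids)
    # Most frequent first; ties are broken by token id for determinism.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    for tid, _count in ranked:
        if target_vocab_size is not None and len(keep) >= target_vocab_size:
            break
        keep.add(tid)
    return sorted(keep)


def build_id_map(kept_ids: List[int]) -> Dict[int, int]:
    """Map each kept original token id to its index in the trimmed vocabulary."""
    return {old: new for new, old in enumerate(kept_ids)}


# Toy corpus of already-tokenized sentences (token ids are made up).
corpus = [[5, 9, 9, 2], [9, 5, 7], [9, 3]]
kept = select_tokens(corpus, special_token_ids=[0, 1], target_vocab_size=4)
print(kept)                # ids kept in the trimmed vocabulary
print(build_id_map(kept))  # original id -> new id
```

The id map is what allows the embedding rows and the tokenizer entries to be renumbered consistently after trimming.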

### VT in Command-Line
`vocabtrimmer` provides the following command-line interface to trim a multilingual LM vocabulary.
```bash
vocabtrimmer-trimming -m MODEL -l LANGUAGE -p PATH_TO_SAVE [-v TARGET_VOCAB_SIZE] [--repo-id REPO_ID] 

arguments:
  -m, --model, model name on huggingface or path to local model
  -l, --language, language code of tokens to keep
  -p, --path-to-save, directory to save model
  -v, --target-vocab-size, [optional] vocab size after trimming
  --repo-id, [optional] huggingface repo id to push after trimming
```
The following command trims the vocabulary of `google/mt5-small` to French, keeping the top 60k tokens.
```bash
vocabtrimmer-trimming -m "google/mt5-small" -l "fr" -v 60000 -p "ckpts/mt5-small-trimmed-fr-60000"
```
The vocabulary size of multilingual LMs is usually around 250k (XLM-R, mBART, mT5), and we recommend setting the target vocabulary size to 60k,
the effective vocabulary size. A vocabulary smaller than 60k may degrade performance, although it can retain the original performance in some cases
(check our [paper](paper-link)). If the target vocabulary size is not specified, VT keeps the whole vocabulary that appears in the mC4 dataset or the specified target corpus.
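
To get a feel for what going from 250k to 60k tokens means for the embedding matrix, the arithmetic can be sketched as below. The hidden size of 512 is an illustrative assumption, not a figure taken from this README, and `embedding_params` and `trim_rows` are hypothetical helpers rather than part of `vocabtrimmer`.

```python
from typing import List


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in a (vocab_size x hidden_size) embedding matrix."""
    return vocab_size * hidden_size


def trim_rows(embedding: List[List[float]], kept_ids: List[int]) -> List[List[float]]:
    """Keep only the embedding rows of the retained token ids, in the given order."""
    return [embedding[i] for i in kept_ids]


HIDDEN_SIZE = 512  # assumed for illustration only

original = embedding_params(250_000, HIDDEN_SIZE)
trimmed = embedding_params(60_000, HIDDEN_SIZE)
print(f"original embedding parameters: {original:,}")
print(f"trimmed embedding parameters:  {trimmed:,}")
print(f"remaining fraction:            {trimmed / original:.2f}")

# Trimming itself amounts to selecting rows of the embedding matrix.
toy_embedding = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
print(trim_rows(toy_embedding, kept_ids=[0, 2]))
```

Because the embedding size scales linearly with the vocabulary size, the remaining fraction depends only on the two vocabulary sizes, not on the assumed hidden size.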

### VT in Python
`vocabtrimmer` also provides a Python API to trim a multilingual LM.
The following snippet trims the vocabulary of `google/mt5-small` to French, keeping the top 60k tokens.
```python
import vocabtrimmer

trimmer = vocabtrimmer.VocabTrimmer("google/mt5-small")
trimmer.trim_vocab(
    path_to_save="ckpts/mt5-small-trimmed-fr-60000",
    language="fr",
    target_vocab_size=60000)
```
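
If you want to trim one model to several languages from a script, one option is to assemble the documented command-line call programmatically. The sketch below only uses the flags listed in the command-line section (`-m`, `-l`, `-p`, `-v`, `--repo-id`); the helper name `build_trimming_command` is hypothetical, and the language code `"ko"` for Korean is an assumption. The call that would actually run the command is left commented out.

```python
import shlex
from typing import List, Optional


def build_trimming_command(
    model: str,
    language: str,
    path_to_save: str,
    target_vocab_size: Optional[int] = None,
    repo_id: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the `vocabtrimmer-trimming` CLI."""
    cmd = ["vocabtrimmer-trimming", "-m", model, "-l", language, "-p", path_to_save]
    if target_vocab_size is not None:
        cmd += ["-v", str(target_vocab_size)]
    if repo_id is not None:
        cmd += ["--repo-id", repo_id]
    return cmd


# "ko" is an assumed language code; adjust to the codes your corpus uses.
for lang in ["fr", "ko"]:
    cmd = build_trimming_command(
        "google/mt5-small",
        lang,
        f"ckpts/mt5-small-trimmed-{lang}-60000",
        target_vocab_size=60000,
    )
    print(shlex.join(cmd))
    # To execute the command instead of printing it:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues when paths contain spaces.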

## Citation
Please cite the following paper if you use any of these resources, and see the code to reproduce the models if needed.

```
TBA
```
