# robo-lib

robo-lib provides tools for creating, configuring, and training custom transformer models on any data available to you.

## Main features:
- Customize and train tokenizers using an implementation of the features from the [tokenizers](https://pypi.org/project/tokenizers/#description) library.
- Customize a data processor that turns raw data into individual tensors, ready to train transformers without further preprocessing.
- Configure transformer models to fit specific requirements without having to write the internal logic.
- Use the three components together to create, train, and run custom transformers in different applications.

## Installation

```bash
pip install robo-lib
```

## Using robo-lib

Documentation can be found [here](https://github.com/hamburgerfish/robo_pack/wiki).

### Language translation example
- In this example, an encoder-decoder transformer is created for English-to-French translation.
- This example uses two .txt files for training: one containing the English sentences, the other the equivalent French sentence on each line (delimited by "\n").
- Create, train, and save tokenizers using `TokenizerConstructor`.
- In this example, the WordLevel tokenizer is used, along with the default arguments of `TokenizerConstructor`.

```python
import robo_lib as rl

encoder_tok = rl.TokenizerConstructor(tokenizer_type="WordLevel")
encoder_tok.train("english_data.txt")

decoder_tok = rl.TokenizerConstructor(tokenizer_type="WordLevel")
decoder_tok.train("french_data.txt")

rl.save_component(encoder_tok, "tokenizers/encoder_tok.pkl")
rl.save_component(decoder_tok, "tokenizers/decoder_tok.pkl")
```
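
- As a quick sanity check, the trained tokenizers expose a `vocab_size` attribute, which is the same value passed to `RoboConstructor` later on:

```python
# vocab_size is the attribute used below when configuring the model
print(encoder_tok.vocab_size)
print(decoder_tok.vocab_size)
```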

- The `DataProcessor` can be used to automatically process the data into a single `torch.Tensor` that the transformer can use directly for training.
- The tokenizer(s) must be specified when initialising a `DataProcessor`. In this case, both `enc_tokenizer` and `dec_tokenizer` are specified for an encoder-decoder transformer.
- The `process_list` method processes lists of string data, so the .txt files are read into lists before being passed to `process_list`.
- In this example, the data is split 90%/10% between training and validation.

```python
proc = rl.DataProcessor(dec_tokenizer=decoder_tok, enc_tokenizer=encoder_tok)

# read training .txt files into lists
with open("english_data.txt", "r") as file:
    english_list = file.read().split("\n")

with open("french_data.txt", "r") as file:
    french_list = file.read().split("\n")

# splitting lists into train and validation sets
split = 0.9
n = int(len(english_list) * split)
english_train = english_list[:n]
french_train = french_list[:n]
english_val = english_list[n:]
french_val = french_list[n:]

# process and save training data as data/training*.pt
# block_size_exceeded_policy="skip" removes training data larger than specified block size
proc.process_list(
    save_path="data/training",
    dec_data=french_train,
    dec_max_block_size=100,
    dec_block_size_exceeded_policy="skip",
    enc_data=english_train,
    enc_max_block_size=100,
    enc_block_size_exceeded_policy="skip"
)

# process and save validation data as data/validation*.pt
proc.process_list(
    save_path="data/validation",
    dec_data=french_val,
    dec_max_block_size=100,
    dec_block_size_exceeded_policy="skip",
    enc_data=english_val,
    enc_max_block_size=100,
    enc_block_size_exceeded_policy="skip"
)
```
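
- The saved file names follow the `<save_path>_decoder_data.pt` / `<save_path>_encoder_data.pt` pattern used by the training step below, so the output can be checked with plain PyTorch. A minimal sketch (the exact tensor shape is an assumption based on the 100-token block size above):

```python
import torch

# hedged sanity check: load the tensors written by process_list above
dec_train = torch.load("data/training_decoder_data.pt")
enc_train = torch.load("data/training_encoder_data.pt")

# each row should be one tokenized sentence padded to the max block size
# specified above; the exact shape is an assumption, inspect to confirm
print(dec_train.shape, enc_train.shape)
```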
- The `RoboConstructor` class is used to create and configure transformer models before training.
- A separate .py file is recommended for training.
- If `device` is not specified, `RoboConstructor` picks the first available of "cuda", "mps", and "cpu". The CUDA build of PyTorch is not installed as a dependency of robo-lib, so if you have a CUDA-compatible device it is highly recommended to install it yourself by following [these instructions](https://pytorch.org/get-started/locally/). An explicit-device sketch follows the training script below.
- The `train_robo` method trains the transformer and saves it to `save_path` every `eval_interval` iterations.
- If a non-`TokenizerConstructor` tokenizer is used, your tokenizer's pad token can be specified instead of the `dec_tokenizer` parameter.

```python
import robo_lib as rl

encoder_tok = rl.load_component("tokenizers/encoder_tok.pkl")
decoder_tok = rl.load_component("tokenizers/decoder_tok.pkl")

robo = rl.RoboConstructor(
    n_embed=512,
    dec_n_blocks=6,
    dec_n_head=8,
    dec_vocab_size=decoder_tok.vocab_size,
    dec_block_size=100,
    enc_n_blocks=6,
    enc_n_head=8,
    enc_vocab_size=encoder_tok.vocab_size,
    enc_block_size=100
)

robo.train_robo(
    max_iters=20000,
    eval_interval=200,
    batch_size=128,
    dec_training_path="data/training_decoder_data.pt",
    dec_eval_path="data/validation_decoder_data.pt",
    dec_training_masks_path="data/training_decoder_mask_data.pt",
    dec_eval_masks_path="data/validation_decoder_mask_data.pt",
    enc_training_path="data/training_encoder_data.pt",
    enc_eval_path="data/validation_encoder_data.pt",
    enc_training_masks_path="data/training_encoder_mask_data.pt",
    enc_eval_masks_path="data/validation_encoder_mask_data.pt",
    dec_tokenizer=decoder_tok,
    save_path="models/eng_to_fr_robo.pkl"
)
```
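
- `RoboConstructor` falls back through "cuda", "mps", and "cpu" automatically, as noted above; the device can also be chosen explicitly. A minimal sketch, assuming the constructor accepts a `device` keyword (inferred from the description above, so adjust if your robo-lib version differs):

```python
import torch
import robo_lib as rl

# choose the device explicitly rather than relying on the automatic fallback;
# the `device` keyword is an assumption inferred from the docs above
device = "cuda" if torch.cuda.is_available() else "cpu"

robo = rl.RoboConstructor(
    n_embed=512,
    dec_n_blocks=6,
    dec_n_head=8,
    dec_vocab_size=decoder_tok.vocab_size,
    dec_block_size=100,
    enc_n_blocks=6,
    enc_n_head=8,
    enc_vocab_size=encoder_tok.vocab_size,
    enc_block_size=100,
    device=device
)
```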

- For language translation, a loss of around 3 already shows good results.
- To use the trained transformer, the `generate` method can be employed.
- The `temperature`, `top_k`, and `top_p` values can be specified for this method, along with the tokenizers used.
- If a non-`TokenizerConstructor` tokenizer is used, your tokenizer's start, end, separator (decoder-only), and new-line tokens can be specified.
- In this example, a simple command-line script takes the user's English input, translates it with the transformer, and prints the French output to the console.

```python
import robo_lib as rl

robo = rl.load_component("models/eng_to_fr_robo.pkl")
encoder_tok = rl.load_component("tokenizers/encoder_tok.pkl")
decoder_tok = rl.load_component("tokenizers/decoder_tok.pkl")

while True:
    query = input()
    print(robo.generate(query, dec_tokenizer=decoder_tok, enc_tokenizer=encoder_tok))
```

### Shakespeare dialogue generator example
- In this example, a decoder-only transformer is created and trained on a file containing all the dialogue written by William Shakespeare in his plays.
- The training data is in the form of a single .txt file containing the dialogue.
- The default BPE tokenizer is used in this case, so no argument is specified for `TokenizerConstructor`.

```python
import robo_lib as rl

tok = rl.TokenizerConstructor()
tok.train("shakespeare_dialogues.txt")

rl.save_component(tok, "tokenizers/shakespeare_tok.pkl")
```

- In this example, instead of many separate pieces of training data, there is one large text file, from which random chunks of length `block_size` can be sampled during training. Therefore, a single large string is passed to the `DataProcessor` instead of a list of strings.
- Since this is a decoder-only transformer, the encoder arguments are not given.
- Since the entire string should be processed as is, rather than split into fixed-size blocks at this stage, no max block size is specified.
- `dec_create_masks` is set to `False`, as the training data contains no padding.

```python
proc = rl.DataProcessor(dec_tokenizer=tok)

# read training .txt file
with open("shakespeare_dialogues.txt", "r") as file:
    dialogues_str = file.read()

# splitting string into train and validation sets
split = 0.9
n = int(len(dialogues_str) * split)
train_data = dialogues_str[:n]
val_data = dialogues_str[n:]

# process and save training data as data/shakespeare_train*.pt
proc.process_list(
    save_path="data/shakespeare_train",
    dec_data=train_data,
    dec_create_masks=False
)

# process and save validation data as data/shakespeare_valid*.pt
proc.process_list(
    save_path="data/shakespeare_valid",
    dec_data=val_data,
    dec_create_masks=False
)
```
- Training the transformer.
```python
import robo_lib as rl

tok = rl.load_component("tokenizers/shakespeare_tok.pkl")

robo = rl.RoboConstructor(
    n_embed=1024,
    dec_n_blocks=8,
    dec_n_head=8,
    dec_vocab_size=tok.vocab_size,
    dec_block_size=200
)

robo.train_robo(
    max_iters=20000,
    eval_interval=200,
    batch_size=64,
    dec_training_path="data/shakespeare_train_decoder_data.pt",
    dec_eval_path="data/shakespeare_valid_decoder_data.pt",
    dec_tokenizer=tok,
    save_path="models/shakespeare_robo.pkl"
)
```
- In this example, the user specifies the start of the generated Shakespeare play, and the transformer generates and prints a continuation of up to `max_new_tokens` (1000) tokens.
- Temperature and top_k are set to 1.2 and 2 respectively to generate a more "creative" output.
```python
import robo_lib as rl

robo = rl.load_component("models/shakespeare_robo.pkl")
tok = rl.load_component("tokenizers/shakespeare_tok.pkl")

while True:
    start = input()
    print(robo.generate(start, max_new_tokens=1000, dec_tokenizer=tok, temperature=1.2, top_k=2))
```
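
- Since `generate` also accepts `top_p` (listed alongside `temperature` and `top_k` in the translation example above), nucleus sampling is a drop-in alternative; a hedged variant of the call above:

```python
# nucleus (top_p) sampling instead of top_k; same generate signature as above
print(robo.generate(start, max_new_tokens=1000, dec_tokenizer=tok, temperature=1.0, top_p=0.9))
```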
            
