GPT2LM

Name	GPT2LM JSON
Version	0.0.3 JSON
	download
home_page	https://github.com/YashrajBaila7/GPT2LM
Summary	Implementation of GPT-2 language model for text generation, training custom models and fine-tuning.
upload_time	2024-02-16 15:31:18
maintainer
docs_url	None
author	Yashraj Baila
requires_python	>=3.9
license
keywords	gpt2lm gpt2lm gpt gpt language model yashraj baila ai artificial intelligence ml meachine learning deep learning gpt2 gpt2 gpt3 gpt3 gpt4 gpt4 chatgpt chatgpt chatgpt
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # GPT2LM.bigram

Description: [Implementation of GPT-2 language model for text generation, training custom models and fine-tuning.]

---

## LanguageDataset

---

A class representing a language dataset for the Bigram Language Model.

### Class Description

#### Constructor

    def __init__(self, path: str, train_val_split: int = 90)

Initialize the LanguageDataset with the provided text data.

##### Parameters

1. `path` (str): The file path to the text data.

2. `train_val_split` (int, optional): The percentage split between training and validation data. Defaults to 90 (90% training, 10% validation).

---

### Methods

---

#### \_load_dataset

    def _load_dataset(self, path: str) -> str

Load the text data from the given file path.

##### Parameters

1. `path` (str): The file path to the text data.

##### Returns

1. `text` (str): The loaded text data as a string.

#### \_encode

    def _encode(self, s: str) -> List[int]

Convert the text data into a list of numerical representations using character encoding.

##### Parameters

1. `s` (str): The input text data to be encoded.

##### Returns

1. `List[int]`: A list of integer token indices representing the encoded text.

#### \_decode

    def _decode(self, l: List[int]) -> str

Convert a list of numerical representations back into the original text.

##### Parameters

1. `l` (List[int]): A list of integer token indices representing the encoded text.

##### Returns

1. `str`: The decoded text as a string.

#### \_calculate_split

    def _calculate_split(self, train_val_split: int) -> float

Calculate the split index for separating training and validation data.

##### Parameters

1. `train_val_split` (int): The percentage split between training and validation data.

##### Returns

1. `float`: The split index as a float between 0 and 1, representing the percentage split.

---

### Properties

---

#### text

**@property**

    def text(self) -> str

Get the loaded text data as a string.

#### chars

**@property**

    def chars(self) -> List[str]

Get the list of unique characters present in the text data.

#### vocab_size

**@property**

    def vocab_size(self) -> int

Get the size of the character vocabulary.

#### stoi

**@property**

    def stoi(self) -> Dict[str, int]

Get a dictionary mapping characters to their corresponding integer token indices.

#### itos

**@property**

    def itos(self) -> Dict[int, str]

Get a dictionary mapping integer token indices to their corresponding characters.

#### train_data

**@property**

    def train_data(self) -> torch.Tensor

Get the training data as a tensor containing numerical representations of the text.

#### val_data

**@property**

    def val_data(self) -> torch.Tensor

Get the validation data as a tensor containing numerical representations of the text.

---

---

## Head

A single head of self-attention used in the Transformer block.

### Constructor

    def __init__(self, head_size, n_embd, block_size, dropout):

Initialize the Head with linear layers and a lower triangular mask.

#### Parameters

1. `head_size` (int): The size of the attention head.

2. `n_embd` (int): The embedding dimension of the input.

3. `block_size` (int): The maximum length of the input sequence.

4. `dropout` (float): The dropout rate.

### Methods

#### forward

    def forward(self, x):

Perform the forward pass of the attention head.

##### Parameters

1. `x` (torch.Tensor): The input tensor of shape `(B, T, C)`, where `B` is the batch size, `T` is the sequence length, and `C` is the embedding dimension.

##### Returns

1. `torch.Tensor`: The output tensor after the self-attention operation. It has the same shape as the input tensor `(B, T, C)`.

---

---

## MultiHeadAttention

Multi-head self-attention module used in the Transformer block.

### Class Description

### Constructor

    def __init__(self, num_heads, head_size, n_embd, block_size, dropout):

Initialize the MultiHeadAttention with attention heads and projection layer.

#### Parameters

1. `num_heads` (int): The number of attention heads.

2. `head_size` (int): The size of each attention head.

3. `n_embd` (int): The embedding dimension of the input.

4. `block_size` (int): The maximum length of the input sequence.

5. `dropout` (float): The dropout rate.

---

### Methods

---

#### forward

    def forward(self, x):

Perform the forward pass of the multi-head self-attention module.

##### Parameters

1. `x` (torch.Tensor): The input tensor of shape `(B, T, C)`, where `B` is the batch size, `T` is the sequence length, and `C` is the embedding dimension.

##### Returns

1. `torch.Tensor`: The output tensor after applying the multi-head self-attention. It has the same shape as the input tensor `(B, T, C)`.

---

---

## FeedForward

Feed-forward neural network module used in the Transformer block.

### Constructor

    def __init__(self, n_embd, dropout):

Initialize the FeedForward with linear layers and activation functions.

#### Parameters

1. `n_embd` (int): The embedding dimension of the input.

2. `dropout` (float): The dropout rate.

### Methods

##### forward

    def forward(self, x):

Perform the forward pass of the feed-forward neural network.

##### Parameters

1. `x` (torch.Tensor): The input tensor of shape `(B, T, C)`, where `B` is the batch size, `T` is the sequence length, and `C` is the embedding dimension.

##### Returns

1. `torch.Tensor`: The output tensor after the feed-forward operation. It has the same shape as the input tensor `(B, T, C)`.

---

---

## Block

---

Transformer block module used in the BigramLanguageModel.

### Constructor

    def __init__(self, n_embd, n_head, block_size, dropout):

Initialize the Transformer block with self-attention and feed-forward modules.

#### Parameters

1. `n_embd` (int): The embedding dimension of the input.

2. `n_head` (int): The number of attention heads in the multi-head self-attention.

3. `block_size` (int): The maximum length of the input sequence.

4. `dropout` (float): The dropout rate.

---

### Methods

---

#### forward

    def forward(self, x):

Perform the forward pass of the transformer block.

##### Parameters

1. `x` (torch.Tensor): The input tensor of shape `(B, T, C)`, where `B` is the batch size, `T` is the sequence length, and `C` is the embedding dimension.

##### Returns

1. `torch.Tensor`: The output tensor after applying the transformer block. It has the same shape as the input tensor `(B, T, C)`.

---

---

## BigramLanguageModel

---

### Class Description

#### Constructor

    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout)

#### Parameters

1. `vocab_size` (int): The size of the vocabulary, which determines the number of tokens in the language model.

2. `n_embd` (int): The embedding dimension of the input.

3. `n_head` (int): The number of attention heads in the multi-head self-attention.

4. `n_layer` (int): The number of Transformer blocks in the model.

5. `block_size` (int): The maximum length of the input sequence.

6. `dropout` (float): The dropout rate for regularization.

---

### Methods

---

#### \_build_block

    def _build_block(self) -> Block

Creates a single Transformer block with specified embedding dimensions and attention heads.

#### forward

    def forward(self, idx: torch.Tensor, target: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor]]

Performs the forward pass of the BigramLanguageModel.

##### Parameters

1. `idx` (torch.Tensor): The input tensor representing the token indices of shape (B, T), where B is the batch size, and T is the sequence length.

2. `target` (torch.Tensor, optional): The target tensor representing the token indices for computing the loss. It should have the same shape as idx. If None, no loss will be computed. (Default: None)

##### Returns

1. `torch.Tensor`: The logits tensor after the forward pass of the model. It has the shape (B, T, vocab_size).

2. `torch.Tensor`: The loss tensor, computed using F.cross_entropy if the target is provided. Otherwise, None. It has the shape (B \* T,) if target is not None.

#### generate

    def generate(self, idx: torch.Tensor, max_new_tokens: int) -> List[int]

Generates new tokens using the trained language model.

##### Parameters

1. `idx` (torch.Tensor): The input tensor representing the token indices of shape (B, T), where B is the batch size, and T is the sequence length.

2. `max_new_tokens` (int): The maximum number of new tokens to generate.

##### Returns

1. `List[int]`: A list of generated token indices as integers.

---

---

## Utility Functions

---

### set_hyperparams

    def set_hyperparams(**kwargs)

Update the hyperparameters dictionary with the provided keyword arguments.

#### Parameters

1. `**kwargs`: Keyword arguments with hyperparameter names as keys and their corresponding values.

### load_dataset

    def load_dataset(path: str, train_val_split: int = 90) -> LanguageDataset

Load a dataset from the given file path and create a **LanguageDataset** object.

#### Parameters

1. `path` (str): The file path of the dataset to load.

2. `train_val_split` (int, optional): Percentage split between training and validation data. Default is 90, which means 90% training and 10% validation.

#### Returns

1. `LanguageDataset` : An instance of the **LanguageDataset** class containing the loaded dataset.

### initialize_model

    def initialize_model(vocab_size: int, n_embd: Optional[int] = None, n_head: Optional[int] = None, n_layer: Optional[int] = None, block_size: Optional[int] = None, dropout: Optional[float] = None, device: Optional[str] = None) -> BigramLanguageModel

Initialize a BigramLanguageModel with the given hyperparameters.

#### Parameters

1. `vocab_size` (int): The size of the vocabulary, which determines the number of tokens in the language model.

2. `n_embd` (int, optional): The embedding dimension of the input. If not provided, it will be set to a default value.

3. `n_head` (int, optional): The number of attention heads in the multi-head self-attention. If not provided, it will be set to a default value.

4. `n_layer` (int, optional): The number of Transformer blocks in the model. If not provided, it will be set to a default value.

5. `block_size` (int, optional): The maximum length of the input sequence. If not provided, it will be set to a default value.

6. `dropout` (float, optional): The dropout rate for regularization. If not provided, it will be set to a default value.

7. `device` (str, optional): The device on which to place the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.

#### Returns

1. `BigramLanguageModel`: An instance of the **BigramLanguageModel** class initialized with the given hyperparameters.

### train

    def train(model: BigramLanguageModel, train_data: torch.Tensor, val_data: torch.Tensor, learning_rate: Optional[float] = None, max_iters: Optional[int] = None, eval_interval: Optional[int] = None, device: Optional[str] = None, eval_iters: Optional[int] = None) -> BigramLanguageModel

Train the provided model on the given training data and validate it on the validation data.

#### Parameters

1. `model` (BigramLanguageModel): The language model to train.

2. `train_data` (torch.Tensor): The training data tensor.

3. `val_data` (torch.Tensor): The validation data tensor.

4. `learning_rate` (float, optional): The learning rate for the optimizer. If not provided, it will be set to a default value.

5. `max_iters` (int, optional): The maximum number of training iterations. If not provided, it will be set to a default value.

6. `eval_interval` (int, optional): Interval for printing evaluation results during training. If not provided, it will be set to a default value.

7. `device` (str, optional): The device on which to train the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.

8. `eval_iters` (int, optional): Number of iterations for evaluation during training. If not provided, it will be set to a default value.

#### Returns

1. `BigramLanguageModel`: The trained model.

### save

    def save(path: str, model: BigramLanguageModel)

Save the state_dict of the provided model to the specified file path.

#### Parameters

1. `path` (str): The file path where the model state_dict will be saved.

2. `model` (BigramLanguageModel): The model whose state_dict will be saved.

### load

    def load(path: str, model: BigramLanguageModel, device: Optional[str] = None)

Load the model state_dict from the specified file path and assign it to the provided model.

#### Parameters

1. `path` (str): The file path from which the model state_dict will be loaded.

2. `model` (BigramLanguageModel): The model to which the loaded state_dict will be assigned.

3. `device` (str, optional): The device on which to load the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.

### fine_tune_model

    def fine_tune_model(model: BigramLanguageModel, learning_rate: float, max_iters: int, eval_interval: int, train_data: torch.Tensor, val_data: torch.Tensor) -> BigramLanguageModel

Fine-tune the provided model using the given training data.

#### Parameters

1. `model` (BigramLanguageModel): The model to be fine-tuned.

2. `learning_rate` (float): The learning rate for the optimizer during fine-tuning.

3. `max_iters` (int): The maximum number of iterations for fine-tuning.

4. `eval_interval` (int): The interval at which to evaluate the model's performance during fine-tuning.

5. `train_data` (torch.Tensor): The training data for fine-tuning.

6. `val_data` (torch.Tensor): The validation data for evaluating the model's performance during fine-tuning.

#### Returns

1. `BigramLanguageModel`: The fine-tuned model.

### gpt2_124M

    def gpt2_124M(training_data_path: str, lr: float, max_iters: int, dropout: float = 0.1, eval_iters: int = 200, train_val_split: int = 85) -> BigramLanguageModel

Train a GPT-2 model with 124 million parameters on the provided training data.

#### Parameters

1. `training_data_path` (str): The path to the training data file.

2. `lr` (float): The learning rate for the optimizer during training.

3. `max_iters` (int): The maximum number of iterations for training.

4. `dropout` (float, optional): The dropout rate to be used in the model. Default is 0.1.

5. `eval_iters` (int, optional): The interval at which to evaluate the model's performance during training. Default is 200.

6. `train_val_split` (int, optional): The percentage of data to be used for training, and the rest for validation. Default is 85.

#### Returns

1. `BigramLanguageModel`: The trained GPT-2 model.

#### NOTE

1. **_`Parameters count depends on the vocab_size.`_**

### gpt2_finetune

    def gpt2_finetune(pretrained_model: BigramLanguageModel, training_data_path: str, lr: float, max_iters: int, eval_iters: int = 200, train_val_split: int = 85) -> BigramLanguageModel

Fine-tune a pretrained GPT-2 model on the provided training data.

#### Parameters

1. `pretrained_model` (BigramLanguageModel): The pretrained GPT-2 model to be fine-tuned.

2. `training_data_path` (str): The path to the training data file.

3. `lr` (float): The learning rate for the optimizer during fine-tuning.

4. `max_iters` (int): The maximum number of iterations for fine-tuning.

5. `eval_iters` (int, optional): The interval at which to evaluate the model's performance during fine-tuning. Default is 200.

6. `train_val_split` (int, optional): The percentage of data to be used for training, and the rest for validation. Default is 85.

#### Returns

1. `BigramLanguageModel`: The fine-tuned model.

### count_parameters

    def count_parameters(model: torch.nn.Module) -> int

Count the total number of trainable parameters in the given PyTorch model.

#### Parameters

1. `model` (torch.nn.Module): The PyTorch model for which the parameters need to be counted.

#### Returns

1. `int`: The total number of trainable parameters in the model.

### get_batch

    def get_batch(split: str, train_data: torch.Tensor, val_data: torch.Tensor, block_size: int = None, batch_size: int = None, device: str = None) -> Tuple[torch.Tensor, torch.Tensor]

Get a batch of data for training or validation from the specified split.

#### Parameters

1. `split` (str): The split to get the data from ('train' or 'val').

2. `train_data` (torch.Tensor): The training data tensor.

3. `val_data` (torch.Tensor): The validation data tensor.

4. `block_size` (int, optional): The size of each input sequence block. Defaults to None.

5. `batch_size` (int, optional): The number of sequences in a batch. Defaults to None.

6. `device` (str, optional): The device to store the batch data on ('cpu' or 'cuda'). Defaults to None.

#### Returns

1. `Tuple`[torch.Tensor, torch.Tensor]: A tuple containing two tensors - the input tensor representing the batch of sequences (x) and the target tensor representing the batch of sequences shifted by one (y).

### estimate_loss

    def estimate_loss(model: torch.nn.Module, train_data: torch.Tensor, val_data: torch.Tensor, eval_iters: int = None, block_size: int = None, batch_size: int = None, device: str = None) -> Dict[str, float]

Estimate the average loss of the model on the training and validation data.

### Parameters

1. `model` (torch.nn.Module): The PyTorch model for which the loss is to be estimated.

2. `train_data` (torch.Tensor): The training data tensor.

3. `val_data` (torch.Tensor): The validation data tensor.

4. `eval_iters` (int, optional): The number of iterations for estimating the loss. Defaults to None.

5. `block_size` (int, optional): The size of each input sequence block. Defaults to None.

6. `batch_size` (int, optional): The number of sequences in a batch. Defaults to None.

7. `device` (str, optional): The device to perform computations on ('cpu' or 'cuda'). Defaults to None.

#### Returns

1. `Dict`[str, float]: A dictionary containing the average loss on the training and validation data.

   _`Keys`_: 'train' and 'val'

   _`Values`_: Average loss values (float)

### default_hyperparams

    default_hyperparams = {
                                'batch_size': 64,
                                'block_size': 256,
                                'max_iters': 5000,
                                'learning_rate': 5.859375e-05,
                                'device': 'cuda' if torch.cuda.is_available() else 'cpu',
                                'eval_iters': 200,
                                'n_embd': 384,
                                'n_head': 6,
                                'n_layer': 6,
                                'dropout': 0.2,

    }

A dictionary containing the default hyperparameters for the GPT-2 model.

### set_hyperparamsdefault

      def set_hyperparamsdefault()

Set the default hyperparameters for the GPT-2 model using the set_hyperparams function.

---

## Example Usage

### Train model on default Hyperparameters

    # Import the required functions and classes

    from GPT2LM import bigram as b

    # Path to your training data file

    training_data_path = 'path_to_your_training_data.txt'

    # Load the dataset and split into training and validation sets (default split: 85% training, 15% validation)

    dataset = b.load_dataset(training_data_path)

    # set default hyperparams

    b.set_hyperparams()

    # Initialize the GPT-2 model with default hyperparameters and train on the dataset

    model = b.initialize_model(vocab_size=dataset.vocab_size) # keep record of the 'vocab_size' to use later.

    # Train the model on the training data and validate it on the validation data

    trained_model = b.train(model, dataset.train_data, dataset.val_data)

    # Save the trained model to a file

    b.save('path_to_save_model.pt', trained_model)

### Train model on custom hyperparameters

    # Import the required functions and classes

    from GPT2LM import bigram as b

    # Path to your training data file

    training_data_path = 'path_to_your_training_data.txt'

    # Load the dataset and split into training and validation sets (default split: 85% training, 15% validation)

    dataset = b.load_dataset(training_data_path)

    # set custom hyperparams

    3 This will assign custom values to default hyperparams

    batch_size=32
    block_size=500
    max_iters=10000
    learning_rate=0.0001
    eval_interval=250
    n_embd=800
    n_head=8
    n_layer=8
    dropout=0.1

    b.set_hyperparams(batch_size=batch_size,
                    block_size=block_size,
                    max_iters=max_iters,
                    learning_rate=learning_rate,
                    eval_interval=eval_interval,
                    n_embd=n_embd,
                    n_head=n_head,
                    n_layer=n_layer,
                    dropout=dropout
    )

    # Initialize the GPT-2 model with custom hyperparameters and train on the dataset

    model = b.initialize_model(vocab_size=dataset.vocab_size) # keep record of the 'vocab_size' to use later.

    # Train the model on the training data and validate it on the validation data

    trained_model = b.train(model, dataset.train_data, dataset.val_data)

    # Save the trained model to a file

    b.save('path_to_save_model.pt', trained_model)

### Generate text

    # initiate context
    context=torch.zeros((1,1),dtype=torch.long,device=device)

    # generate the new tokens and decode them into text
    text=dataset._decode(model.generate(context,max_new_tokens=200))

### Generate text with prompt

    # initiate context
    context=torch.zeros((1,1),dtype=torch.long,device=device)

    # Give prompt

    prompt = "To be or not to be, that is the question."

    # encode prompt into tokens

    prompt_encoded = torch.tensor(dataset._encode(prompt), dtype=torch.long, device=device)
    prompt_encoded = prompt_encoded.unsqueeze(0)

    # initiate context with prompt

    context_with_prompt = torch.cat((context, prompt_encoded), dim=1)

    # generate the new tokens and decode them into text

    text=dataset._decode(model.generate(context_with_prompt,max_new_tokens=200))

## About

This project is designed for educational purposes to help you learn and understand how generative AI works. It demonstrates the implementation of a GPT-2 variant, a popular generative language model, and provides a hands-on experience in training language models on custom datasets. By exploring the code and running the provided examples, you can gain insights into the underlying principles of generative AI and natural language processing.

Please note that while this project serves as a learning resource, it is essential to be mindful of the ethical considerations and potential risks associated with AI models. The use of AI technology, especially in generating text, should be done responsibly and with respect for ethical guidelines.



### How GPT works

Generative Pre-trained Transformer (GPT) models, such as GPT-2, are designed to predict the next word or token in a sequence of text. They achieve this by leveraging the power of self-attention mechanisms within the Transformer architecture.

At a high level, here's how GPT models work to predict the next word or token:

1. `Language Modeling`: GPT models are trained on large amounts of text data using a process called language modeling. During training, the model learns the statistical patterns and dependencies present in the input text. It captures the relationships between different words and the likelihood of one word following another in a sentence.

2. `Contextual Understanding`: The key feature of GPT models is their ability to understand the context in which a word or token appears. Instead of treating each word in isolation, the model considers the entire sequence of tokens and assigns higher importance to the tokens that are most relevant for understanding the current word.

3. `Self-Attention`: Self-attention is the mechanism that allows GPT models to weigh the importance of different tokens in the input sequence based on their contextual relevance. Tokens that are semantically related or have strong dependencies receive higher attention scores, while unrelated tokens receive lower attention scores.

4. `Autoregressive Generation`: To predict the next word or token in a sequence, GPT models use an autoregressive generation process. This means that the model predicts each token one by one, conditioned on the tokens that came before it in the sequence. The previously predicted tokens serve as context to guide the prediction of the next token.

5. `Text Completion`: By repeatedly predicting the next token and using the predicted tokens as context, the GPT model can complete a given prompt or generate entirely new text. The generated text is coherent and contextually appropriate because the model has learned from vast amounts of text data during pre-training.

6. `Adaptability`: GPT models are highly adaptable and can be fine-tuned on specific tasks or domains after pre-training. This process allows them to perform well on a wide range of natural language processing tasks, such as text generation, machine translation, question answering, and more.

In summary, GPT models are powerful language models that predict the next word or token in a sequence by understanding the context and dependencies between words. Their ability to generate coherent and contextually relevant text makes them valuable tools for various natural language processing applications.

#### How we trick them to chat with user

As these models are designed to complete given documents, they are not capable of engaging in real-time conversations with users. If you present them with a question, they will simply generate more questions that resemble the input. To overcome this limitation, we use a specific format for interaction.

For instance, we frame the conversation as if it's between a human and an AI. This format helps the AI understand its role and context better. Here's an example prompt:

"AI: What can I do for you today?
\nHuman: ..."

In this format, the human provides their input after "Human:", and the AI responds accordingly. The entire conversation is treated as a single document, with the AI's responses generated by the model and the human's responses provided by an actual person.

This approach allows us to harness the power of the language model to generate creative and contextually relevant AI responses, while human input ensures that the conversation remains meaningful and coherent.

By using this interaction format, we can effectively leverage the capabilities of the language model in completing text documents, making it a useful tool for various natural language processing tasks.

### Installation

    pip install GPT2LM

### What is new in this version?

There are no such big changes in bigram, the only difference is the use of the **`GELU`** activation function instead of the **`ReLU`** activation function.

`ReLU(x)` or `Rectified Linear Unit` = max(0,x)

`GELU(x)` or `Gaussian Error Linear Unit` = xΦ(x) = 0.5x(1+tanh[2/π(x+0.044715x^3)])

#### Why GELU is used?

The `Gaussian Error Linear Unit (GELU)` is an activation function used in deep learning. It is defined as `xΦ(x)`, where `Φ(x)` is the `standard Gaussian cumulative distribution function`. The GELU nonlinearity weights inputs by their percentile, rather than gating inputs by their sign as in ReLUs. Consequently, the GELU can be thought of as a smoother ReLU.

It is not known why certain activation functions work better than others in different contexts. So the only answer for “why use GELU instead of ReLu” is "because it works better". However, there are some possible explanations. For example, ReLU can suffer from "problems where a significant amount of neurons in the network become zero and don’t practically do anything". GELU is smoother near zero and “is differentiable in all ranges, and allows to have gradients (although small) in the negative range” which helps with this problem.

In summary, GELU is used in generative AI instead of ReLU because it has been found to work better in practice. It is a smoother version of ReLU that allows for gradients in the negative range, which can help prevent problems with neurons becoming inactive.

#### But why even use activation functions?

Activation functions are used in neural networks to introduce non-linearity into the output of a neuron. Without an activation function, the weights and bias would simply do a linear transformation. A linear equation is simple to solve but is limited in its capacity to solve complex problems and have less power to learn complex functional mappings from data. A neural network without an activation function is just a **`linear regression model`**.

Generally, neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions. Activation functions provide the building blocks that can be used repeatedly in two dimensions of the network structure so that, combined with an attenuation matrix to vary the weight of signaling from layer to layer, is known to be able to approximate an arbitrary and complex function.

In summary, activation functions are used in neural networks to introduce non-linearity into the output of a neuron, which allows the network to learn complex data and provide accurate predictions. They are essential building blocks of neural networks.

### Is the model is GPT (GPT VS Bigram) ?

A bigram is a type of n-gram model that represents text as the joint probability of a sequence of two words. Bigrams, along with other n-grams, are used in most successful language models for speech recognition.

On the other hand, GPT (Generative Pretrained Transformer) is a type of neural language model that uses a transformer architecture to generate text. GPT is a modern Transformer language model, which is more complex than a bigram character-level language model.

In summary, a bigram is a simple statistical model that represents text as the joint probability of a sequence of two words, while GPT is a more complex neural language model that uses a transformer architecture to generate text. The two models have different strengths and weaknesses and can be used for different purposes.

#### But it does not answer my question.

Certainly, let's differentiate between `GPT` (a type of transformer model), a `bigram` model, and the `custom BigramLanguageModel` :

1. **`GPT (Generative Pre-trained Transformer)`**:

   1. `Architecture`: GPT is based on the transformer architecture, which consists of a stack of multiple layers of self-attention and feedforward neural networks.
   2. `Self-Attention`: GPT uses multi-head self-attention to capture relationships between tokens in the input sequence.
   3. `Parameter Sharing`: GPT is a fully autoregressive model, meaning it generates one token at a time based on the previously generated tokens. It doesn't rely on fixed n-grams like bigram models.
   4. `Pre-training`: GPT is pre-trained on a large corpus of text data, learning language patterns, and semantic understanding from diverse text sources.
   5. `Generative`: GPT is designed for generating text and is commonly used for tasks like text completion, text generation, and language translation.

2. **`Bigram Model`**:

   1. `Architecture`: A bigram model is a simple statistical language model that focuses on pairs of consecutive words (bigrams) in a text corpus.
   2. `Scope`: It doesn't have a deep neural network architecture like transformers or GPT. Instead, it relies on counting the frequency of bigrams in a given text corpus.
   3. `Parameterization`: Bigram models have a relatively small number of parameters compared to deep learning models.
   4. `Generative Power`: Bigram models are less expressive than GPT and cannot capture complex language structures or long-range dependencies in the same way.
   5. `Use Case`: Bigram models are often used for basic language modeling tasks but are less suited for generating coherent and contextually rich text compared to GPT.

3. **`Custom Model (BigramLanguageModel)`**:

   1. `Architecture`:This model a custom implementation. It combines elements of self-attention and feedforward neural networks within each block.
   2. `Embeddings`: It uses token embeddings and positional embeddings, which are common in transformer-based models.
   3. `Multi-Head Attention`: Like GPT, it incorporates multi-head self-attention within each block.
   4. `Complexity`: This custom model is more complex and expressive than a simple bigram model but may not have the same generative power or pre-training as GPT.

In summary, GPT is a state-of-the-art transformer-based model designed for generative tasks, bigram models are simpler statistical models focusing on word pairs, and the custom model combines elements of transformers but is a custom implementation tailored for a specific purpose. Each model has its own strengths and weaknesses depending on the task at hand.

### What you have to do.

Go to Colab or Kaggle and train two models, both on the same dataset but with different versions of GPT2LM, to understand how the model works with different activation functions and why generative AI like GPT uses GELU instead of ReLU, probably the most popular activation function.

Also implement your own ReLU and GELU function and plot the graph of them using matplotlib on two or three different ranges to see the actual difference in both functions because on large ranges the graph will overlap each other and appears as the same graph. So on smaller ranges [-1,1] you will see a very small difference in graphs.


# GPT2LM.gpt2

Description: [Implementation of GPT-2 language model for text generation, training custom models and fine-tuning.]

## Utils Module Documentation

### Overview

The `utils` module provides various utility functions used for common tasks such as setting random seeds for reproducibility, setting up logging for experiments, loading/saving models and data, and generating text using language models. Additionally, it includes a lightweight configuration class inspired by YACS (`CfgNode`) for managing experiment configurations.

### Functions

1. `set_seed(seed)`
   - Description: Set the random seeds for reproducibility across different runs.
   - Parameters:
   - `seed` (int): The seed value to set for random number generators. 
2. `setup_logging(config)`
   - Description: Set up logging for the experiment. This function creates necessary directories and logs configurations.
   - Parameters:
     - `config` (object): Configuration object containing experiment settings.
3. `CfgNode`
    - Description: A lightweight configuration class inspired by YACS. This class provides a flexible way to manage configurations for experiments or applications.
    - `load_text(path)`
    - Description: Load text data from a file.
    - Parameters:
      - `path` (str): The path to the text file.
    - Returns:
      - `str` or `None`: The content of the text file as a string if successful, otherwise `None`.
4. `load_text(path)`

- **Description**: Load text data from a file.

- **Parameters**:
  - `path` (str): The path to the text file.

- **Returns**:
  - `str` or `None`: The content of the text file as a string if successful, otherwise `None`.

- **Raises**:
  - `FileNotFoundError`: If the specified file path does not exist.
  - `Exception`: If an error occurs while reading the file.

 5. `load_json_to_text(path)`

- **Description**: Load JSON data from a file and convert it into a single text string.

- **Parameters**:
  - `path` (str): The path to the JSON file.

- **Returns**:
  - `str` or `None`: A single text string containing the JSON data if successful, otherwise `None`.

- **Raises**:
  - `FileNotFoundError`: If the specified file path does not exist.
  - `Exception`: If an error occurs while reading the JSON file or converting it to text.

---

6. `load_csv_to_text(path)`

- **Description**: Load data from a CSV file and convert it into a single text string.

- **Parameters**:
  - `path` (str): The path to the CSV file.

- **Returns**:
  - `str` or `None`: A single text string containing the CSV data if successful, otherwise `None`.

- **Raises**:
  - `FileNotFoundError`: If the specified file path does not exist.
  - `Exception`: If an error occurs while reading the CSV file or converting it to text.

  1. `save(path, model)`
     - Description: Save the model state dictionary to a file.
      - Parameters:
        - `path` (str): Path to save the model state dictionary.
        - `model `(`torch.nn.Module`): Model to save.
  2. `load(path, model, device=None)`
      - Description: Load the model state dictionary from a file.
      - Parameters:
        - `path` (str): Path to the saved model state dictionary.
        - `model` (`torch.nn.Module`): Model to load the state dictionary into.
        - `device` (str, optional): Device to load the model onto (e.g., 'cuda' or 'cpu'). If None, it will automatically detect CUDA availability.
     - Returns:
       - `None`
  3. `Generate(model, prompt, encoder_path, max_len)`
     - Description: Generate text based on a given prompt using the provided model.
     - Parameters:
       - `model` (`torch.nn.Module`): The language model used for text generation.
       - `prompt` (str): The prompt to start text generation.
       - `encoder_path` (str): Path to the encoder file for encoding/decoding characters.
       - `max_len` (int): Maximum length of the generated text.
     - Returns:
        - `str`: Generated text based on the prompt.
  4.  `print_config()`
      - Description: Function to print the configurations of various GPT models.
      - Prints the configurations of different GPT models including the number of layers, attention heads, and embedding size.
  ### Classes
  - `CfgNode`: Lightweight configuration class inspired by YACS.
  ### Dependencies
  - `os`, `sys`, `csv`, `math`, `json`, `time`, `random`, `requests`, `regex`, `numpy`, `tqdm`:  Standard Python libraries for various tasks.
  - `torch`, `torch.nn`, ``torch.utils.data`: PyTorch library for deep learning.
  - `torch.utils.data.dataloader`: DataLoader class for loading data in batches.
  ### Global Variables
  - `device`: Represents the device on which tensors reside, either "cuda" (GPU) if available or "cpu" if not available.
 ### Notes
 - Ensure proper installation of PyTorch and other dependencies to utilize all functions effectively.
  
  ## GPT Language Model

  The GPT (Generative Pre-trained Transformer) Language Model is a powerful model architecture based on the Transformer architecture. This implementation includes functionalities for training and generation of text.

  ### Model Architecture:

  1. **Embedding Layers:**
     - Token Embeddings: Embeds input tokens into a continuous vector space.
     - Position Embeddings: Embeds the position of each token in the input sequence.
  2. **Transformer Blocks:**
     - Each block consists of a causal self-attention layer followed by a feed-forward neural network (MLP).
     - Layer normalization and residual connections are applied within each block.
  3. **Language Model Head:**
     - Linear layer projecting the output of the transformer blocks into logits over the vocabulary.

  ### Key Features:

  1. **Configurable:** 
     - The model architecture can be configured using a `CfgNode` object, allowing customization of the number of layers, heads, embedding size, and other parameters.
   2. **Optimization:**
      - Provides methods for configuring optimizers for training, including options for weight decay and learning rate scheduling.
   3. **Generation:**
      - Includes functionality for generating new sequences of text given a conditioning sequence.
   4. **Initialization:**
      - Parameters are initialized following best practices for transformer-based models, including custom initialization for certain layers.
  ### Parameters:
  - `config`: Configuration object containing model parameters.
  - `idx`: Input tensor representing token indices.
  - `targets`: Target tensor for training.
  - `max_new_tokens`: Maximum number of tokens to generate during text generation.
  - `temperature`: Temperature parameter for sampling during generation.
  - `do_sample`: Boolean indicating whether to sample from the distribution during generation.
  - `top_k`: Number of top-k tokens to consider during sampling.
  ### Methods:
  - `forward(idx, targets)`: Performs a forward pass through the model.
  - `generate(idx, max_new_tokens, temperature, do_sample, top_k)`: Generates new tokens given a conditioning sequence.
  ### Attributes:
  - `block_size`: Maximum block size for input sequences.
  - `device`: Device (CPU or GPU) where the model is currently located.
  ## CharDataset
  The `CharDataset` class is designed for handling character-level text data. It prepares the data for training by encoding characters into numerical indices. This dataset is suitable for language modeling tasks where the model predicts the next character in a sequence given the previous characters.

  ### Initialization:
       CharDataset(data, block_size, output_dir='dir/')

- `data (str)`: Text data to be processed.
- `block_size (int)`: Size of the text blocks used for training.
- `output_dir (str, optional)`: Output directory to save the encoder JSON file.

### Attributes:
- `vocab_size`: Number of unique characters in the dataset.
- `block_size`: Size of the text blocks.
- `data`: Raw text data.
  ### Methods:

- `get_vocab_size()`: Get the vocabulary size.
- `get_block_size()`: Get the block size.
- `__len__()`: Get the length of the dataset.
- `__getitem__(idx)`: Get an item from the dataset by index.
---

## Encoder

The `Encoder` class provides functionality for encoding and decoding text using an encoder dictionary. It loads the encoder dictionary from a JSON file and provides methods to encode and decode text.

### Initialization:
    Encoder(encoder_path)

- `encoder_path (str)`: Path to the encoder JSON file.
### Methods:
- `encode(s)`: Encode a string into a list of integers. 
- `decode(l)`: Decode a list of integers into a string.
  ### Example Usage:

      encoder = Encoder("encoder.json")
      encoded_text = encoder.encode("Hello, world!")
      decoded_text = encoder.decode(encoded_text)

---

## FineEncoderDataset

The `FineEncoderDataset` class is similar to `CharDataset` but allows for fine control over the encoding process. It requires an external encoder file (JSON format) for encoding and decoding the text data.

### Initialization:

    FineEncoderDataset(data, encoder_path, block_size)

- `data (str)`: Input data for the dataset.
- `encoder_path (str)`: Path to the encoder JSON file.
- `block_size (int)`: Size of each block in the dataset.

### Methods:

- `encode(s)`: Encode the input string using the encoder.
- `decode(l)`: Decode the input list using the encoder.

### Example Usage:

    dataset = FineEncoderDataset(data, "encoder.json", block_size=128)

---

## Trainer

The `Trainer` class is responsible for training neural networks. It provides functionality for running the training loop, handling data loading, optimization, and callback triggering.

### Initialization:

    Trainer(config, model, train_dataset)

- `config (CfgNode)`: Configuration for the Trainer.
- `model (nn.Module)`: The neural network model.
- `train_dataset`: The training dataset.

### Attributes:
- `device`: Device (CPU or GPU) used for training.
- `optimizer`: Optimizer used for training.
- `callbacks`: Dictionary storing callback functions for different events.
- `iter_num`: Current iteration number.
- `iter_time`: Time at the current iteration.
- `iter_dt`: Time difference between the current and previous iteration.
 
### Methods:
- `add_callback(onevent, callback)`: Add a callback function to be triggered on an event.
- `set_callback(onevent, callback)`: Set a callback function to be triggered on an event.
- `trigger_callbacks(onevent)`: Trigger the callbacks for a specific event.
- `run(num_epochs=None)`: Run the training loop.
  
  ### Example Usage:

      trainer = Trainer(config, model, train_dataset)
      trainer.run(num_epochs=10)


---

## FineTuner

The `FineTuner` class is used for fine-tuning a pre-trained model with new data. It initializes the training process with the provided model, data, encoder path, block size, and training configuration.

### Initialization:

    FineTuner(model, data, encoder_path, block_size, train_config)

- `model (torch.nn.Module)`: Model to fine-tune.
- `data (str)`: Input data for fine-tuning.
- `encoder_path (str)`: Path to the encoder file.
- `block_size (int)`: Size of each block in the dataset.
- `train_config (CfgNode)`: Training configuration for the fine-tuning process.

### Attributes:

- `device`: Device (CPU or GPU) used for fine-tuning.

### Methods:

- `fine_tune(epoch=1)`: Perform fine-tuning of the model for the specified number of epochs.

### Example Usage:

    fine_tuner = FineTuner(model, data, encoder_path, block_size, train_config)
    fine_tuner.fine_tune(epoch=5)


---

## Training a model

    from GPT2LM import gpt2

    block_size=1024 #Example block_size, you can choose your own
    text=gpt2.load_text('path to .txt) # Can use load_csv() or load_json() for csv and json 
    train_dataset=gpt2.CharDataset(text, block_size, 'out_dir') # Encoder.json will be save in out_dir

    model_config = gpt2.GPT.get_default_config()
    model_config.model_type = 'gpt-micro' # more model typesare available, use `print_config()` to see all
    model_config.vocab_size = train_dataset.vocab_size
    model_config.block_size = block_size  
    model = gpt2.GPT(model_config)


    train_config = gpt2.Trainer.get_default_config()
    train_config.learning_rate = 5e-4 
    train_config.max_iters = 1000 #iters per epoch
    train_config.batch_size = 64
    trainer = gpt2.Trainer(train_config, model, train_dataset)
    trainer.run(10) # number of epochs = 10

    print(gpt2.Generate(model, 'Hi', 'path_to_encoder.json', 200))  # 'Hi' is prompt and 200 is the maximum length of generated response means 200 characters will be generated


## Fine Tuning a model

    from GPT2LM import gpt2

    block_size=1024  # same as pre-trained model

    model_config = gpt2.GPT.get_default_config()
    model_config.model_type = 'gpt-micro'
    model_config.vocab_size = n # same as pre-trained model
    model_config.block_size = block_size  
    model = gpt2.GPT(model_config)

    load('path_to_pre_trained_model', model) # Load model weights from pre-trained model
    
    encoder_path = "path_to_encoder.json"
    fine_data=gpt2.load_text('path_to_txt_file)'
    fine_config = gpt2.Trainer.get_default_config()
    fine_config.learning_rate = 5e-4
    fine_config.max_iters = 1000
    fine_config.batch_size = 64
    finetuner = gpt2.FineTuner(model, fine_data, encoder_path, block_size, fine_config)
    finetuner.fine_tune(3)


## License

This project is licensed under the [MIT License](LICENSE).

## Talks
You can check the model `Model_Card` for all the available model types in `GPT2LM.gpt2`.

Here are some links for better understanding.

1. **_`Attention Is All You Need`_** [link here](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)

2. **_`Textbooks Are All You Need`_** [Link here](https://www.researchgate.net/publication/371729045_Textbooks_Are_All_You_Need)

3. **_`GPT(Generative Pre-trained Transformers)`_** [Link here](https://aws.amazon.com/what-is/gpt/)

4. **_`Language Models are Unsupervised Multitask Learners`_** [Link here](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)

5. **_`GPT-2`_** [Link here](https://en.wikipedia.org/wiki/GPT-2)
6. **_`GELU`_** [Link here](https://paperswithcode.com/method/gelu)
7. **_`ReLU`_** [Link here](https://paperswithcode.com/method/relu)
8. **_`ReLU VS GELU`_** [Link here](https://towardsai.net/p/l/is-gelu-the-relu-successor)
9. **_`GIT Link of this project`_** [Link here](https://github.com/YashrajBaila7/GPT2LM)

10. **_`PyPI Link of this project`_** [Link here](https://pypi.org/project/GPT2LM/)


Code is completely open source so you can play around it and get a better understanding of it.

**_GPT-3_** Hyperparameter are not publicly available yet but you can use your own hyperparameter to train a **_gpt-3_** like base model.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/YashrajBaila7/GPT2LM",
    "name": "GPT2LM",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "gpt2lm,GPT2LM,gpt,GPT,Language model,Yashraj Baila,AI,Artificial Intelligence,ML,Meachine Learning,Deep Learning,GPT2,gpt2,GPT3,gpt3,GPT4,gpt4,ChatGPT,chatgpt,CHATGPT",
    "author": "Yashraj Baila",
    "author_email": "yashrajbailapython@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/88/b9/ca0c3d43b61672fb59b000b8ccad8fab7b0774b6a5169ed7c46bfc7ed935/GPT2LM-0.0.3.tar.gz",
    "platform": null,
    "description": "# GPT2LM.bigram\r\n\r\nDescription: [Implementation of GPT-2 language model for text generation, training custom models and fine-tuning.]\r\n\r\n---\r\n\r\n## LanguageDataset\r\n\r\n---\r\n\r\nA class representing a language dataset for the Bigram Language Model.\r\n\r\n### Class Description\r\n\r\n#### Constructor\r\n\r\n    def __init__(self, path: str, train_val_split: int = 90)\r\n\r\nInitialize the LanguageDataset with the provided text data.\r\n\r\n##### Parameters\r\n\r\n1. `path` (str): The file path to the text data.\r\n\r\n2. `train_val_split` (int, optional): The percentage split between training and validation data. Defaults to 90 (90% training, 10% validation).\r\n\r\n---\r\n\r\n### Methods\r\n\r\n---\r\n\r\n#### \\_load_dataset\r\n\r\n    def _load_dataset(self, path: str) -> str\r\n\r\nLoad the text data from the given file path.\r\n\r\n##### Parameters\r\n\r\n1. `path` (str): The file path to the text data.\r\n\r\n##### Returns\r\n\r\n1. `text` (str): The loaded text data as a string.\r\n\r\n#### \\_encode\r\n\r\n    def _encode(self, s: str) -> List[int]\r\n\r\nConvert the text data into a list of numerical representations using character encoding.\r\n\r\n##### Parameters\r\n\r\n1. `s` (str): The input text data to be encoded.\r\n\r\n##### Returns\r\n\r\n1. `List[int]`: A list of integer token indices representing the encoded text.\r\n\r\n#### \\_decode\r\n\r\n    def _decode(self, l: List[int]) -> str\r\n\r\nConvert a list of numerical representations back into the original text.\r\n\r\n##### Parameters\r\n\r\n1. `l` (List[int]): A list of integer token indices representing the encoded text.\r\n\r\n##### Returns\r\n\r\n1. `str`: The decoded text as a string.\r\n\r\n#### \\_calculate_split\r\n\r\n    def _calculate_split(self, train_val_split: int) -> float\r\n\r\nCalculate the split index for separating training and validation data.\r\n\r\n##### Parameters\r\n\r\n1. `train_val_split` (int): The percentage split between training and validation data.\r\n\r\n##### Returns\r\n\r\n1. `float`: The split index as a float between 0 and 1, representing the percentage split.\r\n\r\n---\r\n\r\n### Properties\r\n\r\n---\r\n\r\n#### text\r\n\r\n**@property**\r\n\r\n    def text(self) -> str\r\n\r\nGet the loaded text data as a string.\r\n\r\n#### chars\r\n\r\n**@property**\r\n\r\n    def chars(self) -> List[str]\r\n\r\nGet the list of unique characters present in the text data.\r\n\r\n#### vocab_size\r\n\r\n**@property**\r\n\r\n    def vocab_size(self) -> int\r\n\r\nGet the size of the character vocabulary.\r\n\r\n#### stoi\r\n\r\n**@property**\r\n\r\n    def stoi(self) -> Dict[str, int]\r\n\r\nGet a dictionary mapping characters to their corresponding integer token indices.\r\n\r\n#### itos\r\n\r\n**@property**\r\n\r\n    def itos(self) -> Dict[int, str]\r\n\r\nGet a dictionary mapping integer token indices to their corresponding characters.\r\n\r\n#### train_data\r\n\r\n**@property**\r\n\r\n    def train_data(self) -> torch.Tensor\r\n\r\nGet the training data as a tensor containing numerical representations of the text.\r\n\r\n#### val_data\r\n\r\n**@property**\r\n\r\n    def val_data(self) -> torch.Tensor\r\n\r\nGet the validation data as a tensor containing numerical representations of the text.\r\n\r\n---\r\n\r\n---\r\n\r\n## Head\r\n\r\nA single head of self-attention used in the Transformer block.\r\n\r\n### Constructor\r\n\r\n    def __init__(self, head_size, n_embd, block_size, dropout):\r\n\r\nInitialize the Head with linear layers and a lower triangular mask.\r\n\r\n#### Parameters\r\n\r\n1. `head_size` (int): The size of the attention head.\r\n\r\n2. `n_embd` (int): The embedding dimension of the input.\r\n\r\n3. `block_size` (int): The maximum length of the input sequence.\r\n\r\n4. `dropout` (float): The dropout rate.\r\n\r\n### Methods\r\n\r\n#### forward\r\n\r\n    def forward(self, x):\r\n\r\nPerform the forward pass of the attention head.\r\n\r\n##### Parameters\r\n\r\n1. `x` (torch.Tensor): The input tensor of shape `(B, T, C)`, where `B` is the batch size, `T` is the sequence length, and `C` is the embedding dimension.\r\n\r\n##### Returns\r\n\r\n1. `torch.Tensor`: The output tensor after the self-attention operation. It has the same shape as the input tensor `(B, T, C)`.\r\n\r\n---\r\n\r\n---\r\n\r\n## MultiHeadAttention\r\n\r\nMulti-head self-attention module used in the Transformer block.\r\n\r\n### Class Description\r\n\r\n### Constructor\r\n\r\n    def __init__(self, num_heads, head_size, n_embd, block_size, dropout):\r\n\r\nInitialize the MultiHeadAttention with attention heads and projection layer.\r\n\r\n#### Parameters\r\n\r\n1. `num_heads` (int): The number of attention heads.\r\n\r\n2. `head_size` (int): The size of each attention head.\r\n\r\n3. `n_embd` (int): The embedding dimension of the input.\r\n\r\n4. `block_size` (int): The maximum length of the input sequence.\r\n\r\n5. `dropout` (float): The dropout rate.\r\n\r\n---\r\n\r\n### Methods\r\n\r\n---\r\n\r\n#### forward\r\n\r\n    def forward(self, x):\r\n\r\nPerform the forward pass of the multi-head self-attention module.\r\n\r\n##### Parameters\r\n\r\n1. `x` (torch.Tensor): The input tensor of shape `(B, T, C)`, where `B` is the batch size, `T` is the sequence length, and `C` is the embedding dimension.\r\n\r\n##### Returns\r\n\r\n1. `torch.Tensor`: The output tensor after applying the multi-head self-attention. It has the same shape as the input tensor `(B, T, C)`.\r\n\r\n---\r\n\r\n---\r\n\r\n## FeedForward\r\n\r\nFeed-forward neural network module used in the Transformer block.\r\n\r\n### Constructor\r\n\r\n    def __init__(self, n_embd, dropout):\r\n\r\nInitialize the FeedForward with linear layers and activation functions.\r\n\r\n#### Parameters\r\n\r\n1. `n_embd` (int): The embedding dimension of the input.\r\n\r\n2. `dropout` (float): The dropout rate.\r\n\r\n### Methods\r\n\r\n##### forward\r\n\r\n    def forward(self, x):\r\n\r\nPerform the forward pass of the feed-forward neural network.\r\n\r\n##### Parameters\r\n\r\n1. `x` (torch.Tensor): The input tensor of shape `(B, T, C)`, where `B` is the batch size, `T` is the sequence length, and `C` is the embedding dimension.\r\n\r\n##### Returns\r\n\r\n1. `torch.Tensor`: The output tensor after the feed-forward operation. It has the same shape as the input tensor `(B, T, C)`.\r\n\r\n---\r\n\r\n---\r\n\r\n## Block\r\n\r\n---\r\n\r\nTransformer block module used in the BigramLanguageModel.\r\n\r\n### Constructor\r\n\r\n    def __init__(self, n_embd, n_head, block_size, dropout):\r\n\r\nInitialize the Transformer block with self-attention and feed-forward modules.\r\n\r\n#### Parameters\r\n\r\n1. `n_embd` (int): The embedding dimension of the input.\r\n\r\n2. `n_head` (int): The number of attention heads in the multi-head self-attention.\r\n\r\n3. `block_size` (int): The maximum length of the input sequence.\r\n\r\n4. `dropout` (float): The dropout rate.\r\n\r\n---\r\n\r\n### Methods\r\n\r\n---\r\n\r\n#### forward\r\n\r\n    def forward(self, x):\r\n\r\nPerform the forward pass of the transformer block.\r\n\r\n##### Parameters\r\n\r\n1. `x` (torch.Tensor): The input tensor of shape `(B, T, C)`, where `B` is the batch size, `T` is the sequence length, and `C` is the embedding dimension.\r\n\r\n##### Returns\r\n\r\n1. `torch.Tensor`: The output tensor after applying the transformer block. It has the same shape as the input tensor `(B, T, C)`.\r\n\r\n---\r\n\r\n---\r\n\r\n## BigramLanguageModel\r\n\r\n---\r\n\r\n### Class Description\r\n\r\n#### Constructor\r\n\r\n    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout)\r\n\r\n#### Parameters\r\n\r\n1. `vocab_size` (int): The size of the vocabulary, which determines the number of tokens in the language model.\r\n\r\n2. `n_embd` (int): The embedding dimension of the input.\r\n\r\n3. `n_head` (int): The number of attention heads in the multi-head self-attention.\r\n\r\n4. `n_layer` (int): The number of Transformer blocks in the model.\r\n\r\n5. `block_size` (int): The maximum length of the input sequence.\r\n\r\n6. `dropout` (float): The dropout rate for regularization.\r\n\r\n---\r\n\r\n### Methods\r\n\r\n---\r\n\r\n#### \\_build_block\r\n\r\n    def _build_block(self) -> Block\r\n\r\nCreates a single Transformer block with specified embedding dimensions and attention heads.\r\n\r\n#### forward\r\n\r\n    def forward(self, idx: torch.Tensor, target: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor]]\r\n\r\nPerforms the forward pass of the BigramLanguageModel.\r\n\r\n##### Parameters\r\n\r\n1. `idx` (torch.Tensor): The input tensor representing the token indices of shape (B, T), where B is the batch size, and T is the sequence length.\r\n\r\n2. `target` (torch.Tensor, optional): The target tensor representing the token indices for computing the loss. It should have the same shape as idx. If None, no loss will be computed. (Default: None)\r\n\r\n##### Returns\r\n\r\n1. `torch.Tensor`: The logits tensor after the forward pass of the model. It has the shape (B, T, vocab_size).\r\n\r\n2. `torch.Tensor`: The loss tensor, computed using F.cross_entropy if the target is provided. Otherwise, None. It has the shape (B \\* T,) if target is not None.\r\n\r\n#### generate\r\n\r\n    def generate(self, idx: torch.Tensor, max_new_tokens: int) -> List[int]\r\n\r\nGenerates new tokens using the trained language model.\r\n\r\n##### Parameters\r\n\r\n1. `idx` (torch.Tensor): The input tensor representing the token indices of shape (B, T), where B is the batch size, and T is the sequence length.\r\n\r\n2. `max_new_tokens` (int): The maximum number of new tokens to generate.\r\n\r\n##### Returns\r\n\r\n1. `List[int]`: A list of generated token indices as integers.\r\n\r\n---\r\n\r\n---\r\n\r\n## Utility Functions\r\n\r\n---\r\n\r\n### set_hyperparams\r\n\r\n    def set_hyperparams(**kwargs)\r\n\r\nUpdate the hyperparameters dictionary with the provided keyword arguments.\r\n\r\n#### Parameters\r\n\r\n1. `**kwargs`: Keyword arguments with hyperparameter names as keys and their corresponding values.\r\n\r\n### load_dataset\r\n\r\n    def load_dataset(path: str, train_val_split: int = 90) -> LanguageDataset\r\n\r\nLoad a dataset from the given file path and create a **LanguageDataset** object.\r\n\r\n#### Parameters\r\n\r\n1. `path` (str): The file path of the dataset to load.\r\n\r\n2. `train_val_split` (int, optional): Percentage split between training and validation data. Default is 90, which means 90% training and 10% validation.\r\n\r\n#### Returns\r\n\r\n1. `LanguageDataset` : An instance of the **LanguageDataset** class containing the loaded dataset.\r\n\r\n### initialize_model\r\n\r\n    def initialize_model(vocab_size: int, n_embd: Optional[int] = None, n_head: Optional[int] = None, n_layer: Optional[int] = None, block_size: Optional[int] = None, dropout: Optional[float] = None, device: Optional[str] = None) -> BigramLanguageModel\r\n\r\nInitialize a BigramLanguageModel with the given hyperparameters.\r\n\r\n#### Parameters\r\n\r\n1. `vocab_size` (int): The size of the vocabulary, which determines the number of tokens in the language model.\r\n\r\n2. `n_embd` (int, optional): The embedding dimension of the input. If not provided, it will be set to a default value.\r\n\r\n3. `n_head` (int, optional): The number of attention heads in the multi-head self-attention. If not provided, it will be set to a default value.\r\n\r\n4. `n_layer` (int, optional): The number of Transformer blocks in the model. If not provided, it will be set to a default value.\r\n\r\n5. `block_size` (int, optional): The maximum length of the input sequence. If not provided, it will be set to a default value.\r\n\r\n6. `dropout` (float, optional): The dropout rate for regularization. If not provided, it will be set to a default value.\r\n\r\n7. `device` (str, optional): The device on which to place the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.\r\n\r\n#### Returns\r\n\r\n1. `BigramLanguageModel`: An instance of the **BigramLanguageModel** class initialized with the given hyperparameters.\r\n\r\n### train\r\n\r\n    def train(model: BigramLanguageModel, train_data: torch.Tensor, val_data: torch.Tensor, learning_rate: Optional[float] = None, max_iters: Optional[int] = None, eval_interval: Optional[int] = None, device: Optional[str] = None, eval_iters: Optional[int] = None) -> BigramLanguageModel\r\n\r\nTrain the provided model on the given training data and validate it on the validation data.\r\n\r\n#### Parameters\r\n\r\n1. `model` (BigramLanguageModel): The language model to train.\r\n\r\n2. `train_data` (torch.Tensor): The training data tensor.\r\n\r\n3. `val_data` (torch.Tensor): The validation data tensor.\r\n\r\n4. `learning_rate` (float, optional): The learning rate for the optimizer. If not provided, it will be set to a default value.\r\n\r\n5. `max_iters` (int, optional): The maximum number of training iterations. If not provided, it will be set to a default value.\r\n\r\n6. `eval_interval` (int, optional): Interval for printing evaluation results during training. If not provided, it will be set to a default value.\r\n\r\n7. `device` (str, optional): The device on which to train the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.\r\n\r\n8. `eval_iters` (int, optional): Number of iterations for evaluation during training. If not provided, it will be set to a default value.\r\n\r\n#### Returns\r\n\r\n1. `BigramLanguageModel`: The trained model.\r\n\r\n### save\r\n\r\n    def save(path: str, model: BigramLanguageModel)\r\n\r\nSave the state_dict of the provided model to the specified file path.\r\n\r\n#### Parameters\r\n\r\n1. `path` (str): The file path where the model state_dict will be saved.\r\n\r\n2. `model` (BigramLanguageModel): The model whose state_dict will be saved.\r\n\r\n### load\r\n\r\n    def load(path: str, model: BigramLanguageModel, device: Optional[str] = None)\r\n\r\nLoad the model state_dict from the specified file path and assign it to the provided model.\r\n\r\n#### Parameters\r\n\r\n1. `path` (str): The file path from which the model state_dict will be loaded.\r\n\r\n2. `model` (BigramLanguageModel): The model to which the loaded state_dict will be assigned.\r\n\r\n3. `device` (str, optional): The device on which to load the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.\r\n\r\n### fine_tune_model\r\n\r\n    def fine_tune_model(model: BigramLanguageModel, learning_rate: float, max_iters: int, eval_interval: int, train_data: torch.Tensor, val_data: torch.Tensor) -> BigramLanguageModel\r\n\r\nFine-tune the provided model using the given training data.\r\n\r\n#### Parameters\r\n\r\n1. `model` (BigramLanguageModel): The model to be fine-tuned.\r\n\r\n2. `learning_rate` (float): The learning rate for the optimizer during fine-tuning.\r\n\r\n3. `max_iters` (int): The maximum number of iterations for fine-tuning.\r\n\r\n4. `eval_interval` (int): The interval at which to evaluate the model's performance during fine-tuning.\r\n\r\n5. `train_data` (torch.Tensor): The training data for fine-tuning.\r\n\r\n6. `val_data` (torch.Tensor): The validation data for evaluating the model's performance during fine-tuning.\r\n\r\n#### Returns\r\n\r\n1. `BigramLanguageModel`: The fine-tuned model.\r\n\r\n### gpt2_124M\r\n\r\n    def gpt2_124M(training_data_path: str, lr: float, max_iters: int, dropout: float = 0.1, eval_iters: int = 200, train_val_split: int = 85) -> BigramLanguageModel\r\n\r\nTrain a GPT-2 model with 124 million parameters on the provided training data.\r\n\r\n#### Parameters\r\n\r\n1. `training_data_path` (str): The path to the training data file.\r\n\r\n2. `lr` (float): The learning rate for the optimizer during training.\r\n\r\n3. `max_iters` (int): The maximum number of iterations for training.\r\n\r\n4. `dropout` (float, optional): The dropout rate to be used in the model. Default is 0.1.\r\n\r\n5. `eval_iters` (int, optional): The interval at which to evaluate the model's performance during training. Default is 200.\r\n\r\n6. `train_val_split` (int, optional): The percentage of data to be used for training, and the rest for validation. Default is 85.\r\n\r\n#### Returns\r\n\r\n1. `BigramLanguageModel`: The trained GPT-2 model.\r\n\r\n#### NOTE\r\n\r\n1. **_`Parameters count depends on the vocab_size.`_**\r\n\r\n### gpt2_finetune\r\n\r\n    def gpt2_finetune(pretrained_model: BigramLanguageModel, training_data_path: str, lr: float, max_iters: int, eval_iters: int = 200, train_val_split: int = 85) -> BigramLanguageModel\r\n\r\nFine-tune a pretrained GPT-2 model on the provided training data.\r\n\r\n#### Parameters\r\n\r\n1. `pretrained_model` (BigramLanguageModel): The pretrained GPT-2 model to be fine-tuned.\r\n\r\n2. `training_data_path` (str): The path to the training data file.\r\n\r\n3. `lr` (float): The learning rate for the optimizer during fine-tuning.\r\n\r\n4. `max_iters` (int): The maximum number of iterations for fine-tuning.\r\n\r\n5. `eval_iters` (int, optional): The interval at which to evaluate the model's performance during fine-tuning. Default is 200.\r\n\r\n6. `train_val_split` (int, optional): The percentage of data to be used for training, and the rest for validation. Default is 85.\r\n\r\n#### Returns\r\n\r\n1. `BigramLanguageModel`: The fine-tuned model.\r\n\r\n### count_parameters\r\n\r\n    def count_parameters(model: torch.nn.Module) -> int\r\n\r\nCount the total number of trainable parameters in the given PyTorch model.\r\n\r\n#### Parameters\r\n\r\n1. `model` (torch.nn.Module): The PyTorch model for which the parameters need to be counted.\r\n\r\n#### Returns\r\n\r\n1. `int`: The total number of trainable parameters in the model.\r\n\r\n### get_batch\r\n\r\n    def get_batch(split: str, train_data: torch.Tensor, val_data: torch.Tensor, block_size: int = None, batch_size: int = None, device: str = None) -> Tuple[torch.Tensor, torch.Tensor]\r\n\r\nGet a batch of data for training or validation from the specified split.\r\n\r\n#### Parameters\r\n\r\n1. `split` (str): The split to get the data from ('train' or 'val').\r\n\r\n2. `train_data` (torch.Tensor): The training data tensor.\r\n\r\n3. `val_data` (torch.Tensor): The validation data tensor.\r\n\r\n4. `block_size` (int, optional): The size of each input sequence block. Defaults to None.\r\n\r\n5. `batch_size` (int, optional): The number of sequences in a batch. Defaults to None.\r\n\r\n6. `device` (str, optional): The device to store the batch data on ('cpu' or 'cuda'). Defaults to None.\r\n\r\n#### Returns\r\n\r\n1. `Tuple`[torch.Tensor, torch.Tensor]: A tuple containing two tensors - the input tensor representing the batch of sequences (x) and the target tensor representing the batch of sequences shifted by one (y).\r\n\r\n### estimate_loss\r\n\r\n    def estimate_loss(model: torch.nn.Module, train_data: torch.Tensor, val_data: torch.Tensor, eval_iters: int = None, block_size: int = None, batch_size: int = None, device: str = None) -> Dict[str, float]\r\n\r\nEstimate the average loss of the model on the training and validation data.\r\n\r\n### Parameters\r\n\r\n1. `model` (torch.nn.Module): The PyTorch model for which the loss is to be estimated.\r\n\r\n2. `train_data` (torch.Tensor): The training data tensor.\r\n\r\n3. `val_data` (torch.Tensor): The validation data tensor.\r\n\r\n4. `eval_iters` (int, optional): The number of iterations for estimating the loss. Defaults to None.\r\n\r\n5. `block_size` (int, optional): The size of each input sequence block. Defaults to None.\r\n\r\n6. `batch_size` (int, optional): The number of sequences in a batch. Defaults to None.\r\n\r\n7. `device` (str, optional): The device to perform computations on ('cpu' or 'cuda'). Defaults to None.\r\n\r\n#### Returns\r\n\r\n1. `Dict`[str, float]: A dictionary containing the average loss on the training and validation data.\r\n\r\n   _`Keys`_: 'train' and 'val'\r\n\r\n   _`Values`_: Average loss values (float)\r\n\r\n### default_hyperparams\r\n\r\n    default_hyperparams = {\r\n                                'batch_size': 64,\r\n                                'block_size': 256,\r\n                                'max_iters': 5000,\r\n                                'learning_rate': 5.859375e-05,\r\n                                'device': 'cuda' if torch.cuda.is_available() else 'cpu',\r\n                                'eval_iters': 200,\r\n                                'n_embd': 384,\r\n                                'n_head': 6,\r\n                                'n_layer': 6,\r\n                                'dropout': 0.2,\r\n\r\n    }\r\n\r\nA dictionary containing the default hyperparameters for the GPT-2 model.\r\n\r\n### set_hyperparamsdefault\r\n\r\n      def set_hyperparamsdefault()\r\n\r\nSet the default hyperparameters for the GPT-2 model using the set_hyperparams function.\r\n\r\n---\r\n\r\n## Example Usage\r\n\r\n### Train model on default Hyperparameters\r\n\r\n    # Import the required functions and classes\r\n\r\n    from GPT2LM import bigram as b\r\n\r\n    # Path to your training data file\r\n\r\n    training_data_path = 'path_to_your_training_data.txt'\r\n\r\n    # Load the dataset and split into training and validation sets (default split: 85% training, 15% validation)\r\n\r\n    dataset = b.load_dataset(training_data_path)\r\n\r\n    # set default hyperparams\r\n\r\n    b.set_hyperparams()\r\n\r\n    # Initialize the GPT-2 model with default hyperparameters and train on the dataset\r\n\r\n    model = b.initialize_model(vocab_size=dataset.vocab_size) # keep record of the 'vocab_size' to use later.\r\n\r\n    # Train the model on the training data and validate it on the validation data\r\n\r\n    trained_model = b.train(model, dataset.train_data, dataset.val_data)\r\n\r\n    # Save the trained model to a file\r\n\r\n    b.save('path_to_save_model.pt', trained_model)\r\n\r\n### Train model on custom hyperparameters\r\n\r\n    # Import the required functions and classes\r\n\r\n    from GPT2LM import bigram as b\r\n\r\n    # Path to your training data file\r\n\r\n    training_data_path = 'path_to_your_training_data.txt'\r\n\r\n    # Load the dataset and split into training and validation sets (default split: 85% training, 15% validation)\r\n\r\n    dataset = b.load_dataset(training_data_path)\r\n\r\n    # set custom hyperparams\r\n\r\n    3 This will assign custom values to default hyperparams\r\n\r\n    batch_size=32\r\n    block_size=500\r\n    max_iters=10000\r\n    learning_rate=0.0001\r\n    eval_interval=250\r\n    n_embd=800\r\n    n_head=8\r\n    n_layer=8\r\n    dropout=0.1\r\n\r\n    b.set_hyperparams(batch_size=batch_size,\r\n                    block_size=block_size,\r\n                    max_iters=max_iters,\r\n                    learning_rate=learning_rate,\r\n                    eval_interval=eval_interval,\r\n                    n_embd=n_embd,\r\n                    n_head=n_head,\r\n                    n_layer=n_layer,\r\n                    dropout=dropout\r\n    )\r\n\r\n    # Initialize the GPT-2 model with custom hyperparameters and train on the dataset\r\n\r\n    model = b.initialize_model(vocab_size=dataset.vocab_size) # keep record of the 'vocab_size' to use later.\r\n\r\n    # Train the model on the training data and validate it on the validation data\r\n\r\n    trained_model = b.train(model, dataset.train_data, dataset.val_data)\r\n\r\n    # Save the trained model to a file\r\n\r\n    b.save('path_to_save_model.pt', trained_model)\r\n\r\n### Generate text\r\n\r\n    # initiate context\r\n    context=torch.zeros((1,1),dtype=torch.long,device=device)\r\n\r\n    # generate the new tokens and decode them into text\r\n    text=dataset._decode(model.generate(context,max_new_tokens=200))\r\n\r\n### Generate text with prompt\r\n\r\n    # initiate context\r\n    context=torch.zeros((1,1),dtype=torch.long,device=device)\r\n\r\n    # Give prompt\r\n\r\n    prompt = \"To be or not to be, that is the question.\"\r\n\r\n    # encode prompt into tokens\r\n\r\n    prompt_encoded = torch.tensor(dataset._encode(prompt), dtype=torch.long, device=device)\r\n    prompt_encoded = prompt_encoded.unsqueeze(0)\r\n\r\n    # initiate context with prompt\r\n\r\n    context_with_prompt = torch.cat((context, prompt_encoded), dim=1)\r\n\r\n    # generate the new tokens and decode them into text\r\n\r\n    text=dataset._decode(model.generate(context_with_prompt,max_new_tokens=200))\r\n\r\n## About\r\n\r\nThis project is designed for educational purposes to help you learn and understand how generative AI works. It demonstrates the implementation of a GPT-2 variant, a popular generative language model, and provides a hands-on experience in training language models on custom datasets. By exploring the code and running the provided examples, you can gain insights into the underlying principles of generative AI and natural language processing.\r\n\r\nPlease note that while this project serves as a learning resource, it is essential to be mindful of the ethical considerations and potential risks associated with AI models. The use of AI technology, especially in generating text, should be done responsibly and with respect for ethical guidelines.\r\n\r\n\r\n\r\n### How GPT works\r\n\r\nGenerative Pre-trained Transformer (GPT) models, such as GPT-2, are designed to predict the next word or token in a sequence of text. They achieve this by leveraging the power of self-attention mechanisms within the Transformer architecture.\r\n\r\nAt a high level, here's how GPT models work to predict the next word or token:\r\n\r\n1. `Language Modeling`: GPT models are trained on large amounts of text data using a process called language modeling. During training, the model learns the statistical patterns and dependencies present in the input text. It captures the relationships between different words and the likelihood of one word following another in a sentence.\r\n\r\n2. `Contextual Understanding`: The key feature of GPT models is their ability to understand the context in which a word or token appears. Instead of treating each word in isolation, the model considers the entire sequence of tokens and assigns higher importance to the tokens that are most relevant for understanding the current word.\r\n\r\n3. `Self-Attention`: Self-attention is the mechanism that allows GPT models to weigh the importance of different tokens in the input sequence based on their contextual relevance. Tokens that are semantically related or have strong dependencies receive higher attention scores, while unrelated tokens receive lower attention scores.\r\n\r\n4. `Autoregressive Generation`: To predict the next word or token in a sequence, GPT models use an autoregressive generation process. This means that the model predicts each token one by one, conditioned on the tokens that came before it in the sequence. The previously predicted tokens serve as context to guide the prediction of the next token.\r\n\r\n5. `Text Completion`: By repeatedly predicting the next token and using the predicted tokens as context, the GPT model can complete a given prompt or generate entirely new text. The generated text is coherent and contextually appropriate because the model has learned from vast amounts of text data during pre-training.\r\n\r\n6. `Adaptability`: GPT models are highly adaptable and can be fine-tuned on specific tasks or domains after pre-training. This process allows them to perform well on a wide range of natural language processing tasks, such as text generation, machine translation, question answering, and more.\r\n\r\nIn summary, GPT models are powerful language models that predict the next word or token in a sequence by understanding the context and dependencies between words. Their ability to generate coherent and contextually relevant text makes them valuable tools for various natural language processing applications.\r\n\r\n#### How we trick them to chat with user\r\n\r\nAs these models are designed to complete given documents, they are not capable of engaging in real-time conversations with users. If you present them with a question, they will simply generate more questions that resemble the input. To overcome this limitation, we use a specific format for interaction.\r\n\r\nFor instance, we frame the conversation as if it's between a human and an AI. This format helps the AI understand its role and context better. Here's an example prompt:\r\n\r\n\"AI: What can I do for you today?\r\n\\nHuman: ...\"\r\n\r\nIn this format, the human provides their input after \"Human:\", and the AI responds accordingly. The entire conversation is treated as a single document, with the AI's responses generated by the model and the human's responses provided by an actual person.\r\n\r\nThis approach allows us to harness the power of the language model to generate creative and contextually relevant AI responses, while human input ensures that the conversation remains meaningful and coherent.\r\n\r\nBy using this interaction format, we can effectively leverage the capabilities of the language model in completing text documents, making it a useful tool for various natural language processing tasks.\r\n\r\n### Installation\r\n\r\n    pip install GPT2LM\r\n\r\n### What is new in this version?\r\n\r\nThere are no such big changes in bigram, the only difference is the use of the **`GELU`** activation function instead of the **`ReLU`** activation function.\r\n\r\n`ReLU(x)` or `Rectified Linear Unit` = max(0,x)\r\n\r\n`GELU(x)` or `Gaussian Error Linear Unit` = x\u03a6(x) = 0.5x(1+tanh[2/\u03c0(x+0.044715x^3)])\r\n\r\n#### Why GELU is used?\r\n\r\nThe `Gaussian Error Linear Unit (GELU)` is an activation function used in deep learning. It is defined as `x\u03a6(x)`, where `\u03a6(x)` is the `standard Gaussian cumulative distribution function`. The GELU nonlinearity weights inputs by their percentile, rather than gating inputs by their sign as in ReLUs. Consequently, the GELU can be thought of as a smoother ReLU.\r\n\r\nIt is not known why certain activation functions work better than others in different contexts. So the only answer for \u201cwhy use GELU instead of ReLu\u201d is \"because it works better\". However, there are some possible explanations. For example, ReLU can suffer from \"problems where a significant amount of neurons in the network become zero and don\u2019t practically do anything\". GELU is smoother near zero and \u201cis differentiable in all ranges, and allows to have gradients (although small) in the negative range\u201d which helps with this problem.\r\n\r\nIn summary, GELU is used in generative AI instead of ReLU because it has been found to work better in practice. It is a smoother version of ReLU that allows for gradients in the negative range, which can help prevent problems with neurons becoming inactive.\r\n\r\n#### But why even use activation functions?\r\n\r\nActivation functions are used in neural networks to introduce non-linearity into the output of a neuron. Without an activation function, the weights and bias would simply do a linear transformation. A linear equation is simple to solve but is limited in its capacity to solve complex problems and have less power to learn complex functional mappings from data. A neural network without an activation function is just a **`linear regression model`**.\r\n\r\nGenerally, neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions. Activation functions provide the building blocks that can be used repeatedly in two dimensions of the network structure so that, combined with an attenuation matrix to vary the weight of signaling from layer to layer, is known to be able to approximate an arbitrary and complex function.\r\n\r\nIn summary, activation functions are used in neural networks to introduce non-linearity into the output of a neuron, which allows the network to learn complex data and provide accurate predictions. They are essential building blocks of neural networks.\r\n\r\n### Is the model is GPT (GPT VS Bigram) ?\r\n\r\nA bigram is a type of n-gram model that represents text as the joint probability of a sequence of two words. Bigrams, along with other n-grams, are used in most successful language models for speech recognition.\r\n\r\nOn the other hand, GPT (Generative Pretrained Transformer) is a type of neural language model that uses a transformer architecture to generate text. GPT is a modern Transformer language model, which is more complex than a bigram character-level language model.\r\n\r\nIn summary, a bigram is a simple statistical model that represents text as the joint probability of a sequence of two words, while GPT is a more complex neural language model that uses a transformer architecture to generate text. The two models have different strengths and weaknesses and can be used for different purposes.\r\n\r\n#### But it does not answer my question.\r\n\r\nCertainly, let's differentiate between `GPT` (a type of transformer model), a `bigram` model, and the `custom BigramLanguageModel` :\r\n\r\n1. **`GPT (Generative Pre-trained Transformer)`**:\r\n\r\n   1. `Architecture`: GPT is based on the transformer architecture, which consists of a stack of multiple layers of self-attention and feedforward neural networks.\r\n   2. `Self-Attention`: GPT uses multi-head self-attention to capture relationships between tokens in the input sequence.\r\n   3. `Parameter Sharing`: GPT is a fully autoregressive model, meaning it generates one token at a time based on the previously generated tokens. It doesn't rely on fixed n-grams like bigram models.\r\n   4. `Pre-training`: GPT is pre-trained on a large corpus of text data, learning language patterns, and semantic understanding from diverse text sources.\r\n   5. `Generative`: GPT is designed for generating text and is commonly used for tasks like text completion, text generation, and language translation.\r\n\r\n2. **`Bigram Model`**:\r\n\r\n   1. `Architecture`: A bigram model is a simple statistical language model that focuses on pairs of consecutive words (bigrams) in a text corpus.\r\n   2. `Scope`: It doesn't have a deep neural network architecture like transformers or GPT. Instead, it relies on counting the frequency of bigrams in a given text corpus.\r\n   3. `Parameterization`: Bigram models have a relatively small number of parameters compared to deep learning models.\r\n   4. `Generative Power`: Bigram models are less expressive than GPT and cannot capture complex language structures or long-range dependencies in the same way.\r\n   5. `Use Case`: Bigram models are often used for basic language modeling tasks but are less suited for generating coherent and contextually rich text compared to GPT.\r\n\r\n3. **`Custom Model (BigramLanguageModel)`**:\r\n\r\n   1. `Architecture`:This model a custom implementation. It combines elements of self-attention and feedforward neural networks within each block.\r\n   2. `Embeddings`: It uses token embeddings and positional embeddings, which are common in transformer-based models.\r\n   3. `Multi-Head Attention`: Like GPT, it incorporates multi-head self-attention within each block.\r\n   4. `Complexity`: This custom model is more complex and expressive than a simple bigram model but may not have the same generative power or pre-training as GPT.\r\n\r\nIn summary, GPT is a state-of-the-art transformer-based model designed for generative tasks, bigram models are simpler statistical models focusing on word pairs, and the custom model combines elements of transformers but is a custom implementation tailored for a specific purpose. Each model has its own strengths and weaknesses depending on the task at hand.\r\n\r\n### What you have to do.\r\n\r\nGo to Colab or Kaggle and train two models, both on the same dataset but with different versions of GPT2LM, to understand how the model works with different activation functions and why generative AI like GPT uses GELU instead of ReLU, probably the most popular activation function.\r\n\r\nAlso implement your own ReLU and GELU function and plot the graph of them using matplotlib on two or three different ranges to see the actual difference in both functions because on large ranges the graph will overlap each other and appears as the same graph. So on smaller ranges [-1,1] you will see a very small difference in graphs.\r\n\r\n\r\n# GPT2LM.gpt2\r\n\r\nDescription: [Implementation of GPT-2 language model for text generation, training custom models and fine-tuning.]\r\n\r\n## Utils Module Documentation\r\n\r\n### Overview\r\n\r\nThe `utils` module provides various utility functions used for common tasks such as setting random seeds for reproducibility, setting up logging for experiments, loading/saving models and data, and generating text using language models. Additionally, it includes a lightweight configuration class inspired by YACS (`CfgNode`) for managing experiment configurations.\r\n\r\n### Functions\r\n\r\n1. `set_seed(seed)`\r\n   - Description: Set the random seeds for reproducibility across different runs.\r\n   - Parameters:\r\n   - `seed` (int): The seed value to set for random number generators. \r\n2. `setup_logging(config)`\r\n   - Description: Set up logging for the experiment. This function creates necessary directories and logs configurations.\r\n   - Parameters:\r\n     - `config` (object): Configuration object containing experiment settings.\r\n3. `CfgNode`\r\n    - Description: A lightweight configuration class inspired by YACS. This class provides a flexible way to manage configurations for experiments or applications.\r\n    - `load_text(path)`\r\n    - Description: Load text data from a file.\r\n    - Parameters:\r\n      - `path` (str): The path to the text file.\r\n    - Returns:\r\n      - `str` or `None`: The content of the text file as a string if successful, otherwise `None`.\r\n4. `load_text(path)`\r\n\r\n- **Description**: Load text data from a file.\r\n\r\n- **Parameters**:\r\n  - `path` (str): The path to the text file.\r\n\r\n- **Returns**:\r\n  - `str` or `None`: The content of the text file as a string if successful, otherwise `None`.\r\n\r\n- **Raises**:\r\n  - `FileNotFoundError`: If the specified file path does not exist.\r\n  - `Exception`: If an error occurs while reading the file.\r\n\r\n 5. `load_json_to_text(path)`\r\n\r\n- **Description**: Load JSON data from a file and convert it into a single text string.\r\n\r\n- **Parameters**:\r\n  - `path` (str): The path to the JSON file.\r\n\r\n- **Returns**:\r\n  - `str` or `None`: A single text string containing the JSON data if successful, otherwise `None`.\r\n\r\n- **Raises**:\r\n  - `FileNotFoundError`: If the specified file path does not exist.\r\n  - `Exception`: If an error occurs while reading the JSON file or converting it to text.\r\n\r\n---\r\n\r\n6. `load_csv_to_text(path)`\r\n\r\n- **Description**: Load data from a CSV file and convert it into a single text string.\r\n\r\n- **Parameters**:\r\n  - `path` (str): The path to the CSV file.\r\n\r\n- **Returns**:\r\n  - `str` or `None`: A single text string containing the CSV data if successful, otherwise `None`.\r\n\r\n- **Raises**:\r\n  - `FileNotFoundError`: If the specified file path does not exist.\r\n  - `Exception`: If an error occurs while reading the CSV file or converting it to text.\r\n\r\n  1. `save(path, model)`\r\n     - Description: Save the model state dictionary to a file.\r\n      - Parameters:\r\n        - `path` (str): Path to save the model state dictionary.\r\n        - `model `(`torch.nn.Module`): Model to save.\r\n  2. `load(path, model, device=None)`\r\n      - Description: Load the model state dictionary from a file.\r\n      - Parameters:\r\n        - `path` (str): Path to the saved model state dictionary.\r\n        - `model` (`torch.nn.Module`): Model to load the state dictionary into.\r\n        - `device` (str, optional): Device to load the model onto (e.g., 'cuda' or 'cpu'). If None, it will automatically detect CUDA availability.\r\n     - Returns:\r\n       - `None`\r\n  3. `Generate(model, prompt, encoder_path, max_len)`\r\n     - Description: Generate text based on a given prompt using the provided model.\r\n     - Parameters:\r\n       - `model` (`torch.nn.Module`): The language model used for text generation.\r\n       - `prompt` (str): The prompt to start text generation.\r\n       - `encoder_path` (str): Path to the encoder file for encoding/decoding characters.\r\n       - `max_len` (int): Maximum length of the generated text.\r\n     - Returns:\r\n        - `str`: Generated text based on the prompt.\r\n  4.  `print_config()`\r\n      - Description: Function to print the configurations of various GPT models.\r\n      - Prints the configurations of different GPT models including the number of layers, attention heads, and embedding size.\r\n  ### Classes\r\n  - `CfgNode`: Lightweight configuration class inspired by YACS.\r\n  ### Dependencies\r\n  - `os`, `sys`, `csv`, `math`, `json`, `time`, `random`, `requests`, `regex`, `numpy`, `tqdm`:  Standard Python libraries for various tasks.\r\n  - `torch`, `torch.nn`, ``torch.utils.data`: PyTorch library for deep learning.\r\n  - `torch.utils.data.dataloader`: DataLoader class for loading data in batches.\r\n  ### Global Variables\r\n  - `device`: Represents the device on which tensors reside, either \"cuda\" (GPU) if available or \"cpu\" if not available.\r\n ### Notes\r\n - Ensure proper installation of PyTorch and other dependencies to utilize all functions effectively.\r\n  \r\n  ## GPT Language Model\r\n\r\n  The GPT (Generative Pre-trained Transformer) Language Model is a powerful model architecture based on the Transformer architecture. This implementation includes functionalities for training and generation of text.\r\n\r\n  ### Model Architecture:\r\n\r\n  1. **Embedding Layers:**\r\n     - Token Embeddings: Embeds input tokens into a continuous vector space.\r\n     - Position Embeddings: Embeds the position of each token in the input sequence.\r\n  2. **Transformer Blocks:**\r\n     - Each block consists of a causal self-attention layer followed by a feed-forward neural network (MLP).\r\n     - Layer normalization and residual connections are applied within each block.\r\n  3. **Language Model Head:**\r\n     - Linear layer projecting the output of the transformer blocks into logits over the vocabulary.\r\n\r\n  ### Key Features:\r\n\r\n  1. **Configurable:** \r\n     - The model architecture can be configured using a `CfgNode` object, allowing customization of the number of layers, heads, embedding size, and other parameters.\r\n   2. **Optimization:**\r\n      - Provides methods for configuring optimizers for training, including options for weight decay and learning rate scheduling.\r\n   3. **Generation:**\r\n      - Includes functionality for generating new sequences of text given a conditioning sequence.\r\n   4. **Initialization:**\r\n      - Parameters are initialized following best practices for transformer-based models, including custom initialization for certain layers.\r\n  ### Parameters:\r\n  - `config`: Configuration object containing model parameters.\r\n  - `idx`: Input tensor representing token indices.\r\n  - `targets`: Target tensor for training.\r\n  - `max_new_tokens`: Maximum number of tokens to generate during text generation.\r\n  - `temperature`: Temperature parameter for sampling during generation.\r\n  - `do_sample`: Boolean indicating whether to sample from the distribution during generation.\r\n  - `top_k`: Number of top-k tokens to consider during sampling.\r\n  ### Methods:\r\n  - `forward(idx, targets)`: Performs a forward pass through the model.\r\n  - `generate(idx, max_new_tokens, temperature, do_sample, top_k)`: Generates new tokens given a conditioning sequence.\r\n  ### Attributes:\r\n  - `block_size`: Maximum block size for input sequences.\r\n  - `device`: Device (CPU or GPU) where the model is currently located.\r\n  ## CharDataset\r\n  The `CharDataset` class is designed for handling character-level text data. It prepares the data for training by encoding characters into numerical indices. This dataset is suitable for language modeling tasks where the model predicts the next character in a sequence given the previous characters.\r\n\r\n  ### Initialization:\r\n       CharDataset(data, block_size, output_dir='dir/')\r\n\r\n- `data (str)`: Text data to be processed.\r\n- `block_size (int)`: Size of the text blocks used for training.\r\n- `output_dir (str, optional)`: Output directory to save the encoder JSON file.\r\n\r\n### Attributes:\r\n- `vocab_size`: Number of unique characters in the dataset.\r\n- `block_size`: Size of the text blocks.\r\n- `data`: Raw text data.\r\n  ### Methods:\r\n\r\n- `get_vocab_size()`: Get the vocabulary size.\r\n- `get_block_size()`: Get the block size.\r\n- `__len__()`: Get the length of the dataset.\r\n- `__getitem__(idx)`: Get an item from the dataset by index.\r\n---\r\n\r\n## Encoder\r\n\r\nThe `Encoder` class provides functionality for encoding and decoding text using an encoder dictionary. It loads the encoder dictionary from a JSON file and provides methods to encode and decode text.\r\n\r\n### Initialization:\r\n    Encoder(encoder_path)\r\n\r\n- `encoder_path (str)`: Path to the encoder JSON file.\r\n### Methods:\r\n- `encode(s)`: Encode a string into a list of integers. \r\n- `decode(l)`: Decode a list of integers into a string.\r\n  ### Example Usage:\r\n\r\n      encoder = Encoder(\"encoder.json\")\r\n      encoded_text = encoder.encode(\"Hello, world!\")\r\n      decoded_text = encoder.decode(encoded_text)\r\n\r\n---\r\n\r\n## FineEncoderDataset\r\n\r\nThe `FineEncoderDataset` class is similar to `CharDataset` but allows for fine control over the encoding process. It requires an external encoder file (JSON format) for encoding and decoding the text data.\r\n\r\n### Initialization:\r\n\r\n    FineEncoderDataset(data, encoder_path, block_size)\r\n\r\n- `data (str)`: Input data for the dataset.\r\n- `encoder_path (str)`: Path to the encoder JSON file.\r\n- `block_size (int)`: Size of each block in the dataset.\r\n\r\n### Methods:\r\n\r\n- `encode(s)`: Encode the input string using the encoder.\r\n- `decode(l)`: Decode the input list using the encoder.\r\n\r\n### Example Usage:\r\n\r\n    dataset = FineEncoderDataset(data, \"encoder.json\", block_size=128)\r\n\r\n---\r\n\r\n## Trainer\r\n\r\nThe `Trainer` class is responsible for training neural networks. It provides functionality for running the training loop, handling data loading, optimization, and callback triggering.\r\n\r\n### Initialization:\r\n\r\n    Trainer(config, model, train_dataset)\r\n\r\n- `config (CfgNode)`: Configuration for the Trainer.\r\n- `model (nn.Module)`: The neural network model.\r\n- `train_dataset`: The training dataset.\r\n\r\n### Attributes:\r\n- `device`: Device (CPU or GPU) used for training.\r\n- `optimizer`: Optimizer used for training.\r\n- `callbacks`: Dictionary storing callback functions for different events.\r\n- `iter_num`: Current iteration number.\r\n- `iter_time`: Time at the current iteration.\r\n- `iter_dt`: Time difference between the current and previous iteration.\r\n \r\n### Methods:\r\n- `add_callback(onevent, callback)`: Add a callback function to be triggered on an event.\r\n- `set_callback(onevent, callback)`: Set a callback function to be triggered on an event.\r\n- `trigger_callbacks(onevent)`: Trigger the callbacks for a specific event.\r\n- `run(num_epochs=None)`: Run the training loop.\r\n  \r\n  ### Example Usage:\r\n\r\n      trainer = Trainer(config, model, train_dataset)\r\n      trainer.run(num_epochs=10)\r\n\r\n\r\n---\r\n\r\n## FineTuner\r\n\r\nThe `FineTuner` class is used for fine-tuning a pre-trained model with new data. It initializes the training process with the provided model, data, encoder path, block size, and training configuration.\r\n\r\n### Initialization:\r\n\r\n    FineTuner(model, data, encoder_path, block_size, train_config)\r\n\r\n- `model (torch.nn.Module)`: Model to fine-tune.\r\n- `data (str)`: Input data for fine-tuning.\r\n- `encoder_path (str)`: Path to the encoder file.\r\n- `block_size (int)`: Size of each block in the dataset.\r\n- `train_config (CfgNode)`: Training configuration for the fine-tuning process.\r\n\r\n### Attributes:\r\n\r\n- `device`: Device (CPU or GPU) used for fine-tuning.\r\n\r\n### Methods:\r\n\r\n- `fine_tune(epoch=1)`: Perform fine-tuning of the model for the specified number of epochs.\r\n\r\n### Example Usage:\r\n\r\n    fine_tuner = FineTuner(model, data, encoder_path, block_size, train_config)\r\n    fine_tuner.fine_tune(epoch=5)\r\n\r\n\r\n---\r\n\r\n## Training a model\r\n\r\n    from GPT2LM import gpt2\r\n\r\n    block_size=1024 #Example block_size, you can choose your own\r\n    text=gpt2.load_text('path to .txt) # Can use load_csv() or load_json() for csv and json \r\n    train_dataset=gpt2.CharDataset(text, block_size, 'out_dir') # Encoder.json will be save in out_dir\r\n\r\n    model_config = gpt2.GPT.get_default_config()\r\n    model_config.model_type = 'gpt-micro' # more model typesare available, use `print_config()` to see all\r\n    model_config.vocab_size = train_dataset.vocab_size\r\n    model_config.block_size = block_size  \r\n    model = gpt2.GPT(model_config)\r\n\r\n\r\n    train_config = gpt2.Trainer.get_default_config()\r\n    train_config.learning_rate = 5e-4 \r\n    train_config.max_iters = 1000 #iters per epoch\r\n    train_config.batch_size = 64\r\n    trainer = gpt2.Trainer(train_config, model, train_dataset)\r\n    trainer.run(10) # number of epochs = 10\r\n\r\n    print(gpt2.Generate(model, 'Hi', 'path_to_encoder.json', 200))  # 'Hi' is prompt and 200 is the maximum length of generated response means 200 characters will be generated\r\n\r\n\r\n## Fine Tuning a model\r\n\r\n    from GPT2LM import gpt2\r\n\r\n    block_size=1024  # same as pre-trained model\r\n\r\n    model_config = gpt2.GPT.get_default_config()\r\n    model_config.model_type = 'gpt-micro'\r\n    model_config.vocab_size = n # same as pre-trained model\r\n    model_config.block_size = block_size  \r\n    model = gpt2.GPT(model_config)\r\n\r\n    load('path_to_pre_trained_model', model) # Load model weights from pre-trained model\r\n    \r\n    encoder_path = \"path_to_encoder.json\"\r\n    fine_data=gpt2.load_text('path_to_txt_file)'\r\n    fine_config = gpt2.Trainer.get_default_config()\r\n    fine_config.learning_rate = 5e-4\r\n    fine_config.max_iters = 1000\r\n    fine_config.batch_size = 64\r\n    finetuner = gpt2.FineTuner(model, fine_data, encoder_path, block_size, fine_config)\r\n    finetuner.fine_tune(3)\r\n\r\n\r\n## License\r\n\r\nThis project is licensed under the [MIT License](LICENSE).\r\n\r\n## Talks\r\nYou can check the model `Model_Card` for all the available model types in `GPT2LM.gpt2`.\r\n\r\nHere are some links for better understanding.\r\n\r\n1. **_`Attention Is All You Need`_** [link here](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)\r\n\r\n2. **_`Textbooks Are All You Need`_** [Link here](https://www.researchgate.net/publication/371729045_Textbooks_Are_All_You_Need)\r\n\r\n3. **_`GPT(Generative Pre-trained Transformers)`_** [Link here](https://aws.amazon.com/what-is/gpt/)\r\n\r\n4. **_`Language Models are Unsupervised Multitask Learners`_** [Link here](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)\r\n\r\n5. **_`GPT-2`_** [Link here](https://en.wikipedia.org/wiki/GPT-2)\r\n6. **_`GELU`_** [Link here](https://paperswithcode.com/method/gelu)\r\n7. **_`ReLU`_** [Link here](https://paperswithcode.com/method/relu)\r\n8. **_`ReLU VS GELU`_** [Link here](https://towardsai.net/p/l/is-gelu-the-relu-successor)\r\n9. **_`GIT Link of this project`_** [Link here](https://github.com/YashrajBaila7/GPT2LM)\r\n\r\n10. **_`PyPI Link of this project`_** [Link here](https://pypi.org/project/GPT2LM/)\r\n\r\n\r\nCode is completely open source so you can play around it and get a better understanding of it.\r\n\r\n**_GPT-3_** Hyperparameter are not publicly available yet but you can use your own hyperparameter to train a **_gpt-3_** like base model.\r\n\r\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Implementation of GPT-2 language model for text generation, training custom models and fine-tuning.",
    "version": "0.0.3",
    "project_urls": {
        "Homepage": "https://github.com/YashrajBaila7/GPT2LM"
    },
    "split_keywords": [
        "gpt2lm",
        "gpt2lm",
        "gpt",
        "gpt",
        "language model",
        "yashraj baila",
        "ai",
        "artificial intelligence",
        "ml",
        "meachine learning",
        "deep learning",
        "gpt2",
        "gpt2",
        "gpt3",
        "gpt3",
        "gpt4",
        "gpt4",
        "chatgpt",
        "chatgpt",
        "chatgpt"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a92aae5dd7264054a2bf95d76e824d4d0a5844a8be554894b5a0e4f185042afd",
                "md5": "73d624cb5fae5b6388962669e0ddf02d",
                "sha256": "e7085c1250f0a23c03a3da6525a2df418ddc229801b9776b8ec9740354497bea"
            },
            "downloads": -1,
            "filename": "GPT2LM-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "73d624cb5fae5b6388962669e0ddf02d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 39355,
            "upload_time": "2024-02-16T15:31:16",
            "upload_time_iso_8601": "2024-02-16T15:31:16.896050Z",
            "url": "https://files.pythonhosted.org/packages/a9/2a/ae5dd7264054a2bf95d76e824d4d0a5844a8be554894b5a0e4f185042afd/GPT2LM-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "88b9ca0c3d43b61672fb59b000b8ccad8fab7b0774b6a5169ed7c46bfc7ed935",
                "md5": "44e45a6d017258b77bf542f34f5dd179",
                "sha256": "514d73619360837d4ecacb641ce53d05821b56cd886c8b0ac250169579503298"
            },
            "downloads": -1,
            "filename": "GPT2LM-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "44e45a6d017258b77bf542f34f5dd179",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 62328,
            "upload_time": "2024-02-16T15:31:18",
            "upload_time_iso_8601": "2024-02-16T15:31:18.573826Z",
            "url": "https://files.pythonhosted.org/packages/88/b9/ca0c3d43b61672fb59b000b8ccad8fab7b0774b6a5169ed7c46bfc7ed935/GPT2LM-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-16 15:31:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "YashrajBaila7",
    "github_project": "GPT2LM",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "gpt2lm"
}

Yashraj Baila