# buildNanoGPT
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
> `buildNanoGPT` is developed based on Andrej Karpathy’s
> [build-nanoGPT](https://github.com/karpathy/build-nanoGPT) repo and
> [Let’s reproduce GPT-2
> (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU) with added notes
> and details for teaching purposes using
> [nbdev](https://nbdev.fast.ai/), which enables package development,
> testing, documentation, and dissemination all in one place - a Jupyter
> Notebook or, in my case, Jupyter notebooks in Visual Studio Code 😄.
## Literate Programming
`buildNanoGPT`
``` mermaid
flowchart LR
A(Andrej's build-nanoGPT) --> C((Combination))
B(Jeremy's nbdev) --> C
C -->|Literate Programming| D(buildNanoGPT)
```
<img src='media/literate_programming.svg' width='100%' height='auto'>
## Disclaimers
`buildNanoGPT` is written based on [Andrej
Karpathy](https://karpathy.ai/)’s GitHub repo
[build-nanoGPT](https://github.com/karpathy/build-nanoGPT) and his [“Neural
Networks: Zero to
Hero”](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
lecture series, specifically the lecture [Let’s reproduce GPT-2
(124M)](https://www.youtube.com/watch?v=l8pRSuU81PU).
Andrej is the man who needs no introduction in the field of Deep
Learning. He released a series of lectures called [Neural Networks: Zero
to Hero](https://karpathy.ai/zero-to-hero.html), which I found extremely
educational and practical. I am reviewing the lectures and creating
notes for myself and for teaching purposes.
`buildNanoGPT` was written using [nbdev](https://nbdev.fast.ai/), which
was developed by [Jeremy Howard](https://jeremy.fast.ai/), the man who
also needs no introduction in the field of Deep Learning. Jeremy created
the `fastai` Deep Learning [library](https://docs.fast.ai/) and
[courses](https://course.fast.ai/), both of which are extremely
influential. I highly recommend `fastai` if you are interested in
starting your ML and DL learning journey.
`nbdev` is a powerful tool for efficiently developing, building,
testing, documenting, and distributing software packages all in one
place: a Jupyter Notebook or, in my case, Jupyter notebooks in VS Code.
If you study lectures by Andrej and Jeremy, you will probably notice
that they are both great educators who use both top-down and bottom-up
approaches in their teaching, but Andrej predominantly uses the
*bottom-up* approach while Jeremy predominantly uses the *top-down* one.
I am personally fascinated by both educators, have found value in both
of their styles, and hope you will too!
## Usage
### Prepare FineWeb-Edu-10B data
``` python
from buildNanoGPT import data
import tiktoken
import numpy as np
```
``` python
enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens['<|endoftext|>'] # end of text token
eot
```
50256
``` python
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.uint16)
t_ref
```
array([50256, 15496, 11, 995, 0], dtype=uint16)
``` python
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.int32)
t_ref
```
array([50256, 15496, 11, 995, 0], dtype=int32)
``` python
doc = {"text":"Hello, world!"}
t_test = data.tokenize(doc)
t_test
```
array([50256, 15496, 11, 995, 0], dtype=uint16)
``` python
assert np.all(t_ref == t_test)
```
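As a quick sanity check, the tokens after the leading `eot` should decode back to the original text (a minimal sketch using `tiktoken`’s standard `decode` API):

``` python
# decode everything after the leading <|endoftext|> token back to text
assert enc.decode(t_test[1:].tolist()) == "Hello, world!"
```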
``` python
# Download and Prepare the FineWeb-Edu-10B sample Data
data.edu_fineweb10B_prep(is_test=True)
```
Resolving data files: 0%| | 0/1630 [00:00<?, ?it/s]
Loading dataset shards: 0%| | 0/98 [00:00<?, ?it/s]
'Hello from `prepare_edu_fineweb10B()`! if you want to download the dataset, set is_test=False and run again.'
### Prepare HellaSwag Evaluation data
``` python
data.hellaswag_val_prep(is_test=True)
```
'Hello from `hellaswag_val_prep()`! if you want to download the dataset, set is_test=False and run again.'
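As the messages above hint, both helpers are dry runs by default; passing `is_test=False` performs the actual download and preparation (expect a large download for the 10B-token sample):

``` python
# actually download and shard the datasets (large; run once)
data.edu_fineweb10B_prep(is_test=False)
data.hellaswag_val_prep(is_test=False)
```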
### Load Pre-trained Weight
``` python
from buildNanoGPT.model import GPT, GPTConfig
from buildNanoGPT.train import DDPConfig, TrainingConfig, generate_text
import tiktoken
import torch
from torch.nn import functional as F
```
``` python
master_process = True
model = GPT.from_pretrained("gpt2", master_process)
```
loading weights from pretrained gpt: gpt2
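Assuming `from_pretrained` mirrors Karpathy’s build-nanoGPT implementation, the larger GPT-2 checkpoints should load the same way; the variant names below come from that upstream code and are not verified against this package:

``` python
# hypothetical larger variants, assuming parity with build-nanoGPT's from_pretrained
# model = GPT.from_pretrained("gpt2-medium", master_process)  # ~350M params
# model = GPT.from_pretrained("gpt2-large", master_process)   # ~774M params
# model = GPT.from_pretrained("gpt2-xl", master_process)      # ~1558M params
```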
``` python
enc = tiktoken.get_encoding('gpt2')
```
``` python
ddp_cf = DDPConfig()
model.to(ddp_cf.device)
```
using device: cuda
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
``` python
generate_text(model, enc, ddp_cf)
```
rank 0 sample 0: Hello, I'm a language model, and I do not want to use some third-party file manager I used on my laptop. It would probably be easier
rank 0 sample 1: Hello, I'm a language model, not a problem solver. I should be writing. In the first book, I was in the trouble of proving that
rank 0 sample 2: Hello, I'm a language model, not a script," he said.
Banks and regulators will likely be wary of such a move, but for
rank 0 sample 3: Hello, I'm a language model, you must understand this.
So what really happened?
This article would be too short and concise. That
### Training
1. import modules and functions
``` python
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model
import torch
```
2. set the seed for the random number generator for reproducibility
``` python
set_random_seed(seed=1337) # for reproducibility
```
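For reference, `set_random_seed` presumably follows the standard PyTorch seeding pattern; the sketch below shows that pattern and is not necessarily this function’s exact body:

``` python
import torch

def set_random_seed_sketch(seed=1337):
    # seed the CPU RNG and, if available, all CUDA RNGs
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```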
3. initialize the DDP and Training configs - read the documentation and
   modify the config parameters as desired (see the override sketch
   after this step)
``` python
ddp_cf = DDPConfig()
```
using device: cuda
``` python
train_cf = TrainingConfig()
```
using device: cuda
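Assuming the configs accept keyword overrides (as `TrainingConfig(max_lr=1e-6)` does in the fine-tuning section below), selected fields can be set at construction. `B`, `T`, and `max_lr` are the field names used elsewhere in this README; other names would need to be checked against the docs:

``` python
# override selected fields; B, T, and max_lr are the names used elsewhere in this README
train_cf = TrainingConfig(B=16, T=1024, max_lr=6e-4)
```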
4. set up the train and validation dataloaders
``` python
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
```
found 99 shards for split train
found 1 shards for split val
5. set up the GPT model
``` python
model = create_model(ddp_cf)
```
6. train the GPT model
``` python
train_GPT(model, train_loader, val_loader, train_cf, ddp_cf)
```
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,354,560 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 10.9834
HellaSwag accuracy: 2534/10042=0.2523
step 0 | loss: 10.981724 | lr 6.0000e-05 | norm: 15.4339 | dt: 82819.52ms | tok/sec: 6330.49
step 1 | loss: 10.157787 | lr 1.2000e-04 | norm: 6.5679 | dt: 10668.81ms | tok/sec: 49142.14
step 2 | loss: 9.793260 | lr 1.8000e-04 | norm: 2.8270 | dt: 10747.73ms | tok/sec: 48781.28
step 3 | loss: 9.575678 | lr 2.4000e-04 | norm: 2.2934 | dt: 10789.36ms | tok/sec: 48593.07
step 4 | loss: 9.409717 | lr 3.0000e-04 | norm: 2.0182 | dt: 10883.30ms | tok/sec: 48173.61
step 5 | loss: 9.196922 | lr 3.6000e-04 | norm: 2.0160 | dt: 10734.89ms | tok/sec: 48839.61
step 6 | loss: 8.960140 | lr 4.2000e-04 | norm: 1.8684 | dt: 10902.57ms | tok/sec: 48088.46
step 7 | loss: 8.707756 | lr 4.8000e-04 | norm: 1.5884 | dt: 10851.94ms | tok/sec: 48312.84
step 8 | loss: 8.428266 | lr 5.4000e-04 | norm: 1.3737 | dt: 10883.36ms | tok/sec: 48173.34
step 9 | loss: 8.166906 | lr 6.0000e-04 | norm: 1.1468 | dt: 10797.07ms | tok/sec: 48558.35
step 10 | loss: 8.857561 | lr 6.0000e-04 | norm: 23.7457 | dt: 10755.35ms | tok/sec: 48746.74
step 11 | loss: 7.858195 | lr 5.8679e-04 | norm: 0.8712 | dt: 10667.08ms | tok/sec: 49150.09
step 12 | loss: 7.823021 | lr 5.4843e-04 | norm: 0.7075 | dt: 10793.02ms | tok/sec: 48576.59
step 13 | loss: 7.755527 | lr 4.8870e-04 | norm: 0.6744 | dt: 10827.16ms | tok/sec: 48423.42
step 14 | loss: 7.593850 | lr 4.1343e-04 | norm: 0.5836 | dt: 10730.71ms | tok/sec: 48858.64
step 15 | loss: 7.618423 | lr 3.3000e-04 | norm: 0.6430 | dt: 10648.68ms | tok/sec: 49235.03
step 16 | loss: 7.664069 | lr 2.4657e-04 | norm: 0.5456 | dt: 10749.31ms | tok/sec: 48774.10
step 17 | loss: 7.603458 | lr 1.7130e-04 | norm: 0.6211 | dt: 10837.78ms | tok/sec: 48375.97
step 18 | loss: 7.809735 | lr 1.1157e-04 | norm: 0.4929 | dt: 10698.80ms | tok/sec: 49004.37
validation loss: 7.6044
HellaSwag accuracy: 2448/10042=0.2438
rank 0 sample 0: Hello, I'm a language model,:
the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
or:
the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
of:)
step 19 | loss: 7.893970 | lr 7.3215e-05 | norm: 0.6688 | dt: 85602.68ms | tok/sec: 6124.67
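The two header lines in the log above follow from simple arithmetic: the trainer accumulates micro-batches of `B*T` tokens per device until the desired total batch is reached. A worked check, assuming `B=16`, `T=1024`, and a single GPU:

``` python
total_batch_size = 524288          # desired tokens per optimizer step (2**19)
B, T, world_size = 16, 1024, 1     # assumed micro-batch settings: 16*1024 = 16,384 tokens
grad_accum_steps = total_batch_size // (B * T * world_size)
assert grad_accum_steps == 32      # matches "calculated gradient accumulation steps: 32"
```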
### Load Checkpoint
``` python
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
```
1. set up the GPT model
``` python
ddp_cf = DDPConfig()
model = create_model(ddp_cf)
```
using device: cuda
2. load the model weights from the saved checkpoint
``` python
model_checkpoint = torch.load("log/model_00019.pt")
checkpoint_state_dict = model_checkpoint['model']
model.load_state_dict(checkpoint_state_dict)
```
<All keys matched successfully>
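`torch.load` restores tensors to the devices they were saved from; when loading a GPU-trained checkpoint on a CPU-only machine or a different device, passing the standard `map_location` argument avoids device mismatches (the `'model'` key is confirmed by the cell above):

``` python
# remap checkpoint tensors onto the current device while loading
model_checkpoint = torch.load("log/model_00019.pt", map_location=ddp_cf.device)
model.load_state_dict(model_checkpoint['model'])
```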
3. generate text from the restored weights - the samples below match
   those printed during training, confirming the checkpoint round-trip
``` python
enc = tiktoken.get_encoding('gpt2')
generate_text(model, enc, ddp_cf)
```
rank 0 sample 0: Hello, I'm a language model,:
the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
or:
the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
of:)
### Fine-tune from OpenAI’s weights
``` python
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
```
1. load OpenAI’s pre-trained weights
``` python
ddp_cf = DDPConfig()
model_fine = GPT.from_pretrained("gpt2", ddp_cf.master_process)
model_fine.to(ddp_cf.device)
```
using device: cuda
loading weights from pretrained gpt: gpt2
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
2. set seed for reproducibility
``` python
set_random_seed(seed=1337) # for reproducibility
```
3. set up the training parameters - set `max_lr` to a small value since
   this is a fine-tuning step. More advanced fine-tuning may include
   supervised fine-tuning (SFT) on custom data and finer control over
   how much each layer is updated (see the freezing sketch after this
   step).
``` python
train_cf = TrainingConfig(max_lr=1e-6)
```
using device: cuda
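For the “finer control” mentioned in step 3, one common recipe is to freeze most of the network and leave only the top layers trainable. This is a sketch using standard PyTorch `requires_grad` freezing against the module tree printed above; it is not a built-in feature of this package, and `train_GPT` would need to respect `requires_grad` for it to take effect:

``` python
# freeze everything, then unfreeze the last transformer block, final LayerNorm, and LM head
for p in model_fine.parameters():
    p.requires_grad = False
for module in (model_fine.transformer.h[-1], model_fine.transformer.ln_f, model_fine.lm_head):
    for p in module.parameters():
        p.requires_grad = True
```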
4. set up the train and validation dataloaders
``` python
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
```
found 99 shards for split train
found 1 shards for split val
5. fine-tune the model
``` python
train_GPT(model_fine, train_loader, val_loader, train_cf, ddp_cf)
```
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 3.2530
HellaSwag accuracy: 2970/10042=0.2958
step 0 | loss: 3.279157 | lr 1.0000e-07 | norm: 2.3655 | dt: 80251.91ms | tok/sec: 6533.03
step 1 | loss: 3.322400 | lr 2.0000e-07 | norm: 2.3916 | dt: 10466.55ms | tok/sec: 50091.77
step 2 | loss: 3.310521 | lr 3.0000e-07 | norm: 2.5691 | dt: 10404.72ms | tok/sec: 50389.42
step 3 | loss: 3.403320 | lr 4.0000e-07 | norm: 2.5293 | dt: 10539.22ms | tok/sec: 49746.40
step 4 | loss: 3.280189 | lr 5.0000e-07 | norm: 2.5589 | dt: 10462.80ms | tok/sec: 50109.70
step 5 | loss: 3.341536 | lr 6.0000e-07 | norm: 2.4456 | dt: 10489.14ms | tok/sec: 49983.90
step 6 | loss: 3.388632 | lr 7.0000e-07 | norm: 2.3444 | dt: 10656.34ms | tok/sec: 49199.62
step 7 | loss: 3.336595 | lr 8.0000e-07 | norm: 2.4381 | dt: 10750.67ms | tok/sec: 48767.94
step 8 | loss: 3.358722 | lr 9.0000e-07 | norm: 2.0390 | dt: 10728.56ms | tok/sec: 48868.44
step 9 | loss: 3.303847 | lr 1.0000e-06 | norm: 2.5693 | dt: 10549.71ms | tok/sec: 49696.89
step 10 | loss: 3.338424 | lr 1.0000e-06 | norm: 2.5449 | dt: 10565.95ms | tok/sec: 49620.54
step 11 | loss: 3.326447 | lr 9.7798e-07 | norm: 2.2862 | dt: 10577.53ms | tok/sec: 49566.18
step 12 | loss: 3.297659 | lr 9.1406e-07 | norm: 2.2453 | dt: 10640.80ms | tok/sec: 49271.47
step 13 | loss: 3.298663 | lr 8.1450e-07 | norm: 2.2228 | dt: 10551.25ms | tok/sec: 49689.67
step 14 | loss: 3.304088 | lr 6.8906e-07 | norm: 2.5593 | dt: 10415.45ms | tok/sec: 50337.54
step 15 | loss: 3.373518 | lr 5.5000e-07 | norm: 2.3321 | dt: 10446.78ms | tok/sec: 50186.59
step 16 | loss: 3.314626 | lr 4.1094e-07 | norm: 2.3768 | dt: 10416.73ms | tok/sec: 50331.33
step 17 | loss: 3.331042 | lr 2.8550e-07 | norm: 2.1369 | dt: 10248.14ms | tok/sec: 51159.35
step 18 | loss: 3.334763 | lr 1.8594e-07 | norm: 1.8012 | dt: 10206.37ms | tok/sec: 51368.71
validation loss: 3.2394
HellaSwag accuracy: 2959/10042=0.2947
rank 0 sample 0: Hello, I'm a language model, and I know how it works: You, to my knowledge, invented Java!
We all do the same stuff
rank 0 sample 1: Hello, I'm a language model, not a function. It's the last thing that works here, I guess. I think this is very much a misunderstanding
rank 0 sample 2: Hello, I'm a language model, not a writing language. Let's use a syntax like this (which is a bit different from the one in C):
rank 0 sample 3: Hello, I'm a language model, you and I can talk about it!" He also said that he doesn't want to use other people's language, nor
step 19 | loss: 3.189983 | lr 1.2202e-07 | norm: 1.9916 | dt: 80862.14ms | tok/sec: 6483.73
### Visualize the Loss
``` python
from buildNanoGPT.viz import plot_log
```
``` python
plot_log(log_file='log/log_6500steps.txt', sz='124M')
```
Min Train Loss: 2.997356
Min Validation Loss: 3.275
Max Hellaswag eval: 0.2782

## How to install
The [buildNanoGPT](https://pypi.org/project/buildNanoGPT/) package was
uploaded to [PyPI](https://pypi.org/) and can be easily installed using
the below command.
`pip install buildNanoGPT`
### Developer install
If you want to develop `buildNanoGPT` yourself, please use an editable
installation.
`git clone https://github.com/hdocmsu/buildNanoGPT.git`
`pip install -e "buildNanoGPT[dev]"`
You also need to use an editable installation of
[nbdev](https://github.com/fastai/nbdev),
[fastcore](https://github.com/fastai/fastcore), and
[execnb](https://github.com/fastai/execnb).
Happy Coding!!!
<div class="alert alert-info">
<b>Note:</b> `buildNanoGPT` is currently Work in Progress (WIP).
</div>