# buildNanoGPT
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
> `buildNanoGPT` is developed based on Andrej Karpathy’s
> [build-nanoGPT](https://github.com/karpathy/build-nanoGPT) repo and
> [Let’s reproduce GPT-2
> (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU) with added notes
> and details for teaching purposes using
> [nbdev](https://nbdev.fast.ai/), which enables package development,
> testing, documentation, and dissemination all in one place - a Jupyter
> Notebook or, in my case, Jupyter notebooks in Visual Studio Code 😄.
## Literate Programming
`buildNanoGPT`
``` mermaid
flowchart LR
A(Andrej's build-nanoGPT) --> C((Combination))
B(Jeremy's nbdev) --> C
C -->|Literate Programming| D(buildNanoGPT)
```
<img src='media/literate_programming.svg' width='100%' height='auto'>
## Disclaimers
`buildNanoGPT` is written based on [Andrej
Karpathy](https://karpathy.ai/)’s GitHub repo
[build-nanoGPT](https://github.com/karpathy/build-nanoGPT) and his [“Neural
Networks: Zero to
Hero”](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
lecture series, specifically the lecture [Let’s reproduce GPT-2
(124M)](https://www.youtube.com/watch?v=l8pRSuU81PU).
Andrej is the man who needs no introduction in the field of Deep
Learning. He released a series of lectures called [Neural Networks: Zero
to Hero](https://karpathy.ai/zero-to-hero.html), which I found extremely
educational and practical. I am reviewing the lectures and creating
notes for myself and for teaching purposes.
`buildNanoGPT` was written using [nbdev](https://nbdev.fast.ai/), which
was developed by [Jeremy Howard](https://jeremy.fast.ai/), the man who
also needs no introduction in the field of Deep Learning. Jeremy created
the `fastai` Deep Learning [library](https://docs.fast.ai/) and
[courses](https://course.fast.ai/), both of which are extremely
influential. I highly recommend `fastai` if you are interested in
starting your ML and DL learning journey.
`nbdev` is a powerful tool for efficiently developing, building,
testing, documenting, and distributing software packages all in one
place: a Jupyter Notebook or, in my case, Jupyter notebooks in VS Code.
If you study lectures by Andrej and Jeremy, you will probably notice
that they are both great educators who use both top-down and bottom-up
approaches in their teaching, but Andrej predominantly uses the
*bottom-up* approach while Jeremy predominantly uses the *top-down* one.
I am personally fascinated by both educators, have found value in both
of their styles, and hope you will too!
## Usage
### Prepare FineWeb-Edu-10B data
``` python
from buildNanoGPT import data
import tiktoken
import numpy as np
```
``` python
enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens['<|endoftext|>'] # end of text token
eot
```
50256
``` python
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.uint16)
t_ref
```
array([50256, 15496, 11, 995, 0], dtype=uint16)
``` python
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.int32)
t_ref
```
array([50256, 15496, 11, 995, 0], dtype=int32)
``` python
doc = {"text":"Hello, world!"}
t_test = data.tokenize(doc)
t_test
```
array([50256, 15496, 11, 995, 0], dtype=uint16)
``` python
assert np.all(t_ref == t_test)
```
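As a quick sanity check, the tokens after the leading `eot` should decode back to the original text (a minimal sketch using `tiktoken`’s standard `decode` API):

``` python
# decode everything after the leading <|endoftext|> token back to text
assert enc.decode(t_test[1:].tolist()) == "Hello, world!"
```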
``` python
# Download and Prepare the FineWeb-Edu-10B sample Data
data.edu_fineweb10B_prep(is_test=True)
```
Resolving data files: 0%| | 0/1630 [00:00<?, ?it/s]
Loading dataset shards: 0%| | 0/98 [00:00<?, ?it/s]
'Hello from `prepare_edu_fineweb10B()`! if you want to download the dataset, set is_test=False and run again.'
### Prepare HellaSwag Evaluation data
``` python
data.hellaswag_val_prep(is_test=True)
```
'Hello from `hellaswag_val_prep()`! if you want to download the dataset, set is_test=False and run again.'
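As the messages above hint, both helpers are dry runs by default; passing `is_test=False` performs the actual download and preparation (expect a large download for the 10B-token sample):

``` python
# actually download and shard the datasets (large; run once)
data.edu_fineweb10B_prep(is_test=False)
data.hellaswag_val_prep(is_test=False)
```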
### Load Pre-trained Weight
``` python
from buildNanoGPT.model import GPT, GPTConfig
from buildNanoGPT.train import DDPConfig, TrainingConfig, generate_text
import tiktoken
import torch
from torch.nn import functional as F
```
``` python
master_process = True
model = GPT.from_pretrained("gpt2", master_process)
```
loading weights from pretrained gpt: gpt2
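Assuming `from_pretrained` mirrors Karpathy’s build-nanoGPT implementation, the larger GPT-2 checkpoints should load the same way; the variant names below come from that upstream code and are not verified against this package:

``` python
# hypothetical larger variants, assuming parity with build-nanoGPT's from_pretrained
# model = GPT.from_pretrained("gpt2-medium", master_process)  # ~350M params
# model = GPT.from_pretrained("gpt2-large", master_process)   # ~774M params
# model = GPT.from_pretrained("gpt2-xl", master_process)      # ~1558M params
```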
``` python
enc = tiktoken.get_encoding('gpt2')
```
``` python
ddp_cf = DDPConfig()
model.to(ddp_cf.device)
```
using device: cuda
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
``` python
generate_text(model, enc, ddp_cf)
```
rank 0 sample 0: Hello, I'm a language model, and I do not want to use some third-party file manager I used on my laptop. It would probably be easier
rank 0 sample 1: Hello, I'm a language model, not a problem solver. I should be writing. In the first book, I was in the trouble of proving that
rank 0 sample 2: Hello, I'm a language model, not a script," he said.
Banks and regulators will likely be wary of such a move, but for
rank 0 sample 3: Hello, I'm a language model, you must understand this.
So what really happened?
This article would be too short and concise. That
### Training
1. import modules and functions
``` python
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model
import torch
```
2. set the seed for the random number generator for reproducibility
``` python
set_random_seed(seed=1337) # for reproducibility
```
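For reference, `set_random_seed` presumably follows the standard PyTorch seeding pattern; the sketch below shows that pattern and is not necessarily this function’s exact body:

``` python
import torch

def set_random_seed_sketch(seed=1337):
    # seed the CPU RNG and, if available, all CUDA RNGs
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```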
3. initialize the DDP and Training configs - read the documentation and
   modify the config parameters as desired (see the override sketch
   after this step)
``` python
ddp_cf = DDPConfig()
```
using device: cuda
``` python
train_cf = TrainingConfig()
```
using device: cuda
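Assuming the configs accept keyword overrides (as `TrainingConfig(max_lr=1e-6)` does in the fine-tuning section below), selected fields can be set at construction. `B`, `T`, and `max_lr` are the field names used elsewhere in this README; other names would need to be checked against the docs:

``` python
# override selected fields; B, T, and max_lr are the names used elsewhere in this README
train_cf = TrainingConfig(B=16, T=1024, max_lr=6e-4)
```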
4. set up the train and validation dataloaders
``` python
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
```
found 99 shards for split train
found 1 shards for split val
5. set up the GPT model
``` python
model = create_model(ddp_cf)
```
6. train the GPT model
``` python
train_GPT(model, train_loader, val_loader, train_cf, ddp_cf)
```
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,354,560 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 10.9834
HellaSwag accuracy: 2534/10042=0.2523
step 0 | loss: 10.981724 | lr 6.0000e-05 | norm: 15.4339 | dt: 82819.52ms | tok/sec: 6330.49
step 1 | loss: 10.157787 | lr 1.2000e-04 | norm: 6.5679 | dt: 10668.81ms | tok/sec: 49142.14
step 2 | loss: 9.793260 | lr 1.8000e-04 | norm: 2.8270 | dt: 10747.73ms | tok/sec: 48781.28
step 3 | loss: 9.575678 | lr 2.4000e-04 | norm: 2.2934 | dt: 10789.36ms | tok/sec: 48593.07
step 4 | loss: 9.409717 | lr 3.0000e-04 | norm: 2.0182 | dt: 10883.30ms | tok/sec: 48173.61
step 5 | loss: 9.196922 | lr 3.6000e-04 | norm: 2.0160 | dt: 10734.89ms | tok/sec: 48839.61
step 6 | loss: 8.960140 | lr 4.2000e-04 | norm: 1.8684 | dt: 10902.57ms | tok/sec: 48088.46
step 7 | loss: 8.707756 | lr 4.8000e-04 | norm: 1.5884 | dt: 10851.94ms | tok/sec: 48312.84
step 8 | loss: 8.428266 | lr 5.4000e-04 | norm: 1.3737 | dt: 10883.36ms | tok/sec: 48173.34
step 9 | loss: 8.166906 | lr 6.0000e-04 | norm: 1.1468 | dt: 10797.07ms | tok/sec: 48558.35
step 10 | loss: 8.857561 | lr 6.0000e-04 | norm: 23.7457 | dt: 10755.35ms | tok/sec: 48746.74
step 11 | loss: 7.858195 | lr 5.8679e-04 | norm: 0.8712 | dt: 10667.08ms | tok/sec: 49150.09
step 12 | loss: 7.823021 | lr 5.4843e-04 | norm: 0.7075 | dt: 10793.02ms | tok/sec: 48576.59
step 13 | loss: 7.755527 | lr 4.8870e-04 | norm: 0.6744 | dt: 10827.16ms | tok/sec: 48423.42
step 14 | loss: 7.593850 | lr 4.1343e-04 | norm: 0.5836 | dt: 10730.71ms | tok/sec: 48858.64
step 15 | loss: 7.618423 | lr 3.3000e-04 | norm: 0.6430 | dt: 10648.68ms | tok/sec: 49235.03
step 16 | loss: 7.664069 | lr 2.4657e-04 | norm: 0.5456 | dt: 10749.31ms | tok/sec: 48774.10
step 17 | loss: 7.603458 | lr 1.7130e-04 | norm: 0.6211 | dt: 10837.78ms | tok/sec: 48375.97
step 18 | loss: 7.809735 | lr 1.1157e-04 | norm: 0.4929 | dt: 10698.80ms | tok/sec: 49004.37
validation loss: 7.6044
HellaSwag accuracy: 2448/10042=0.2438
rank 0 sample 0: Hello, I'm a language model,:
the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
or:
the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
of:)
step 19 | loss: 7.893970 | lr 7.3215e-05 | norm: 0.6688 | dt: 85602.68ms | tok/sec: 6124.67
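The two header lines in the log above follow from simple arithmetic: the trainer accumulates micro-batches of `B*T` tokens per device until the desired total batch is reached. A worked check, assuming `B=16`, `T=1024`, and a single GPU:

``` python
total_batch_size = 524288          # desired tokens per optimizer step (2**19)
B, T, world_size = 16, 1024, 1     # assumed micro-batch settings: 16*1024 = 16,384 tokens
grad_accum_steps = total_batch_size // (B * T * world_size)
assert grad_accum_steps == 32      # matches "calculated gradient accumulation steps: 32"
```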
### Load Checkpoint
``` python
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
```
1. set up the GPT model
``` python
ddp_cf = DDPConfig()
model = create_model(ddp_cf)
```
using device: cuda
2. load the model weights from the saved checkpoint
``` python
model_checkpoint = torch.load("log/model_00019.pt")
checkpoint_state_dict = model_checkpoint['model']
model.load_state_dict(checkpoint_state_dict)
```
<All keys matched successfully>
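`torch.load` restores tensors to the devices they were saved from; when loading a GPU-trained checkpoint on a CPU-only machine or a different device, passing the standard `map_location` argument avoids device mismatches (the `'model'` key is confirmed by the cell above):

``` python
# remap checkpoint tensors onto the current device while loading
model_checkpoint = torch.load("log/model_00019.pt", map_location=ddp_cf.device)
model.load_state_dict(model_checkpoint['model'])
```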
3. generate text from the restored weights - the samples below match
   those printed during training, confirming the checkpoint round-trip
``` python
enc = tiktoken.get_encoding('gpt2')
generate_text(model, enc, ddp_cf)
```
rank 0 sample 0: Hello, I'm a language model,:
the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
or:
the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
of:)
### Fine-tune from OpenAI’s weights
``` python
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
```
1. load OpenAI’s pre-trained weights
``` python
ddp_cf = DDPConfig()
model_fine = GPT.from_pretrained("gpt2", ddp_cf.master_process)
model_fine.to(ddp_cf.device)
```
using device: cuda
loading weights from pretrained gpt: gpt2
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
2. set seed for reproducibility
``` python
set_random_seed(seed=1337) # for reproducibility
```
3. set up the training parameters - set `max_lr` to a small value since
   this is a fine-tuning step. More advanced fine-tuning may include
   supervised fine-tuning (SFT) on custom data and finer control over
   how much each layer is updated (see the freezing sketch after this
   step).
``` python
train_cf = TrainingConfig(max_lr=1e-6)
```
using device: cuda
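For the “finer control” mentioned in step 3, one common recipe is to freeze most of the network and leave only the top layers trainable. This is a sketch using standard PyTorch `requires_grad` freezing against the module tree printed above; it is not a built-in feature of this package, and `train_GPT` would need to respect `requires_grad` for it to take effect:

``` python
# freeze everything, then unfreeze the last transformer block, final LayerNorm, and LM head
for p in model_fine.parameters():
    p.requires_grad = False
for module in (model_fine.transformer.h[-1], model_fine.transformer.ln_f, model_fine.lm_head):
    for p in module.parameters():
        p.requires_grad = True
```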
4. set up the train and validation dataloaders
``` python
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
```
found 99 shards for split train
found 1 shards for split val
5. fine-tune the model
``` python
train_GPT(model_fine, train_loader, val_loader, train_cf, ddp_cf)
```
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 3.2530
HellaSwag accuracy: 2970/10042=0.2958
step 0 | loss: 3.279157 | lr 1.0000e-07 | norm: 2.3655 | dt: 80251.91ms | tok/sec: 6533.03
step 1 | loss: 3.322400 | lr 2.0000e-07 | norm: 2.3916 | dt: 10466.55ms | tok/sec: 50091.77
step 2 | loss: 3.310521 | lr 3.0000e-07 | norm: 2.5691 | dt: 10404.72ms | tok/sec: 50389.42
step 3 | loss: 3.403320 | lr 4.0000e-07 | norm: 2.5293 | dt: 10539.22ms | tok/sec: 49746.40
step 4 | loss: 3.280189 | lr 5.0000e-07 | norm: 2.5589 | dt: 10462.80ms | tok/sec: 50109.70
step 5 | loss: 3.341536 | lr 6.0000e-07 | norm: 2.4456 | dt: 10489.14ms | tok/sec: 49983.90
step 6 | loss: 3.388632 | lr 7.0000e-07 | norm: 2.3444 | dt: 10656.34ms | tok/sec: 49199.62
step 7 | loss: 3.336595 | lr 8.0000e-07 | norm: 2.4381 | dt: 10750.67ms | tok/sec: 48767.94
step 8 | loss: 3.358722 | lr 9.0000e-07 | norm: 2.0390 | dt: 10728.56ms | tok/sec: 48868.44
step 9 | loss: 3.303847 | lr 1.0000e-06 | norm: 2.5693 | dt: 10549.71ms | tok/sec: 49696.89
step 10 | loss: 3.338424 | lr 1.0000e-06 | norm: 2.5449 | dt: 10565.95ms | tok/sec: 49620.54
step 11 | loss: 3.326447 | lr 9.7798e-07 | norm: 2.2862 | dt: 10577.53ms | tok/sec: 49566.18
step 12 | loss: 3.297659 | lr 9.1406e-07 | norm: 2.2453 | dt: 10640.80ms | tok/sec: 49271.47
step 13 | loss: 3.298663 | lr 8.1450e-07 | norm: 2.2228 | dt: 10551.25ms | tok/sec: 49689.67
step 14 | loss: 3.304088 | lr 6.8906e-07 | norm: 2.5593 | dt: 10415.45ms | tok/sec: 50337.54
step 15 | loss: 3.373518 | lr 5.5000e-07 | norm: 2.3321 | dt: 10446.78ms | tok/sec: 50186.59
step 16 | loss: 3.314626 | lr 4.1094e-07 | norm: 2.3768 | dt: 10416.73ms | tok/sec: 50331.33
step 17 | loss: 3.331042 | lr 2.8550e-07 | norm: 2.1369 | dt: 10248.14ms | tok/sec: 51159.35
step 18 | loss: 3.334763 | lr 1.8594e-07 | norm: 1.8012 | dt: 10206.37ms | tok/sec: 51368.71
validation loss: 3.2394
HellaSwag accuracy: 2959/10042=0.2947
rank 0 sample 0: Hello, I'm a language model, and I know how it works: You, to my knowledge, invented Java!
We all do the same stuff
rank 0 sample 1: Hello, I'm a language model, not a function. It's the last thing that works here, I guess. I think this is very much a misunderstanding
rank 0 sample 2: Hello, I'm a language model, not a writing language. Let's use a syntax like this (which is a bit different from the one in C):
rank 0 sample 3: Hello, I'm a language model, you and I can talk about it!" He also said that he doesn't want to use other people's language, nor
step 19 | loss: 3.189983 | lr 1.2202e-07 | norm: 1.9916 | dt: 80862.14ms | tok/sec: 6483.73
### Visualize the Loss
``` python
from buildNanoGPT.viz import plot_log
```
``` python
plot_log(log_file='log/log_6500steps.txt', sz='124M')
```
Min Train Loss: 2.997356
Min Validation Loss: 3.275
Max Hellaswag eval: 0.2782

## How to install
The [buildNanoGPT](https://pypi.org/project/buildNanoGPT/) package was
uploaded to [PyPI](https://pypi.org/) and can be easily installed using
the below command.
`pip install buildNanoGPT`
### Developer install
If you want to develop `buildNanoGPT` yourself, please use an editable
installation.
`git clone https://github.com/hdocmsu/buildNanoGPT.git`
`pip install -e "buildNanoGPT[dev]"`
You also need to use an editable installation of
[nbdev](https://github.com/fastai/nbdev),
[fastcore](https://github.com/fastai/fastcore), and
[execnb](https://github.com/fastai/execnb).
Happy Coding!!!
<div class="alert alert-info">
<b>Note:</b> `buildNanoGPT` is currently Work in Progress (WIP).
</div>