# buildNanoGPT
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
> `buildNanoGPT` is based on Andrej Karpathy’s
> [build-nanoGPT](https://github.com/karpathy/build-nanoGPT) repo and his
> lecture [Let’s reproduce GPT-2
> (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU), with added notes
> and details for teaching purposes. It is written with
> [nbdev](https://nbdev.fast.ai/), which enables package development,
> testing, documentation, and dissemination all in one place: a Jupyter
> Notebook (in my case, Jupyter Notebooks in Visual Studio Code) 😄.
## Literate Programming
`buildNanoGPT`
``` mermaid
flowchart LR
A(Andrej's build-nanoGPT) --> C((Combination))
B(Jeremy's nbdev) --> C
C -->|Literate Programming| D(buildNanoGPT)
```
`micrograd2023`
<img src='media/literate_programming.svg' width=100% height=auto >
## Disclaimers
`buildNanoGPT` is based on [Andrej
Karpathy](https://karpathy.ai/)’s GitHub repo
[build-nanoGPT](https://github.com/karpathy/build-nanoGPT) and his [“Neural
Networks: Zero to
Hero”](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
lecture series, specifically the lecture [Let’s reproduce GPT-2
(124M)](https://www.youtube.com/watch?v=l8pRSuU81PU).
Andrej needs no introduction in the field of Deep Learning. He released
a series of lectures called [Neural Networks: Zero
to Hero](https://karpathy.ai/zero-to-hero.html), which I found extremely
educational and practical. I am reviewing the lectures and creating
notes for myself and for teaching purposes.
`buildNanoGPT` was written using [nbdev](https://nbdev.fast.ai/), which
was developed by [Jeremy Howard](https://jeremy.fast.ai/), who also
needs no introduction in the field of Deep Learning. Jeremy created
the `fastai` deep learning [library](https://docs.fast.ai/) and
[courses](https://course.fast.ai/), both of which are extremely
influential. I highly recommend `fastai` if you are starting your
journey in ML and DL.
`nbdev` is a powerful tool for efficiently developing, building,
testing, documenting, and distributing software packages all in one
place: Jupyter Notebook, or Jupyter Notebooks in VS Code, which is what
I use.
If you study lectures by Andrej and Jeremy, you will probably notice that
both are great educators who combine top-down and bottom-up approaches
in their teaching. Andrej predominantly uses a *bottom-up* approach,
while Jeremy predominantly uses a *top-down* one. I am personally
fascinated by both educators, have found value in both approaches, and
hope you will too!
## Usage
### Prepare FineWeb-Edu-10B data
``` python
from buildNanoGPT import data
import tiktoken
import numpy as np
```
``` python
enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens['<|endoftext|>'] # end of text token
eot
```
50256
``` python
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.uint16)
t_ref
```
array([50256, 15496, 11, 995, 0], dtype=uint16)
``` python
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.int32)
t_ref
```
array([50256, 15496, 11, 995, 0], dtype=int32)
``` python
doc = {"text":"Hello, world!"}
t_test = data.tokenize(doc)
t_test
```
array([50256, 15496, 11, 995, 0], dtype=uint16)
``` python
assert np.all(t_ref == t_test)
```
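
The check above suggests what `data.tokenize` does: prepend the end-of-text token, encode the document text, and store the ids as `uint16`. The sketch below illustrates that behaviour under this assumption; it is not the package's actual implementation. `tokenize_sketch` and its `encode` and `eot` parameters are hypothetical names, and the encoder is passed in so the example runs without downloading the GPT-2 vocabulary.

```python
import numpy as np

def tokenize_sketch(doc, encode, eot):
    """Hypothetical stand-in for data.tokenize: returns uint16 token ids."""
    tokens = [eot]                        # the end-of-text token delimits documents
    tokens.extend(encode(doc["text"]))
    tokens_np = np.array(tokens)
    # uint16 holds 0..65535, which covers the 50257-token GPT-2 vocabulary
    assert (0 <= tokens_np).all() and (tokens_np < 2**16).all(), "token id out of uint16 range"
    return tokens_np.astype(np.uint16)

# Stub encoder (assumption): returns the ids shown above for "Hello, world!"
stub_encode = lambda text: [15496, 11, 995, 0]

tokens = tokenize_sketch({"text": "Hello, world!"}, stub_encode, eot=50256)
print(tokens, tokens.dtype)
```

With `tiktoken`, `enc.encode_ordinary` and `enc.eot_token` could be passed in place of the stub. Storing ids as `uint16` rather than `int32` halves the size of each token shard, and the comparison against the `int32` reference still passes because NumPy compares values, not dtypes.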
``` python
# Download and Prepare the FineWeb-Edu-10B sample Data
data.edu_fineweb10B_prep(is_test=True)
```
Resolving data files: 0%| | 0/1630 [00:00<?, ?it/s]
Loading dataset shards: 0%| | 0/98 [00:00<?, ?it/s]
'Hello from `prepare_edu_fineweb10B()`! if you want to download the dataset, set is_test=False and run again.'
### Prepare HellaSwag Evaluation data
``` python
data.hellaswag_val_prep(is_test=True)
```
'Hello from `hellaswag_val_prep()`! if you want to download the dataset, set is_test=False and run again.'
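
HellaSwag is a four-way multiple-choice benchmark, so an untrained model should score close to 25%, which is what the training log further down shows at step 0. The sketch below shows one common way to score an example: pick the candidate ending with the lowest average per-token loss. This rule is assumed from Karpathy's build-nanoGPT and is not confirmed for this package. `pick_ending` and `accuracy` are hypothetical helper names, and the loss values are made up for illustration.

```python
def pick_ending(per_token_losses):
    """Return the index of the candidate ending with the lowest mean token loss.

    per_token_losses: one list of per-token losses for each candidate ending.
    Averaging (rather than summing) avoids favouring shorter endings.
    """
    avg = [sum(losses) / len(losses) for losses in per_token_losses]
    return min(range(len(avg)), key=avg.__getitem__)

def accuracy(predictions, labels):
    """Return (num_correct, num_total, fraction_correct)."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct, len(labels), correct / len(labels)

# Made-up losses for one example with four candidate endings
example = [
    [3.1, 2.9, 3.4],        # mean 3.13
    [2.2, 2.0, 2.4, 2.1],   # mean 2.18  <- lowest
    [4.0, 3.8],             # mean 3.90
    [2.9, 3.0, 3.1],        # mean 3.00
]
print(pick_ending(example))

# The ratio format used in the training log
print(f"{2534}/{10042}={2534 / 10042:.4f}")
```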
### Load Pre-trained Weights
``` python
from buildNanoGPT.model import GPT, GPTConfig
from buildNanoGPT.train import DDPConfig, TrainingConfig, generate_text
import tiktoken
import torch
from torch.nn import functional as F
```
``` python
master_process = True
model = GPT.from_pretrained("gpt2", master_process)
```
loading weights from pretrained gpt: gpt2
``` python
enc = tiktoken.get_encoding('gpt2')
```
``` python
ddp_cf = DDPConfig()
model.to(ddp_cf.device)
```
using device: cuda
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
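
The layer shapes in this printout are enough to count the parameters by hand. The sketch below does so under two assumptions: `lm_head` shares its weight with `wte` (standard GPT-2 weight tying, so it is counted once), and the optimizer groups split 2D matrices from 1D biases and LayerNorm parameters. `gpt2_param_counts` is a hypothetical helper, not part of the package.

```python
def gpt2_param_counts(vocab_size=50257, block_size=1024, n_layer=12, n_embd=768):
    """Return (matrix_params, vector_params) for a GPT-2 style model.

    Assumes lm_head is tied to wte, so the token embedding is counted once.
    """
    wte = vocab_size * n_embd                  # token embedding (shared with lm_head)
    wpe = block_size * n_embd                  # position embedding
    per_block_matrices = (
        n_embd * 3 * n_embd                    # attn.c_attn weight
        + n_embd * n_embd                      # attn.c_proj weight
        + n_embd * 4 * n_embd                  # mlp.c_fc weight
        + 4 * n_embd * n_embd                  # mlp.c_proj weight
    )
    matrices = wte + wpe + n_layer * per_block_matrices

    per_block_vectors = (
        3 * n_embd + n_embd + 4 * n_embd + n_embd   # the four Linear biases
        + 2 * 2 * n_embd                            # ln_1 and ln_2 (weight + bias each)
    )
    vectors = n_layer * per_block_vectors + 2 * n_embd   # plus ln_f
    return matrices, vectors

matrices, vectors = gpt2_param_counts()
print(f"{matrices:,} + {vectors:,} = {matrices + vectors:,}")

# Vocabulary padded to 50304 (an inference from the training log, not stated by the package)
print(gpt2_param_counts(vocab_size=50304))
```

For the pretrained model shown here the total is 124,439,808, the "124M" in the model name. With the vocabulary padded to 50304, the two groups come to 124,354,560 and 121,344, which match the decayed and non-decayed counts printed in the training log below.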
``` python
generate_text(model, enc, ddp_cf)
```
rank 0 sample 0: Hello, I'm a language model, and I do not want to use some third-party file manager I used on my laptop. It would probably be easier
rank 0 sample 1: Hello, I'm a language model, not a problem solver. I should be writing. In the first book, I was in the trouble of proving that
rank 0 sample 2: Hello, I'm a language model, not a script," he said.
Banks and regulators will likely be wary of such a move, but for
rank 0 sample 3: Hello, I'm a language model, you must understand this.
So what really happened?
This article would be too short and concise. That
### Training
``` python
# Either run 03_train.ipynb or, as a shortcut, import the train module from the buildNanoGPT package (the import runs the training script)
from buildNanoGPT import train
```
using device: cuda
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
found 99 shards for split train
found 1 shards for split val
num decayed parameter tensors: 50, with 124,354,560 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 10.9834
HellaSwag accuracy: 2534/10042=0.2523
step 0 | loss: 10.981724 | lr 6.0000e-06 | norm: 15.4339 | dt: 82809.98ms | tok/sec: 6331.22
step 1 | loss: 10.655205 | lr 1.2000e-05 | norm: 12.4931 | dt: 10492.83ms | tok/sec: 49966.29
step 2 | loss: 10.274603 | lr 1.8000e-05 | norm: 7.7501 | dt: 10522.88ms | tok/sec: 49823.61
step 3 | loss: 10.004156 | lr 2.4000e-05 | norm: 5.2698 | dt: 10481.91ms | tok/sec: 50018.35
step 4 | loss: 9.833108 | lr 3.0000e-05 | norm: 3.6179 | dt: 10495.18ms | tok/sec: 49955.14
step 5 | loss: 9.711222 | lr 3.6000e-05 | norm: 2.7871 | dt: 10484.25ms | tok/sec: 50007.21
step 6 | loss: 9.642426 | lr 4.2000e-05 | norm: 2.4048 | dt: 10679.06ms | tok/sec: 49094.97
step 7 | loss: 9.612312 | lr 4.8000e-05 | norm: 2.3183 | dt: 10555.78ms | tok/sec: 49668.32
step 8 | loss: 9.558184 | lr 5.4000e-05 | norm: 2.2464 | dt: 10685.39ms | tok/sec: 49065.86
step 9 | loss: 9.526472 | lr 6.0000e-05 | norm: 2.2171 | dt: 10548.39ms | tok/sec: 49703.14
step 10 | loss: 9.463450 | lr 6.6000e-05 | norm: 2.1546 | dt: 10559.73ms | tok/sec: 49649.78
step 11 | loss: 9.413282 | lr 7.2000e-05 | norm: 2.1401 | dt: 10495.94ms | tok/sec: 49951.49
step 12 | loss: 9.340552 | lr 7.8000e-05 | norm: 2.0149 | dt: 10668.78ms | tok/sec: 49142.26
step 13 | loss: 9.278631 | lr 8.4000e-05 | norm: 1.9368 | dt: 10605.16ms | tok/sec: 49437.05
step 14 | loss: 9.159446 | lr 9.0000e-05 | norm: 1.9737 | dt: 10701.77ms | tok/sec: 48990.76
step 15 | loss: 9.111786 | lr 9.6000e-05 | norm: 3.0525 | dt: 10732.83ms | tok/sec: 48849.00
step 16 | loss: 9.029915 | lr 1.0200e-04 | norm: 1.9619 | dt: 10790.65ms | tok/sec: 48587.23
step 17 | loss: 8.937255 | lr 1.0800e-04 | norm: 1.8786 | dt: 10621.46ms | tok/sec: 49361.22
step 18 | loss: 8.955976 | lr 1.1400e-04 | norm: 2.0179 | dt: 10545.33ms | tok/sec: 49717.53
step 19 | loss: 8.888343 | lr 1.2000e-04 | norm: 1.9142 | dt: 10598.08ms | tok/sec: 49470.11
step 20 | loss: 8.672051 | lr 1.2600e-04 | norm: 1.7543 | dt: 10730.04ms | tok/sec: 48861.68
step 21 | loss: 8.556496 | lr 1.3200e-04 | norm: 1.6246 | dt: 10822.08ms | tok/sec: 48446.13
step 22 | loss: 8.463942 | lr 1.3800e-04 | norm: 1.4898 | dt: 10733.11ms | tok/sec: 48847.72
step 23 | loss: 8.389053 | lr 1.4400e-04 | norm: 1.9412 | dt: 10555.51ms | tok/sec: 49669.61
step 24 | loss: 8.257857 | lr 1.5000e-04 | norm: 2.0539 | dt: 10732.67ms | tok/sec: 48849.75
step 25 | loss: 8.128786 | lr 1.5600e-04 | norm: 1.4269 | dt: 10609.93ms | tok/sec: 49414.84
step 26 | loss: 8.098352 | lr 1.6200e-04 | norm: 2.0206 | dt: 10487.59ms | tok/sec: 49991.30
step 27 | loss: 7.961097 | lr 1.6800e-04 | norm: 1.2978 | dt: 10578.22ms | tok/sec: 49562.95
step 28 | loss: 7.884172 | lr 1.7400e-04 | norm: 1.2289 | dt: 10497.51ms | tok/sec: 49944.04
step 29 | loss: 7.765845 | lr 1.8000e-04 | norm: 1.1969 | dt: 10724.78ms | tok/sec: 48885.65
step 30 | loss: 7.821087 | lr 1.8600e-04 | norm: 1.0228 | dt: 10792.80ms | tok/sec: 48577.58
step 31 | loss: 7.689835 | lr 1.9200e-04 | norm: 0.9216 | dt: 10752.80ms | tok/sec: 48758.30
step 32 | loss: 7.641486 | lr 1.9800e-04 | norm: 0.8666 | dt: 10985.01ms | tok/sec: 47727.58
step 33 | loss: 7.572504 | lr 2.0400e-04 | norm: 0.7996 | dt: 10684.39ms | tok/sec: 49070.46
step 34 | loss: 7.429519 | lr 2.1000e-04 | norm: 0.7874 | dt: 10696.01ms | tok/sec: 49017.15
step 35 | loss: 7.414855 | lr 2.1600e-04 | norm: 0.7272 | dt: 10580.76ms | tok/sec: 49551.08
step 36 | loss: 7.393157 | lr 2.2200e-04 | norm: 0.8536 | dt: 10748.95ms | tok/sec: 48775.74
step 37 | loss: 7.287198 | lr 2.2800e-04 | norm: 0.5487 | dt: 10921.08ms | tok/sec: 48006.98
step 38 | loss: 7.252760 | lr 2.3400e-04 | norm: 0.4738 | dt: 10716.44ms | tok/sec: 48923.69
step 39 | loss: 7.292991 | lr 2.4000e-04 | norm: 0.5769 | dt: 10659.42ms | tok/sec: 49185.43
step 40 | loss: 7.251584 | lr 2.4600e-04 | norm: 0.9509 | dt: 10570.06ms | tok/sec: 49601.22
step 41 | loss: 7.209351 | lr 2.5200e-04 | norm: 1.7773 | dt: 10611.45ms | tok/sec: 49407.78
step 42 | loss: 7.140303 | lr 2.5800e-04 | norm: 0.9441 | dt: 10753.44ms | tok/sec: 48755.36
step 43 | loss: 7.216593 | lr 2.6400e-04 | norm: 2.1513 | dt: 10632.68ms | tok/sec: 49309.09
step 44 | loss: 7.155683 | lr 2.7000e-04 | norm: 1.3599 | dt: 10780.88ms | tok/sec: 48631.27
step 45 | loss: 7.159153 | lr 2.7600e-04 | norm: 1.1990 | dt: 10722.27ms | tok/sec: 48897.11
step 46 | loss: 7.126624 | lr 2.8200e-04 | norm: 0.8272 | dt: 10791.48ms | tok/sec: 48583.50
step 47 | loss: 7.190242 | lr 2.8800e-04 | norm: 0.9578 | dt: 10718.49ms | tok/sec: 48914.35
step 48 | loss: 7.194102 | lr 2.9400e-04 | norm: 0.7273 | dt: 10651.67ms | tok/sec: 49221.22
step 49 | loss: 7.113352 | lr 3.0000e-04 | norm: 1.1239 | dt: 10732.94ms | tok/sec: 48848.51
step 50 | loss: 7.169769 | lr 3.0600e-04 | norm: 1.0528 | dt: 10706.81ms | tok/sec: 48967.72
step 51 | loss: 7.103631 | lr 3.1200e-04 | norm: 1.0537 | dt: 10826.62ms | tok/sec: 48425.82
step 52 | loss: 7.092214 | lr 3.1800e-04 | norm: 0.7355 | dt: 10777.80ms | tok/sec: 48645.18
step 53 | loss: 7.021073 | lr 3.2400e-04 | norm: 0.8493 | dt: 10907.12ms | tok/sec: 48068.41
step 54 | loss: 7.030515 | lr 3.3000e-04 | norm: 0.7924 | dt: 10822.94ms | tok/sec: 48442.27
step 55 | loss: 7.027347 | lr 3.3600e-04 | norm: 0.8563 | dt: 10661.62ms | tok/sec: 49175.26
step 56 | loss: 7.007086 | lr 3.4200e-04 | norm: 1.2067 | dt: 10764.39ms | tok/sec: 48705.77
step 57 | loss: 6.978011 | lr 3.4800e-04 | norm: 0.5606 | dt: 10967.17ms | tok/sec: 47805.22
step 58 | loss: 6.919628 | lr 3.5400e-04 | norm: 1.3408 | dt: 10802.21ms | tok/sec: 48535.23
step 59 | loss: 6.887385 | lr 3.6000e-04 | norm: 1.3971 | dt: 10907.45ms | tok/sec: 48066.97
step 60 | loss: 6.879627 | lr 3.6600e-04 | norm: 0.7581 | dt: 10768.36ms | tok/sec: 48687.80
step 61 | loss: 6.906055 | lr 3.7200e-04 | norm: 0.9657 | dt: 10613.11ms | tok/sec: 49400.03
step 62 | loss: 6.795964 | lr 3.7800e-04 | norm: 0.6819 | dt: 10593.62ms | tok/sec: 49490.92
step 63 | loss: 6.780255 | lr 3.8400e-04 | norm: 0.7485 | dt: 10719.51ms | tok/sec: 48909.68
step 64 | loss: 6.767306 | lr 3.9000e-04 | norm: 0.7399 | dt: 10806.62ms | tok/sec: 48515.44
step 65 | loss: 6.801779 | lr 3.9600e-04 | norm: 0.7439 | dt: 10609.56ms | tok/sec: 49416.58
step 66 | loss: 6.721136 | lr 4.0200e-04 | norm: 0.5727 | dt: 10749.83ms | tok/sec: 48771.73
step 67 | loss: 6.750595 | lr 4.0800e-04 | norm: 0.7310 | dt: 10711.53ms | tok/sec: 48946.13
step 68 | loss: 6.730660 | lr 4.1400e-04 | norm: 0.5052 | dt: 10772.71ms | tok/sec: 48668.16
step 69 | loss: 6.631037 | lr 4.2000e-04 | norm: 0.6577 | dt: 10736.56ms | tok/sec: 48832.04
step 70 | loss: 6.612390 | lr 4.2600e-04 | norm: 0.6208 | dt: 10598.25ms | tok/sec: 49469.31
step 71 | loss: 6.643014 | lr 4.3200e-04 | norm: 0.6751 | dt: 10712.97ms | tok/sec: 48939.57
step 72 | loss: 6.602534 | lr 4.3800e-04 | norm: 0.8274 | dt: 10685.25ms | tok/sec: 49066.50
step 73 | loss: 6.606695 | lr 4.4400e-04 | norm: 1.0497 | dt: 10784.33ms | tok/sec: 48615.72
step 74 | loss: 6.532132 | lr 4.5000e-04 | norm: 0.9483 | dt: 11051.53ms | tok/sec: 47440.31
step 75 | loss: 6.571723 | lr 4.5600e-04 | norm: 0.5493 | dt: 10943.98ms | tok/sec: 47906.50
step 76 | loss: 6.519442 | lr 4.6200e-04 | norm: 0.6364 | dt: 11138.90ms | tok/sec: 47068.20
step 77 | loss: 6.553431 | lr 4.6800e-04 | norm: 0.6423 | dt: 10943.91ms | tok/sec: 47906.81
step 78 | loss: 6.525961 | lr 4.7400e-04 | norm: 0.4541 | dt: 10733.66ms | tok/sec: 48845.21
step 79 | loss: 6.474160 | lr 4.8000e-04 | norm: 0.6690 | dt: 10748.03ms | tok/sec: 48779.93
step 80 | loss: 6.481711 | lr 4.8600e-04 | norm: 0.5859 | dt: 10679.49ms | tok/sec: 49093.00
step 81 | loss: 6.486966 | lr 4.9200e-04 | norm: 0.6897 | dt: 10656.78ms | tok/sec: 49197.58
step 82 | loss: 6.430150 | lr 4.9800e-04 | norm: 0.6284 | dt: 10426.83ms | tok/sec: 50282.59
step 83 | loss: 6.387268 | lr 5.0400e-04 | norm: 0.5746 | dt: 10644.15ms | tok/sec: 49255.97
step 84 | loss: 6.405340 | lr 5.1000e-04 | norm: 0.5523 | dt: 10856.28ms | tok/sec: 48293.53
step 85 | loss: 6.371199 | lr 5.1600e-04 | norm: 0.6764 | dt: 10573.15ms | tok/sec: 49586.76
step 86 | loss: 6.367082 | lr 5.2200e-04 | norm: 0.7355 | dt: 10731.52ms | tok/sec: 48854.94
step 87 | loss: 6.404164 | lr 5.2800e-04 | norm: 0.7907 | dt: 10878.82ms | tok/sec: 48193.45
step 88 | loss: 6.383866 | lr 5.3400e-04 | norm: 0.7472 | dt: 10855.23ms | tok/sec: 48298.20
step 89 | loss: 6.428278 | lr 5.4000e-04 | norm: 0.7306 | dt: 10751.87ms | tok/sec: 48762.51
step 90 | loss: 6.355624 | lr 5.4600e-04 | norm: 0.6458 | dt: 10799.97ms | tok/sec: 48545.31
step 91 | loss: 6.356147 | lr 5.5200e-04 | norm: 0.5809 | dt: 10756.22ms | tok/sec: 48742.76
step 92 | loss: 6.407714 | lr 5.5800e-04 | norm: 0.5222 | dt: 10799.32ms | tok/sec: 48548.25
step 93 | loss: 6.488331 | lr 5.6400e-04 | norm: 0.8362 | dt: 10773.78ms | tok/sec: 48663.34
step 94 | loss: 6.541770 | lr 5.7000e-04 | norm: 1.7085 | dt: 10864.89ms | tok/sec: 48255.23
step 95 | loss: 6.541307 | lr 5.7600e-04 | norm: 1.3723 | dt: 10788.27ms | tok/sec: 48597.98
step 96 | loss: 6.460635 | lr 5.8200e-04 | norm: 0.7749 | dt: 10840.03ms | tok/sec: 48365.92
step 97 | loss: 6.439204 | lr 5.8800e-04 | norm: 1.0601 | dt: 10847.54ms | tok/sec: 48332.45
step 98 | loss: 6.489636 | lr 5.9400e-04 | norm: 1.1039 | dt: 10751.69ms | tok/sec: 48763.31
step 99 | loss: 6.463543 | lr 6.0000e-04 | norm: 1.1220 | dt: 11026.37ms | tok/sec: 47548.54
step 100 | loss: 6.475557 | lr 6.0000e-04 | norm: 0.8641 | dt: 10706.05ms | tok/sec: 48971.19
step 101 | loss: 6.403978 | lr 5.9987e-04 | norm: 0.6312 | dt: 10799.40ms | tok/sec: 48547.87
step 102 | loss: 6.399425 | lr 5.9947e-04 | norm: 0.9644 | dt: 10571.53ms | tok/sec: 49594.33
step 103 | loss: 6.291117 | lr 5.9880e-04 | norm: 0.8341 | dt: 10589.38ms | tok/sec: 49510.71
step 104 | loss: 6.395230 | lr 5.9787e-04 | norm: 0.6783 | dt: 10603.40ms | tok/sec: 49445.27
step 105 | loss: 6.381511 | lr 5.9668e-04 | norm: 0.5386 | dt: 10608.30ms | tok/sec: 49422.43
step 106 | loss: 6.345720 | lr 5.9522e-04 | norm: 0.4796 | dt: 10714.76ms | tok/sec: 48931.39
step 107 | loss: 6.295020 | lr 5.9350e-04 | norm: 0.5316 | dt: 10712.39ms | tok/sec: 48942.19
step 108 | loss: 6.354154 | lr 5.9152e-04 | norm: 0.4104 | dt: 10863.69ms | tok/sec: 48260.57
step 109 | loss: 6.346787 | lr 5.8928e-04 | norm: 0.5001 | dt: 10882.25ms | tok/sec: 48178.25
step 110 | loss: 6.309251 | lr 5.8679e-04 | norm: 0.4883 | dt: 10608.02ms | tok/sec: 49423.72
step 111 | loss: 6.281376 | lr 5.8404e-04 | norm: 0.5975 | dt: 10248.73ms | tok/sec: 51156.40
step 112 | loss: 6.262320 | lr 5.8104e-04 | norm: 0.4393 | dt: 9123.81ms | tok/sec: 57463.69
step 113 | loss: 6.289036 | lr 5.7779e-04 | norm: 0.4367 | dt: 9033.14ms | tok/sec: 58040.48
step 114 | loss: 6.315429 | lr 5.7430e-04 | norm: 0.5169 | dt: 9021.78ms | tok/sec: 58113.61
step 115 | loss: 6.286012 | lr 5.7057e-04 | norm: 0.5163 | dt: 9020.69ms | tok/sec: 58120.62
step 116 | loss: 6.218066 | lr 5.6660e-04 | norm: 0.4813 | dt: 9021.61ms | tok/sec: 58114.68
step 117 | loss: 6.163318 | lr 5.6240e-04 | norm: 0.5648 | dt: 9018.39ms | tok/sec: 58135.44
step 118 | loss: 6.194816 | lr 5.5797e-04 | norm: 0.7243 | dt: 9019.76ms | tok/sec: 58126.63
step 119 | loss: 6.205301 | lr 5.5331e-04 | norm: 0.5606 | dt: 9019.06ms | tok/sec: 58131.12
step 120 | loss: 6.187188 | lr 5.4843e-04 | norm: 0.5205 | dt: 9021.12ms | tok/sec: 58117.87
step 121 | loss: 6.149425 | lr 5.4334e-04 | norm: 0.5132 | dt: 9019.32ms | tok/sec: 58129.44
step 122 | loss: 6.156881 | lr 5.3804e-04 | norm: 0.4721 | dt: 9030.19ms | tok/sec: 58059.47
step 123 | loss: 6.160114 | lr 5.3253e-04 | norm: 0.5163 | dt: 9019.01ms | tok/sec: 58131.42
step 124 | loss: 6.161614 | lr 5.2682e-04 | norm: 0.3730 | dt: 9021.48ms | tok/sec: 58115.54
step 125 | loss: 6.162668 | lr 5.2092e-04 | norm: 0.4222 | dt: 9022.97ms | tok/sec: 58105.90
step 126 | loss: 6.142958 | lr 5.1483e-04 | norm: 0.3661 | dt: 9025.08ms | tok/sec: 58092.34
step 127 | loss: 6.107336 | lr 5.0855e-04 | norm: 0.3189 | dt: 9022.45ms | tok/sec: 58109.29
step 128 | loss: 6.059753 | lr 5.0210e-04 | norm: 0.3107 | dt: 9017.19ms | tok/sec: 58143.18
step 129 | loss: 6.064310 | lr 4.9548e-04 | norm: 0.3808 | dt: 9027.10ms | tok/sec: 58079.35
step 130 | loss: 6.106601 | lr 4.8870e-04 | norm: 0.3701 | dt: 9025.69ms | tok/sec: 58088.43
step 131 | loss: 6.069602 | lr 4.8176e-04 | norm: 0.3277 | dt: 9014.09ms | tok/sec: 58163.13
step 132 | loss: 6.078692 | lr 4.7467e-04 | norm: 0.3552 | dt: 9023.25ms | tok/sec: 58104.11
step 133 | loss: 5.993310 | lr 4.6744e-04 | norm: 0.4006 | dt: 9025.95ms | tok/sec: 58086.72
step 134 | loss: 6.013237 | lr 4.6007e-04 | norm: 0.4799 | dt: 9018.29ms | tok/sec: 58136.08
step 135 | loss: 6.053710 | lr 4.5258e-04 | norm: 0.4524 | dt: 9032.08ms | tok/sec: 58047.32
step 136 | loss: 6.033798 | lr 4.4496e-04 | norm: 0.3394 | dt: 9026.17ms | tok/sec: 58085.34
step 137 | loss: 6.055409 | lr 4.3723e-04 | norm: 0.3845 | dt: 9021.99ms | tok/sec: 58112.23
step 138 | loss: 6.007836 | lr 4.2939e-04 | norm: 0.4304 | dt: 9036.39ms | tok/sec: 58019.64
step 139 | loss: 6.109036 | lr 4.2146e-04 | norm: 0.3833 | dt: 9019.26ms | tok/sec: 58129.83
step 140 | loss: 6.218612 | lr 4.1343e-04 | norm: 0.3712 | dt: 9023.31ms | tok/sec: 58103.75
step 141 | loss: 6.109329 | lr 4.0533e-04 | norm: 0.3751 | dt: 9032.97ms | tok/sec: 58041.61
step 142 | loss: 6.157863 | lr 3.9715e-04 | norm: 0.3973 | dt: 9026.04ms | tok/sec: 58086.18
step 143 | loss: 6.105368 | lr 3.8890e-04 | norm: 0.4718 | dt: 9024.10ms | tok/sec: 58098.66
step 144 | loss: 6.112780 | lr 3.8059e-04 | norm: 0.5495 | dt: 9024.62ms | tok/sec: 58095.32
step 145 | loss: 6.094649 | lr 3.7224e-04 | norm: 0.4203 | dt: 9219.34ms | tok/sec: 56868.26
step 146 | loss: 6.120586 | lr 3.6384e-04 | norm: 0.3370 | dt: 9118.57ms | tok/sec: 57496.73
step 147 | loss: 6.128690 | lr 3.5541e-04 | norm: 0.3505 | dt: 9019.72ms | tok/sec: 58126.89
step 148 | loss: 6.126965 | lr 3.4695e-04 | norm: 0.3768 | dt: 9027.35ms | tok/sec: 58077.76
step 149 | loss: 6.087430 | lr 3.3848e-04 | norm: 0.2887 | dt: 9014.19ms | tok/sec: 58162.52
step 150 | loss: 6.099020 | lr 3.3000e-04 | norm: 0.3975 | dt: 9018.30ms | tok/sec: 58135.99
step 151 | loss: 6.011409 | lr 3.2152e-04 | norm: 0.3445 | dt: 9018.43ms | tok/sec: 58135.15
step 152 | loss: 6.053518 | lr 3.1305e-04 | norm: 0.2765 | dt: 9021.84ms | tok/sec: 58113.21
step 153 | loss: 6.096207 | lr 3.0459e-04 | norm: 0.3268 | dt: 9022.89ms | tok/sec: 58106.44
step 154 | loss: 6.014778 | lr 2.9616e-04 | norm: 0.4205 | dt: 9023.55ms | tok/sec: 58102.17
step 155 | loss: 5.993350 | lr 2.8776e-04 | norm: 0.2954 | dt: 9016.75ms | tok/sec: 58146.04
step 156 | loss: 6.027627 | lr 2.7941e-04 | norm: 0.3306 | dt: 9031.58ms | tok/sec: 58050.53
step 157 | loss: 6.092584 | lr 2.7110e-04 | norm: 0.3101 | dt: 9025.19ms | tok/sec: 58091.62
step 158 | loss: 6.105118 | lr 2.6285e-04 | norm: 0.2992 | dt: 9019.38ms | tok/sec: 58129.02
step 159 | loss: 6.017125 | lr 2.5467e-04 | norm: 0.3080 | dt: 9016.80ms | tok/sec: 58145.71
step 160 | loss: 5.959670 | lr 2.4657e-04 | norm: 0.2711 | dt: 9024.38ms | tok/sec: 58096.83
step 161 | loss: 6.058784 | lr 2.3854e-04 | norm: 0.2906 | dt: 9024.04ms | tok/sec: 58099.06
step 162 | loss: 5.958908 | lr 2.3061e-04 | norm: 0.2375 | dt: 9025.14ms | tok/sec: 58091.94
step 163 | loss: 5.928731 | lr 2.2277e-04 | norm: 0.3086 | dt: 9024.43ms | tok/sec: 58096.51
step 164 | loss: 5.932847 | lr 2.1504e-04 | norm: 0.2456 | dt: 9031.44ms | tok/sec: 58051.43
step 165 | loss: 5.987537 | lr 2.0742e-04 | norm: 0.3180 | dt: 9034.98ms | tok/sec: 58028.72
step 166 | loss: 5.846995 | lr 1.9993e-04 | norm: 0.3659 | dt: 9028.83ms | tok/sec: 58068.20
step 167 | loss: 5.949950 | lr 1.9256e-04 | norm: 0.3790 | dt: 9024.60ms | tok/sec: 58095.40
step 168 | loss: 5.925792 | lr 1.8533e-04 | norm: 0.2998 | dt: 9023.44ms | tok/sec: 58102.89
step 169 | loss: 5.927565 | lr 1.7824e-04 | norm: 0.3140 | dt: 9021.85ms | tok/sec: 58113.12
step 170 | loss: 5.913670 | lr 1.7130e-04 | norm: 0.3304 | dt: 9031.75ms | tok/sec: 58049.43
step 171 | loss: 5.944331 | lr 1.6452e-04 | norm: 0.2440 | dt: 9029.87ms | tok/sec: 58061.54
step 172 | loss: 5.913747 | lr 1.5790e-04 | norm: 0.3646 | dt: 9022.08ms | tok/sec: 58111.66
step 173 | loss: 5.894815 | lr 1.5145e-04 | norm: 0.2861 | dt: 9027.03ms | tok/sec: 58079.77
step 174 | loss: 5.846126 | lr 1.4517e-04 | norm: 0.2546 | dt: 9021.01ms | tok/sec: 58118.55
step 175 | loss: 5.903183 | lr 1.3908e-04 | norm: 0.2809 | dt: 9023.06ms | tok/sec: 58105.36
step 176 | loss: 5.857369 | lr 1.3318e-04 | norm: 0.2143 | dt: 9018.74ms | tok/sec: 58133.16
step 177 | loss: 5.902529 | lr 1.2747e-04 | norm: 0.2514 | dt: 9017.04ms | tok/sec: 58144.11
step 178 | loss: 5.833840 | lr 1.2196e-04 | norm: 0.2743 | dt: 9027.74ms | tok/sec: 58075.25
step 179 | loss: 5.825159 | lr 1.1666e-04 | norm: 0.2201 | dt: 9018.76ms | tok/sec: 58133.05
step 180 | loss: 5.823802 | lr 1.1157e-04 | norm: 0.2582 | dt: 9026.48ms | tok/sec: 58083.35
step 181 | loss: 5.850857 | lr 1.0669e-04 | norm: 0.2286 | dt: 9032.15ms | tok/sec: 58046.85
step 182 | loss: 5.852230 | lr 1.0203e-04 | norm: 0.2073 | dt: 9025.26ms | tok/sec: 58091.20
step 183 | loss: 5.848113 | lr 9.7600e-05 | norm: 0.2366 | dt: 9030.65ms | tok/sec: 58056.49
step 184 | loss: 5.875956 | lr 9.3397e-05 | norm: 0.2153 | dt: 9036.46ms | tok/sec: 58019.15
step 185 | loss: 5.925734 | lr 8.9428e-05 | norm: 0.2497 | dt: 9028.40ms | tok/sec: 58070.95
step 186 | loss: 5.951926 | lr 8.5697e-05 | norm: 0.2276 | dt: 9026.99ms | tok/sec: 58080.07
step 187 | loss: 6.008245 | lr 8.2206e-05 | norm: 0.2261 | dt: 9028.58ms | tok/sec: 58069.83
step 188 | loss: 5.967976 | lr 7.8960e-05 | norm: 0.2432 | dt: 9014.70ms | tok/sec: 58159.21
step 189 | loss: 5.948523 | lr 7.5962e-05 | norm: 0.2389 | dt: 9028.44ms | tok/sec: 58070.69
step 190 | loss: 5.992687 | lr 7.3215e-05 | norm: 0.2226 | dt: 9238.47ms | tok/sec: 56750.51
step 191 | loss: 5.945471 | lr 7.0721e-05 | norm: 0.2415 | dt: 9019.78ms | tok/sec: 58126.50
step 192 | loss: 5.965812 | lr 6.8483e-05 | norm: 0.2324 | dt: 9027.71ms | tok/sec: 58075.41
step 193 | loss: 5.967308 | lr 6.6502e-05 | norm: 0.2536 | dt: 9022.99ms | tok/sec: 58105.79
step 194 | loss: 5.894364 | lr 6.4782e-05 | norm: 0.2591 | dt: 9029.48ms | tok/sec: 58064.04
step 195 | loss: 5.926851 | lr 6.3324e-05 | norm: 0.2056 | dt: 9032.62ms | tok/sec: 58043.81
step 196 | loss: 5.889875 | lr 6.2129e-05 | norm: 0.2426 | dt: 9032.17ms | tok/sec: 58046.73
step 197 | loss: 5.931971 | lr 6.1198e-05 | norm: 0.2178 | dt: 9028.09ms | tok/sec: 58072.95
step 198 | loss: 5.929649 | lr 6.0533e-05 | norm: 0.2386 | dt: 9027.39ms | tok/sec: 58077.47
validation loss: 5.9230
HellaSwag accuracy: 2440/10042=0.2430
rank 0 sample 0: Hello, I'm a language model, and the new student, we don't give for our study. A person to the child from all, I have no
rank 0 sample 1: Hello, I'm a language model, then go to be the original work without the most simple idea. But now is a good idea is very good topic for
rank 0 sample 2: Hello, I'm a language model, the two number of light for the time, this post, is to the same amount of the same same time. Some
rank 0 sample 3: Hello, I'm a language model, which allows the first, or the data is by the current application.
A video in the text or the same and
step 199 | loss: 5.896696 | lr 6.0133e-05 | norm: 0.2305 | dt: 77571.04ms | tok/sec: 6758.81
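
The `lr` and `tok/sec` columns of this log can be reproduced with a few lines of arithmetic. The sketch below assumes the schedule used in Karpathy's build-nanoGPT: linear warmup followed by cosine decay. The constants are inferred from the log rather than read from the package's `TrainingConfig`: peak rate 6e-4, floor at 10% of the peak, 100 warmup steps, and 200 steps in total. The micro-batch shape `B=16, T=1024` on a single process is one combination consistent with 32 accumulation steps, not a confirmed setting.

```python
import math

# Constants inferred from the log above (assumptions, not package defaults)
MAX_LR = 6e-4
MIN_LR = MAX_LR * 0.1
WARMUP_STEPS = 100
MAX_STEPS = 200
TOTAL_BATCH_SIZE = 524288          # tokens per optimizer step, 2**19

def get_lr(step):
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step > MAX_STEPS:
        return MIN_LR
    decay_ratio = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))   # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

def grad_accum_steps(total_batch_size, B, T, world_size=1):
    """Micro-batches accumulated before each optimizer step."""
    assert total_batch_size % (B * T * world_size) == 0
    return total_batch_size // (B * T * world_size)

def tokens_per_sec(total_batch_size, dt_ms):
    """Throughput for one optimizer step that took dt_ms milliseconds."""
    return total_batch_size / (dt_ms / 1000.0)

for step in (0, 99, 101, 150, 199):
    print(f"step {step:3d} | lr {get_lr(step):.4e}")

print(grad_accum_steps(TOTAL_BATCH_SIZE, B=16, T=1024))    # hypothetical B and T
print(f"{tokens_per_sec(TOTAL_BATCH_SIZE, 10492.83):.0f}") # compare with step 1
```

Steps 0 and 199 take far longer than the rest because they include the validation, HellaSwag, and sampling passes printed just before them.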
### Visualize the Loss
``` python
from buildNanoGPT.viz import plot_log
```
``` python
plot_log(log_file='log/log_6500steps.txt', sz='124M')
```
Min Train Loss: 2.997356
Min Validation Loss: 3.275
Max Hellaswag eval: 0.2782
![](index_files/figure-commonmark/cell-18-output-2.png)
## How to install
The [buildNanoGPT](https://pypi.org/project/buildNanoGPT/) package is
available on [PyPI](https://pypi.org/) and can be installed with the
command below.
`pip install buildNanoGPT`
### Developer install
If you want to develop `buildNanoGPT` yourself, please use an editable
installation.
`git clone https://github.com/hdocmsu/buildNanoGPT.git`
`pip install -e "buildNanoGPT[dev]"`
You also need to use an editable installation of
[nbdev](https://github.com/fastai/nbdev),
[fastcore](https://github.com/fastai/fastcore), and
[execnb](https://github.com/fastai/execnb).
Happy Coding!!!
<div class="alert alert-info">
<b>Note:</b> `buildNanoGPT` is currently a work in progress (WIP).
</div>
Raw data
{
"_id": null,
"home_page": "https://github.com/hdocmsu/buildNanoGPT/",
"name": "buildNanoGPT",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "nbdev",
"author": "Hung Do, PhD",
"author_email": "clinicalcollaborations@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/da/3d/63f9a14039ca8eda62723556d94ceaa8bc3a7c199ca8d5757bf0212748aa/buildnanogpt-0.1.1.tar.gz",
"platform": null,
"description": "# buildNanoGPT\n\n\n<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->\n\n> `buildNanoGPT` is developed based on Andrej Karpathy\u2019s\n> [build-nanoGPT](https://github.com/karpathy/build-nanoGPT) repo and\n> [Let\u2019s reproduce GPT-2\n> (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU) with added notes\n> and details for teaching purposes using\n> [nbdev](https://nbdev.fast.ai/), which enables package development,\n> testing, documentation, and dissemination all in one place - Jupyter\n> Notebook or Visual Studio Code Jupyter Notebook in my case \ud83d\ude04.\n\n## Literate Programming\n\n`buildNanoGPT`\n\n``` mermaid\nflowchart LR\n A(Andrej's build-nanoGPT) --> C((Combination))\n B(Jeremy's nbdev) --> C\n C -->|Literate Programming| D(buildNanoGPT)\n```\n\n`micrograd2023`\n\n<img src='media/literate_programming.svg' width=100% height=auto >\n\n## Disclaimers\n\n`buildNanoGPT` is written based on [Andrej\nKarpathy](https://karpathy.ai/)\u2019s github repo named\n[build-nanoGPT](https://github.com/karpathy/makemore) and his [\u201cNeural\nNetworks: Zero to\nHero\u201d](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)\nlecture series. Specifically the lecture called [Let\u2019s reproduce GPT-2\n(124M)](https://www.youtube.com/watch?v=l8pRSuU81PU).\n\nAndrej is the man who needs no introduction in the field of Deep\nLearning. He released a series of lectures called [Neural Network: Zero\nto Hero](https://karpathy.ai/zero-to-hero.html), which I found extremely\neducational and practical. I am reviewing the lectures and creating\nnotes for myself and for teaching purposes.\n\n`buildNanoGPT` was written using [nbdev](https://nbdev.fast.ai/), which\nwas developed by [Jeremy Howard](https://jeremy.fast.ai/), the man who\nalso needs no introduction in the field of Deep Learning. 
Jeremy created\n`fastai` Deep Learning software [library](https://docs.fast.ai/) and\n[Courses](https://course.fast.ai/) that are extremely influential. I\nhighly recommend `fastai` if you are interested in starting your journey\nand learning with ML and DL.\n\n`nbdev` is a powerful tool that can be used to efficiently develop,\nbuild, test, document, and distribute software packages all in one\nplace, Jupyter Notebook or Jupyter Notebooks in VS Code, which I am\nusing.\n\nIf you study lectures by Andrej and Jeremy you will probably notice that\nthey are both great educators and utilize both top-down and bottom-up\napproaches in their teaching, but Andrej predominantly uses *bottom-up*\napproach while Jeremy predominantly uses *top-down* one. I personally\nfascinated by both educators and found values from both of them and hope\nyou are too!\n\n## Usage\n\n### Prepare FineWeb-Edu-10B data\n\n``` python\nfrom buildNanoGPT import data\nimport tiktoken\nimport numpy as np\n```\n\n``` python\nenc = tiktoken.get_encoding(\"gpt2\")\neot = enc._special_tokens['<|endoftext|>'] # end of text token\neot\n```\n\n 50256\n\n``` python\nt_ref = [eot]\nt_ref.extend(enc.encode(\"Hello, world!\"))\nt_ref = np.array(t_ref).astype(np.uint16)\nt_ref\n```\n\n array([50256, 15496, 11, 995, 0], dtype=uint16)\n\n``` python\nt_ref = [eot]\nt_ref.extend(enc.encode(\"Hello, world!\"))\nt_ref = np.array(t_ref).astype(np.int32)\nt_ref\n```\n\n array([50256, 15496, 11, 995, 0], dtype=int32)\n\n``` python\ndoc = {\"text\":\"Hello, world!\"}\nt_test = data.tokenize(doc)\nt_test\n```\n\n array([50256, 15496, 11, 995, 0], dtype=uint16)\n\n``` python\nassert np.all(t_ref == t_test)\n```\n\n``` python\n# Download and Prepare the FineWeb-Edu-10B sample Data\ndata.edu_fineweb10B_prep(is_test=True)\n```\n\n Resolving data files: 0%| | 0/1630 [00:00<?, ?it/s]\n\n Loading dataset shards: 0%| | 0/98 [00:00<?, ?it/s]\n\n 'Hello from `prepare_edu_fineweb10B()`! 
if you want to download the dataset, set is_test=False and run again.'\n\n### Prepare HellaSwag Evaluation data\n\n``` python\ndata.hellaswag_val_prep(is_test=True)\n```\n\n 'Hello from `hellaswag_val_prep()`! if you want to download the dataset, set is_test=False and run again.'\n\n### Load Pre-trained Weight\n\n``` python\nfrom buildNanoGPT.model import GPT, GPTConfig\nfrom buildNanoGPT.train import DDPConfig, TrainingConfig, generate_text\nimport tiktoken\nimport torch\nfrom torch.nn import functional as F\n```\n\n``` python\nmaster_process = True\nmodel = GPT.from_pretrained(\"gpt2\", master_process)\n```\n\n loading weights from pretrained gpt: gpt2\n\n``` python\nenc = tiktoken.get_encoding('gpt2')\n```\n\n``` python\nddp_cf = DDPConfig()\nmodel.to(ddp_cf.device)\n```\n\n using device: cuda\n\n GPT(\n (transformer): ModuleDict(\n (wte): Embedding(50257, 768)\n (wpe): Embedding(1024, 768)\n (h): ModuleList(\n (0-11): 12 x Block(\n (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n (attn): CausalSelfAttention(\n (c_attn): Linear(in_features=768, out_features=2304, bias=True)\n (c_proj): Linear(in_features=768, out_features=768, bias=True)\n )\n (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n (mlp): MLP(\n (c_fc): Linear(in_features=768, out_features=3072, bias=True)\n (gelu): GELU(approximate='tanh')\n (c_proj): Linear(in_features=3072, out_features=768, bias=True)\n )\n )\n )\n (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n )\n (lm_head): Linear(in_features=768, out_features=50257, bias=False)\n )\n\n``` python\ngenerate_text(model, enc, ddp_cf)\n```\n\n rank 0 sample 0: Hello, I'm a language model, and I do not want to use some third-party file manager I used on my laptop. It would probably be easier\n rank 0 sample 1: Hello, I'm a language model, not a problem solver. I should be writing. 
In the first book, I was in the trouble of proving that\n rank 0 sample 2: Hello, I'm a language model, not a script,\" he said.\n\n Banks and regulators will likely be wary of such a move, but for\n rank 0 sample 3: Hello, I'm a language model, you must understand this.\n\n So what really happened?\n\n This article would be too short and concise. That\n\n### Training\n\n``` python\n# either running 03_train.ipynb or short-cut by running train script from the buildNanoGPT package\nfrom buildNanoGPT import train\n```\n\n using device: cuda\n total desired batch size: 524288\n => calculated gradient accumulation steps: 32\n found 99 shards for split train\n found 1 shards for split val\n num decayed parameter tensors: 50, with 124,354,560 parameters\n num non-decayed parameter tensors: 98, with 121,344 parameters\n using fused AdamW: True\n validation loss: 10.9834\n HellaSwag accuracy: 2534/10042=0.2523\n step 0 | loss: 10.981724 | lr 6.0000e-06 | norm: 15.4339 | dt: 82809.98ms | tok/sec: 6331.22\n step 1 | loss: 10.655205 | lr 1.2000e-05 | norm: 12.4931 | dt: 10492.83ms | tok/sec: 49966.29\n step 2 | loss: 10.274603 | lr 1.8000e-05 | norm: 7.7501 | dt: 10522.88ms | tok/sec: 49823.61\n step 3 | loss: 10.004156 | lr 2.4000e-05 | norm: 5.2698 | dt: 10481.91ms | tok/sec: 50018.35\n step 4 | loss: 9.833108 | lr 3.0000e-05 | norm: 3.6179 | dt: 10495.18ms | tok/sec: 49955.14\n step 5 | loss: 9.711222 | lr 3.6000e-05 | norm: 2.7871 | dt: 10484.25ms | tok/sec: 50007.21\n step 6 | loss: 9.642426 | lr 4.2000e-05 | norm: 2.4048 | dt: 10679.06ms | tok/sec: 49094.97\n step 7 | loss: 9.612312 | lr 4.8000e-05 | norm: 2.3183 | dt: 10555.78ms | tok/sec: 49668.32\n step 8 | loss: 9.558184 | lr 5.4000e-05 | norm: 2.2464 | dt: 10685.39ms | tok/sec: 49065.86\n step 9 | loss: 9.526472 | lr 6.0000e-05 | norm: 2.2171 | dt: 10548.39ms | tok/sec: 49703.14\n step 10 | loss: 9.463450 | lr 6.6000e-05 | norm: 2.1546 | dt: 10559.73ms | tok/sec: 49649.78\n step 11 | loss: 9.413282 | lr 7.2000e-05 
| norm: 2.1401 | dt: 10495.94ms | tok/sec: 49951.49\n step 12 | loss: 9.340552 | lr 7.8000e-05 | norm: 2.0149 | dt: 10668.78ms | tok/sec: 49142.26\n step 13 | loss: 9.278631 | lr 8.4000e-05 | norm: 1.9368 | dt: 10605.16ms | tok/sec: 49437.05\n step 14 | loss: 9.159446 | lr 9.0000e-05 | norm: 1.9737 | dt: 10701.77ms | tok/sec: 48990.76\n step 15 | loss: 9.111786 | lr 9.6000e-05 | norm: 3.0525 | dt: 10732.83ms | tok/sec: 48849.00\n step 16 | loss: 9.029915 | lr 1.0200e-04 | norm: 1.9619 | dt: 10790.65ms | tok/sec: 48587.23\n step 17 | loss: 8.937255 | lr 1.0800e-04 | norm: 1.8786 | dt: 10621.46ms | tok/sec: 49361.22\n step 18 | loss: 8.955976 | lr 1.1400e-04 | norm: 2.0179 | dt: 10545.33ms | tok/sec: 49717.53\n step 19 | loss: 8.888343 | lr 1.2000e-04 | norm: 1.9142 | dt: 10598.08ms | tok/sec: 49470.11\n step 20 | loss: 8.672051 | lr 1.2600e-04 | norm: 1.7543 | dt: 10730.04ms | tok/sec: 48861.68\n step 21 | loss: 8.556496 | lr 1.3200e-04 | norm: 1.6246 | dt: 10822.08ms | tok/sec: 48446.13\n step 22 | loss: 8.463942 | lr 1.3800e-04 | norm: 1.4898 | dt: 10733.11ms | tok/sec: 48847.72\n step 23 | loss: 8.389053 | lr 1.4400e-04 | norm: 1.9412 | dt: 10555.51ms | tok/sec: 49669.61\n step 24 | loss: 8.257857 | lr 1.5000e-04 | norm: 2.0539 | dt: 10732.67ms | tok/sec: 48849.75\n step 25 | loss: 8.128786 | lr 1.5600e-04 | norm: 1.4269 | dt: 10609.93ms | tok/sec: 49414.84\n step 26 | loss: 8.098352 | lr 1.6200e-04 | norm: 2.0206 | dt: 10487.59ms | tok/sec: 49991.30\n step 27 | loss: 7.961097 | lr 1.6800e-04 | norm: 1.2978 | dt: 10578.22ms | tok/sec: 49562.95\n step 28 | loss: 7.884172 | lr 1.7400e-04 | norm: 1.2289 | dt: 10497.51ms | tok/sec: 49944.04\n step 29 | loss: 7.765845 | lr 1.8000e-04 | norm: 1.1969 | dt: 10724.78ms | tok/sec: 48885.65\n step 30 | loss: 7.821087 | lr 1.8600e-04 | norm: 1.0228 | dt: 10792.80ms | tok/sec: 48577.58\n step 31 | loss: 7.689835 | lr 1.9200e-04 | norm: 0.9216 | dt: 10752.80ms | tok/sec: 48758.30\n step 32 | loss: 7.641486 | lr 1.9800e-04 | 
norm: 0.8666 | dt: 10985.01ms | tok/sec: 47727.58\n step 33 | loss: 7.572504 | lr 2.0400e-04 | norm: 0.7996 | dt: 10684.39ms | tok/sec: 49070.46\n step 34 | loss: 7.429519 | lr 2.1000e-04 | norm: 0.7874 | dt: 10696.01ms | tok/sec: 49017.15\n step 35 | loss: 7.414855 | lr 2.1600e-04 | norm: 0.7272 | dt: 10580.76ms | tok/sec: 49551.08\n step 36 | loss: 7.393157 | lr 2.2200e-04 | norm: 0.8536 | dt: 10748.95ms | tok/sec: 48775.74\n step 37 | loss: 7.287198 | lr 2.2800e-04 | norm: 0.5487 | dt: 10921.08ms | tok/sec: 48006.98\n step 38 | loss: 7.252760 | lr 2.3400e-04 | norm: 0.4738 | dt: 10716.44ms | tok/sec: 48923.69\n step 39 | loss: 7.292991 | lr 2.4000e-04 | norm: 0.5769 | dt: 10659.42ms | tok/sec: 49185.43\n step 40 | loss: 7.251584 | lr 2.4600e-04 | norm: 0.9509 | dt: 10570.06ms | tok/sec: 49601.22\n step 41 | loss: 7.209351 | lr 2.5200e-04 | norm: 1.7773 | dt: 10611.45ms | tok/sec: 49407.78\n step 42 | loss: 7.140303 | lr 2.5800e-04 | norm: 0.9441 | dt: 10753.44ms | tok/sec: 48755.36\n step 43 | loss: 7.216593 | lr 2.6400e-04 | norm: 2.1513 | dt: 10632.68ms | tok/sec: 49309.09\n step 44 | loss: 7.155683 | lr 2.7000e-04 | norm: 1.3599 | dt: 10780.88ms | tok/sec: 48631.27\n step 45 | loss: 7.159153 | lr 2.7600e-04 | norm: 1.1990 | dt: 10722.27ms | tok/sec: 48897.11\n step 46 | loss: 7.126624 | lr 2.8200e-04 | norm: 0.8272 | dt: 10791.48ms | tok/sec: 48583.50\n step 47 | loss: 7.190242 | lr 2.8800e-04 | norm: 0.9578 | dt: 10718.49ms | tok/sec: 48914.35\n step 48 | loss: 7.194102 | lr 2.9400e-04 | norm: 0.7273 | dt: 10651.67ms | tok/sec: 49221.22\n step 49 | loss: 7.113352 | lr 3.0000e-04 | norm: 1.1239 | dt: 10732.94ms | tok/sec: 48848.51\n step 50 | loss: 7.169769 | lr 3.0600e-04 | norm: 1.0528 | dt: 10706.81ms | tok/sec: 48967.72\n step 51 | loss: 7.103631 | lr 3.1200e-04 | norm: 1.0537 | dt: 10826.62ms | tok/sec: 48425.82\n step 52 | loss: 7.092214 | lr 3.1800e-04 | norm: 0.7355 | dt: 10777.80ms | tok/sec: 48645.18\n step 53 | loss: 7.021073 | lr 3.2400e-04 | 
norm: 0.8493 | dt: 10907.12ms | tok/sec: 48068.41\n step 54 | loss: 7.030515 | lr 3.3000e-04 | norm: 0.7924 | dt: 10822.94ms | tok/sec: 48442.27\n step 55 | loss: 7.027347 | lr 3.3600e-04 | norm: 0.8563 | dt: 10661.62ms | tok/sec: 49175.26\n step 56 | loss: 7.007086 | lr 3.4200e-04 | norm: 1.2067 | dt: 10764.39ms | tok/sec: 48705.77\n step 57 | loss: 6.978011 | lr 3.4800e-04 | norm: 0.5606 | dt: 10967.17ms | tok/sec: 47805.22\n step 58 | loss: 6.919628 | lr 3.5400e-04 | norm: 1.3408 | dt: 10802.21ms | tok/sec: 48535.23\n step 59 | loss: 6.887385 | lr 3.6000e-04 | norm: 1.3971 | dt: 10907.45ms | tok/sec: 48066.97\n step 60 | loss: 6.879627 | lr 3.6600e-04 | norm: 0.7581 | dt: 10768.36ms | tok/sec: 48687.80\n step 61 | loss: 6.906055 | lr 3.7200e-04 | norm: 0.9657 | dt: 10613.11ms | tok/sec: 49400.03\n step 62 | loss: 6.795964 | lr 3.7800e-04 | norm: 0.6819 | dt: 10593.62ms | tok/sec: 49490.92\n step 63 | loss: 6.780255 | lr 3.8400e-04 | norm: 0.7485 | dt: 10719.51ms | tok/sec: 48909.68\n step 64 | loss: 6.767306 | lr 3.9000e-04 | norm: 0.7399 | dt: 10806.62ms | tok/sec: 48515.44\n step 65 | loss: 6.801779 | lr 3.9600e-04 | norm: 0.7439 | dt: 10609.56ms | tok/sec: 49416.58\n step 66 | loss: 6.721136 | lr 4.0200e-04 | norm: 0.5727 | dt: 10749.83ms | tok/sec: 48771.73\n step 67 | loss: 6.750595 | lr 4.0800e-04 | norm: 0.7310 | dt: 10711.53ms | tok/sec: 48946.13\n step 68 | loss: 6.730660 | lr 4.1400e-04 | norm: 0.5052 | dt: 10772.71ms | tok/sec: 48668.16\n step 69 | loss: 6.631037 | lr 4.2000e-04 | norm: 0.6577 | dt: 10736.56ms | tok/sec: 48832.04\n step 70 | loss: 6.612390 | lr 4.2600e-04 | norm: 0.6208 | dt: 10598.25ms | tok/sec: 49469.31\n step 71 | loss: 6.643014 | lr 4.3200e-04 | norm: 0.6751 | dt: 10712.97ms | tok/sec: 48939.57\n step 72 | loss: 6.602534 | lr 4.3800e-04 | norm: 0.8274 | dt: 10685.25ms | tok/sec: 49066.50\n step 73 | loss: 6.606695 | lr 4.4400e-04 | norm: 1.0497 | dt: 10784.33ms | tok/sec: 48615.72\n step 74 | loss: 6.532132 | lr 4.5000e-04 | 
norm: 0.9483 | dt: 11051.53ms | tok/sec: 47440.31\n step 75 | loss: 6.571723 | lr 4.5600e-04 | norm: 0.5493 | dt: 10943.98ms | tok/sec: 47906.50\n step 76 | loss: 6.519442 | lr 4.6200e-04 | norm: 0.6364 | dt: 11138.90ms | tok/sec: 47068.20\n step 77 | loss: 6.553431 | lr 4.6800e-04 | norm: 0.6423 | dt: 10943.91ms | tok/sec: 47906.81\n step 78 | loss: 6.525961 | lr 4.7400e-04 | norm: 0.4541 | dt: 10733.66ms | tok/sec: 48845.21\n step 79 | loss: 6.474160 | lr 4.8000e-04 | norm: 0.6690 | dt: 10748.03ms | tok/sec: 48779.93\n step 80 | loss: 6.481711 | lr 4.8600e-04 | norm: 0.5859 | dt: 10679.49ms | tok/sec: 49093.00\n step 81 | loss: 6.486966 | lr 4.9200e-04 | norm: 0.6897 | dt: 10656.78ms | tok/sec: 49197.58\n step 82 | loss: 6.430150 | lr 4.9800e-04 | norm: 0.6284 | dt: 10426.83ms | tok/sec: 50282.59\n step 83 | loss: 6.387268 | lr 5.0400e-04 | norm: 0.5746 | dt: 10644.15ms | tok/sec: 49255.97\n step 84 | loss: 6.405340 | lr 5.1000e-04 | norm: 0.5523 | dt: 10856.28ms | tok/sec: 48293.53\n step 85 | loss: 6.371199 | lr 5.1600e-04 | norm: 0.6764 | dt: 10573.15ms | tok/sec: 49586.76\n step 86 | loss: 6.367082 | lr 5.2200e-04 | norm: 0.7355 | dt: 10731.52ms | tok/sec: 48854.94\n step 87 | loss: 6.404164 | lr 5.2800e-04 | norm: 0.7907 | dt: 10878.82ms | tok/sec: 48193.45\n step 88 | loss: 6.383866 | lr 5.3400e-04 | norm: 0.7472 | dt: 10855.23ms | tok/sec: 48298.20\n step 89 | loss: 6.428278 | lr 5.4000e-04 | norm: 0.7306 | dt: 10751.87ms | tok/sec: 48762.51\n step 90 | loss: 6.355624 | lr 5.4600e-04 | norm: 0.6458 | dt: 10799.97ms | tok/sec: 48545.31\n step 91 | loss: 6.356147 | lr 5.5200e-04 | norm: 0.5809 | dt: 10756.22ms | tok/sec: 48742.76\n step 92 | loss: 6.407714 | lr 5.5800e-04 | norm: 0.5222 | dt: 10799.32ms | tok/sec: 48548.25\n step 93 | loss: 6.488331 | lr 5.6400e-04 | norm: 0.8362 | dt: 10773.78ms | tok/sec: 48663.34\n step 94 | loss: 6.541770 | lr 5.7000e-04 | norm: 1.7085 | dt: 10864.89ms | tok/sec: 48255.23\n step 95 | loss: 6.541307 | lr 5.7600e-04 | 
norm: 1.3723 | dt: 10788.27ms | tok/sec: 48597.98\n step 96 | loss: 6.460635 | lr 5.8200e-04 | norm: 0.7749 | dt: 10840.03ms | tok/sec: 48365.92\n step 97 | loss: 6.439204 | lr 5.8800e-04 | norm: 1.0601 | dt: 10847.54ms | tok/sec: 48332.45\n step 98 | loss: 6.489636 | lr 5.9400e-04 | norm: 1.1039 | dt: 10751.69ms | tok/sec: 48763.31\n step 99 | loss: 6.463543 | lr 6.0000e-04 | norm: 1.1220 | dt: 11026.37ms | tok/sec: 47548.54\n step 100 | loss: 6.475557 | lr 6.0000e-04 | norm: 0.8641 | dt: 10706.05ms | tok/sec: 48971.19\n step 101 | loss: 6.403978 | lr 5.9987e-04 | norm: 0.6312 | dt: 10799.40ms | tok/sec: 48547.87\n step 102 | loss: 6.399425 | lr 5.9947e-04 | norm: 0.9644 | dt: 10571.53ms | tok/sec: 49594.33\n step 103 | loss: 6.291117 | lr 5.9880e-04 | norm: 0.8341 | dt: 10589.38ms | tok/sec: 49510.71\n step 104 | loss: 6.395230 | lr 5.9787e-04 | norm: 0.6783 | dt: 10603.40ms | tok/sec: 49445.27\n step 105 | loss: 6.381511 | lr 5.9668e-04 | norm: 0.5386 | dt: 10608.30ms | tok/sec: 49422.43\n step 106 | loss: 6.345720 | lr 5.9522e-04 | norm: 0.4796 | dt: 10714.76ms | tok/sec: 48931.39\n step 107 | loss: 6.295020 | lr 5.9350e-04 | norm: 0.5316 | dt: 10712.39ms | tok/sec: 48942.19\n step 108 | loss: 6.354154 | lr 5.9152e-04 | norm: 0.4104 | dt: 10863.69ms | tok/sec: 48260.57\n step 109 | loss: 6.346787 | lr 5.8928e-04 | norm: 0.5001 | dt: 10882.25ms | tok/sec: 48178.25\n step 110 | loss: 6.309251 | lr 5.8679e-04 | norm: 0.4883 | dt: 10608.02ms | tok/sec: 49423.72\n step 111 | loss: 6.281376 | lr 5.8404e-04 | norm: 0.5975 | dt: 10248.73ms | tok/sec: 51156.40\n step 112 | loss: 6.262320 | lr 5.8104e-04 | norm: 0.4393 | dt: 9123.81ms | tok/sec: 57463.69\n step 113 | loss: 6.289036 | lr 5.7779e-04 | norm: 0.4367 | dt: 9033.14ms | tok/sec: 58040.48\n step 114 | loss: 6.315429 | lr 5.7430e-04 | norm: 0.5169 | dt: 9021.78ms | tok/sec: 58113.61\n step 115 | loss: 6.286012 | lr 5.7057e-04 | norm: 0.5163 | dt: 9020.69ms | tok/sec: 58120.62\n step 116 | loss: 6.218066 | lr 
5.6660e-04 | norm: 0.4813 | dt: 9021.61ms | tok/sec: 58114.68\n step 117 | loss: 6.163318 | lr 5.6240e-04 | norm: 0.5648 | dt: 9018.39ms | tok/sec: 58135.44\n step 118 | loss: 6.194816 | lr 5.5797e-04 | norm: 0.7243 | dt: 9019.76ms | tok/sec: 58126.63\n step 119 | loss: 6.205301 | lr 5.5331e-04 | norm: 0.5606 | dt: 9019.06ms | tok/sec: 58131.12\n step 120 | loss: 6.187188 | lr 5.4843e-04 | norm: 0.5205 | dt: 9021.12ms | tok/sec: 58117.87\n step 121 | loss: 6.149425 | lr 5.4334e-04 | norm: 0.5132 | dt: 9019.32ms | tok/sec: 58129.44\n step 122 | loss: 6.156881 | lr 5.3804e-04 | norm: 0.4721 | dt: 9030.19ms | tok/sec: 58059.47\n step 123 | loss: 6.160114 | lr 5.3253e-04 | norm: 0.5163 | dt: 9019.01ms | tok/sec: 58131.42\n step 124 | loss: 6.161614 | lr 5.2682e-04 | norm: 0.3730 | dt: 9021.48ms | tok/sec: 58115.54\n step 125 | loss: 6.162668 | lr 5.2092e-04 | norm: 0.4222 | dt: 9022.97ms | tok/sec: 58105.90\n step 126 | loss: 6.142958 | lr 5.1483e-04 | norm: 0.3661 | dt: 9025.08ms | tok/sec: 58092.34\n step 127 | loss: 6.107336 | lr 5.0855e-04 | norm: 0.3189 | dt: 9022.45ms | tok/sec: 58109.29\n step 128 | loss: 6.059753 | lr 5.0210e-04 | norm: 0.3107 | dt: 9017.19ms | tok/sec: 58143.18\n step 129 | loss: 6.064310 | lr 4.9548e-04 | norm: 0.3808 | dt: 9027.10ms | tok/sec: 58079.35\n step 130 | loss: 6.106601 | lr 4.8870e-04 | norm: 0.3701 | dt: 9025.69ms | tok/sec: 58088.43\n step 131 | loss: 6.069602 | lr 4.8176e-04 | norm: 0.3277 | dt: 9014.09ms | tok/sec: 58163.13\n step 132 | loss: 6.078692 | lr 4.7467e-04 | norm: 0.3552 | dt: 9023.25ms | tok/sec: 58104.11\n step 133 | loss: 5.993310 | lr 4.6744e-04 | norm: 0.4006 | dt: 9025.95ms | tok/sec: 58086.72\n step 134 | loss: 6.013237 | lr 4.6007e-04 | norm: 0.4799 | dt: 9018.29ms | tok/sec: 58136.08\n step 135 | loss: 6.053710 | lr 4.5258e-04 | norm: 0.4524 | dt: 9032.08ms | tok/sec: 58047.32\n step 136 | loss: 6.033798 | lr 4.4496e-04 | norm: 0.3394 | dt: 9026.17ms | tok/sec: 58085.34\n step 137 | loss: 6.055409 | lr 
4.3723e-04 | norm: 0.3845 | dt: 9021.99ms | tok/sec: 58112.23\n step 138 | loss: 6.007836 | lr 4.2939e-04 | norm: 0.4304 | dt: 9036.39ms | tok/sec: 58019.64\n step 139 | loss: 6.109036 | lr 4.2146e-04 | norm: 0.3833 | dt: 9019.26ms | tok/sec: 58129.83\n step 140 | loss: 6.218612 | lr 4.1343e-04 | norm: 0.3712 | dt: 9023.31ms | tok/sec: 58103.75\n step 141 | loss: 6.109329 | lr 4.0533e-04 | norm: 0.3751 | dt: 9032.97ms | tok/sec: 58041.61\n step 142 | loss: 6.157863 | lr 3.9715e-04 | norm: 0.3973 | dt: 9026.04ms | tok/sec: 58086.18\n step 143 | loss: 6.105368 | lr 3.8890e-04 | norm: 0.4718 | dt: 9024.10ms | tok/sec: 58098.66\n step 144 | loss: 6.112780 | lr 3.8059e-04 | norm: 0.5495 | dt: 9024.62ms | tok/sec: 58095.32\n step 145 | loss: 6.094649 | lr 3.7224e-04 | norm: 0.4203 | dt: 9219.34ms | tok/sec: 56868.26\n step 146 | loss: 6.120586 | lr 3.6384e-04 | norm: 0.3370 | dt: 9118.57ms | tok/sec: 57496.73\n step 147 | loss: 6.128690 | lr 3.5541e-04 | norm: 0.3505 | dt: 9019.72ms | tok/sec: 58126.89\n step 148 | loss: 6.126965 | lr 3.4695e-04 | norm: 0.3768 | dt: 9027.35ms | tok/sec: 58077.76\n step 149 | loss: 6.087430 | lr 3.3848e-04 | norm: 0.2887 | dt: 9014.19ms | tok/sec: 58162.52\n step 150 | loss: 6.099020 | lr 3.3000e-04 | norm: 0.3975 | dt: 9018.30ms | tok/sec: 58135.99\n step 151 | loss: 6.011409 | lr 3.2152e-04 | norm: 0.3445 | dt: 9018.43ms | tok/sec: 58135.15\n step 152 | loss: 6.053518 | lr 3.1305e-04 | norm: 0.2765 | dt: 9021.84ms | tok/sec: 58113.21\n step 153 | loss: 6.096207 | lr 3.0459e-04 | norm: 0.3268 | dt: 9022.89ms | tok/sec: 58106.44\n step 154 | loss: 6.014778 | lr 2.9616e-04 | norm: 0.4205 | dt: 9023.55ms | tok/sec: 58102.17\n step 155 | loss: 5.993350 | lr 2.8776e-04 | norm: 0.2954 | dt: 9016.75ms | tok/sec: 58146.04\n step 156 | loss: 6.027627 | lr 2.7941e-04 | norm: 0.3306 | dt: 9031.58ms | tok/sec: 58050.53\n step 157 | loss: 6.092584 | lr 2.7110e-04 | norm: 0.3101 | dt: 9025.19ms | tok/sec: 58091.62\n step 158 | loss: 6.105118 | lr 
2.6285e-04 | norm: 0.2992 | dt: 9019.38ms | tok/sec: 58129.02\n step 159 | loss: 6.017125 | lr 2.5467e-04 | norm: 0.3080 | dt: 9016.80ms | tok/sec: 58145.71\n step 160 | loss: 5.959670 | lr 2.4657e-04 | norm: 0.2711 | dt: 9024.38ms | tok/sec: 58096.83\n step 161 | loss: 6.058784 | lr 2.3854e-04 | norm: 0.2906 | dt: 9024.04ms | tok/sec: 58099.06\n step 162 | loss: 5.958908 | lr 2.3061e-04 | norm: 0.2375 | dt: 9025.14ms | tok/sec: 58091.94\n step 163 | loss: 5.928731 | lr 2.2277e-04 | norm: 0.3086 | dt: 9024.43ms | tok/sec: 58096.51\n step 164 | loss: 5.932847 | lr 2.1504e-04 | norm: 0.2456 | dt: 9031.44ms | tok/sec: 58051.43\n step 165 | loss: 5.987537 | lr 2.0742e-04 | norm: 0.3180 | dt: 9034.98ms | tok/sec: 58028.72\n step 166 | loss: 5.846995 | lr 1.9993e-04 | norm: 0.3659 | dt: 9028.83ms | tok/sec: 58068.20\n step 167 | loss: 5.949950 | lr 1.9256e-04 | norm: 0.3790 | dt: 9024.60ms | tok/sec: 58095.40\n step 168 | loss: 5.925792 | lr 1.8533e-04 | norm: 0.2998 | dt: 9023.44ms | tok/sec: 58102.89\n step 169 | loss: 5.927565 | lr 1.7824e-04 | norm: 0.3140 | dt: 9021.85ms | tok/sec: 58113.12\n step 170 | loss: 5.913670 | lr 1.7130e-04 | norm: 0.3304 | dt: 9031.75ms | tok/sec: 58049.43\n step 171 | loss: 5.944331 | lr 1.6452e-04 | norm: 0.2440 | dt: 9029.87ms | tok/sec: 58061.54\n step 172 | loss: 5.913747 | lr 1.5790e-04 | norm: 0.3646 | dt: 9022.08ms | tok/sec: 58111.66\n step 173 | loss: 5.894815 | lr 1.5145e-04 | norm: 0.2861 | dt: 9027.03ms | tok/sec: 58079.77\n step 174 | loss: 5.846126 | lr 1.4517e-04 | norm: 0.2546 | dt: 9021.01ms | tok/sec: 58118.55\n step 175 | loss: 5.903183 | lr 1.3908e-04 | norm: 0.2809 | dt: 9023.06ms | tok/sec: 58105.36\n step 176 | loss: 5.857369 | lr 1.3318e-04 | norm: 0.2143 | dt: 9018.74ms | tok/sec: 58133.16\n step 177 | loss: 5.902529 | lr 1.2747e-04 | norm: 0.2514 | dt: 9017.04ms | tok/sec: 58144.11\n step 178 | loss: 5.833840 | lr 1.2196e-04 | norm: 0.2743 | dt: 9027.74ms | tok/sec: 58075.25\n step 179 | loss: 5.825159 | lr 
1.1666e-04 | norm: 0.2201 | dt: 9018.76ms | tok/sec: 58133.05\n step 180 | loss: 5.823802 | lr 1.1157e-04 | norm: 0.2582 | dt: 9026.48ms | tok/sec: 58083.35\n step 181 | loss: 5.850857 | lr 1.0669e-04 | norm: 0.2286 | dt: 9032.15ms | tok/sec: 58046.85\n step 182 | loss: 5.852230 | lr 1.0203e-04 | norm: 0.2073 | dt: 9025.26ms | tok/sec: 58091.20\n step 183 | loss: 5.848113 | lr 9.7600e-05 | norm: 0.2366 | dt: 9030.65ms | tok/sec: 58056.49\n step 184 | loss: 5.875956 | lr 9.3397e-05 | norm: 0.2153 | dt: 9036.46ms | tok/sec: 58019.15\n step 185 | loss: 5.925734 | lr 8.9428e-05 | norm: 0.2497 | dt: 9028.40ms | tok/sec: 58070.95\n step 186 | loss: 5.951926 | lr 8.5697e-05 | norm: 0.2276 | dt: 9026.99ms | tok/sec: 58080.07\n step 187 | loss: 6.008245 | lr 8.2206e-05 | norm: 0.2261 | dt: 9028.58ms | tok/sec: 58069.83\n step 188 | loss: 5.967976 | lr 7.8960e-05 | norm: 0.2432 | dt: 9014.70ms | tok/sec: 58159.21\n step 189 | loss: 5.948523 | lr 7.5962e-05 | norm: 0.2389 | dt: 9028.44ms | tok/sec: 58070.69\n step 190 | loss: 5.992687 | lr 7.3215e-05 | norm: 0.2226 | dt: 9238.47ms | tok/sec: 56750.51\n step 191 | loss: 5.945471 | lr 7.0721e-05 | norm: 0.2415 | dt: 9019.78ms | tok/sec: 58126.50\n step 192 | loss: 5.965812 | lr 6.8483e-05 | norm: 0.2324 | dt: 9027.71ms | tok/sec: 58075.41\n step 193 | loss: 5.967308 | lr 6.6502e-05 | norm: 0.2536 | dt: 9022.99ms | tok/sec: 58105.79\n step 194 | loss: 5.894364 | lr 6.4782e-05 | norm: 0.2591 | dt: 9029.48ms | tok/sec: 58064.04\n step 195 | loss: 5.926851 | lr 6.3324e-05 | norm: 0.2056 | dt: 9032.62ms | tok/sec: 58043.81\n step 196 | loss: 5.889875 | lr 6.2129e-05 | norm: 0.2426 | dt: 9032.17ms | tok/sec: 58046.73\n step 197 | loss: 5.931971 | lr 6.1198e-05 | norm: 0.2178 | dt: 9028.09ms | tok/sec: 58072.95\n step 198 | loss: 5.929649 | lr 6.0533e-05 | norm: 0.2386 | dt: 9027.39ms | tok/sec: 58077.47\n validation loss: 5.9230\n HellaSwag accuracy: 2440/10042=0.2430\n rank 0 sample 0: Hello, I'm a language model, and the new 
student, we don't give for our study. A person to the child from all, I have no\n rank 0 sample 1: Hello, I'm a language model, then go to be the original work without the most simple idea. But now is a good idea is very good topic for\n rank 0 sample 2: Hello, I'm a language model, the two number of light for the time, this post, is to the same amount of the same same time. Some\n rank 0 sample 3: Hello, I'm a language model, which allows the first, or the data is by the current application.\n A video in the text or the same and\n step 199 | loss: 5.896696 | lr 6.0133e-05 | norm: 0.2305 | dt: 77571.04ms | tok/sec: 6758.81\n\n### Visualize the Loss\n\n``` python\nfrom buildNanoGPT.viz import plot_log\n```\n\n``` python\nplot_log(log_file='log/log_6500steps.txt', sz='124M')\n```\n\n Min Train Loss: 2.997356\n Min Validation Loss: 3.275\n Max Hellaswag eval: 0.2782\n\n![](index_files/figure-commonmark/cell-18-output-2.png)\n\n## How to install\n\nThe [buildNanoGPT](https://pypi.org/project/buildNanoGPT/) package is\navailable on [PyPI](https://pypi.org/) and can be installed with\nthe command below.\n\n`pip install buildNanoGPT`\n\n### Developer install\n\nIf you want to develop `buildNanoGPT` yourself, please use an editable\ninstallation.\n\n`git clone https://github.com/hdocmsu/buildNanoGPT.git`\n\n`pip install -e \"buildNanoGPT[dev]\"`\n\nYou also need to use an editable installation of\n[nbdev](https://github.com/fastai/nbdev),\n[fastcore](https://github.com/fastai/fastcore), and\n[execnb](https://github.com/fastai/execnb).\n\nHappy Coding!!!\n\n<div class=\"alert alert-info\">\n\n<b>Note:</b> `buildNanoGPT` is currently a work in progress (WIP).\n\n</div>\n",
"bugtrack_url": null,
"license": "Apache Software License 2.0",
"summary": "A template for nbdev-based project",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/hdocmsu/buildNanoGPT/"
},
"split_keywords": [
"nbdev"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c2ac3c0182b6bc507bda19a4c3d7f9208bbea84618744bfa89ccab2febaf28fc",
"md5": "3a37d694313b02d556eba41be16c8beb",
"sha256": "f4391422e658a3b8ce8b258e4088cbc3e2d698217f96dbad56cb5bf222489153"
},
"downloads": -1,
"filename": "buildNanoGPT-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3a37d694313b02d556eba41be16c8beb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 30213,
"upload_time": "2024-07-07T11:01:27",
"upload_time_iso_8601": "2024-07-07T11:01:27.331116Z",
"url": "https://files.pythonhosted.org/packages/c2/ac/3c0182b6bc507bda19a4c3d7f9208bbea84618744bfa89ccab2febaf28fc/buildNanoGPT-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "da3d63f9a14039ca8eda62723556d94ceaa8bc3a7c199ca8d5757bf0212748aa",
"md5": "343e7ef7c28a02a525ae6a500a468d47",
"sha256": "38feba5a29772961942dff8620a177912c4e1282cbe48ad8e5edd15f314ba7a1"
},
"downloads": -1,
"filename": "buildnanogpt-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "343e7ef7c28a02a525ae6a500a468d47",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 37551,
"upload_time": "2024-07-07T11:01:29",
"upload_time_iso_8601": "2024-07-07T11:01:29.065884Z",
"url": "https://files.pythonhosted.org/packages/da/3d/63f9a14039ca8eda62723556d94ceaa8bc3a7c199ca8d5757bf0212748aa/buildnanogpt-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-07 11:01:29",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "hdocmsu",
"github_project": "buildNanoGPT",
"github_not_found": true,
"lcname": "buildnanogpt"
}