
Let's build GPT: from scratch, in code, spelled out.

Transcribed Jun 15, 2026
Intermediate 58 min read For: Software developers and machine learning practitioners with basic Python and PyTorch experience.
7.4M views, 160.3K likes, 3.2K comments, 1.2K dislikes
2.2% 📈 Moderate

AI Summary

This video provides a step-by-step tutorial on building a GPT-like Transformer language model from scratch using Python and PyTorch. The presenter explains the core concepts of the Transformer architecture, including self-attention, multi-head attention, and feed-forward networks, and implements them in code. The tutorial culminates in training a character-level language model on the Tiny Shakespeare dataset to generate Shakespeare-like text.

[00:00]
Introduction to ChatGPT and Language Models

ChatGPT is a probabilistic system that generates text based on prompts. It is a language model that models sequences of tokens.

[02:05]
Transformer Architecture Origin

The Transformer architecture was introduced in the 2017 paper 'Attention is All You Need'. GPT stands for 'Generative Pre-trained Transformer' (spoken in the video as 'generatively pre-trained').

[03:53]
Tiny Shakespeare Dataset

The tutorial uses the Tiny Shakespeare dataset (1MB, concatenated works of Shakespeare) to train a character-level language model.

[05:43]
NanoGPT Repository

The code is available in the nanoGPT repository on GitHub, consisting of two files (model.py and train.py) of about 300 lines each.

[08:00]
Tokenization

Character-level tokenization is used: each character is mapped to an integer. The vocabulary size is 65 characters.
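A minimal sketch of such a tokenizer, in plain Python. The sample `text` and the names `stoi`, `itos`, `encode`, and `decode` are assumptions chosen for illustration; the tutorial builds its vocabulary from the full Tiny Shakespeare file, which is what yields 65 characters.

```python
# Character-level tokenizer sketch. `text` is a short stand-in sample;
# the tutorial reads the whole Tiny Shakespeare file instead.
text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(text))      # unique characters in a fixed, sorted order
vocab_size = len(chars)        # one token id per distinct character

stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> character


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hear me")
print(ids)
print(decode(ids))   # round-trips to the original string
```

With this scheme the vocabulary stays tiny, but every character costs one token, so sequences are long.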

[14:28]
Data Batching

Data is processed in chunks (blocks) of size block_size. Each chunk contains multiple training examples (predicting next character given context).
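To make "multiple training examples per chunk" concrete, here is a small sketch in plain Python. The first four token ids (18, 47, 56, 57) are the ones read out in the transcript; the remaining values and the variable names are illustrative assumptions.

```python
block_size = 8

# A chunk of block_size + 1 token ids (values after the first four are illustrative).
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]

x = chunk[:block_size]         # inputs
y = chunk[1:block_size + 1]    # targets: the same sequence shifted by one

examples = []
for t in range(block_size):
    context = x[:t + 1]        # everything up to and including position t
    target = y[t]              # the token that follows that context
    examples.append((context, target))
    print(f"when input is {context} the target is {target}")
```

A chunk of nine tokens therefore yields eight examples, with context lengths running from 1 up to block_size.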

[22:19]
Bigram Language Model

A simple bigram model is implemented first: it predicts the next character based solely on the current character using an embedding table.
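The idea can be sketched without PyTorch as a plain lookup table of logits. This is an illustration under stated assumptions, not the tutorial's code: the all-zero initialisation, the helper names, and the use of Python lists in place of an `nn.Embedding` table are all choices made here.

```python
import math
import random

vocab_size = 65   # vocabulary size reported in the summary

# One row of logits per current token; row i scores every possible next token.
# All-zero initialisation is an assumption made to keep the numbers simple.
logit_table = [[0.0] * vocab_size for _ in range(vocab_size)]


def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(current):
    """A bigram model looks only at the current token."""
    return softmax(logit_table[current])


def cross_entropy(current, target):
    """Negative log-probability assigned to the true next token."""
    return -math.log(next_token_probs(current)[target])


def generate(start, n_new, rng):
    """Sample n_new tokens one at a time, feeding each one back in."""
    out = [start]
    for _ in range(n_new):
        probs = next_token_probs(out[-1])
        out.append(rng.choices(range(vocab_size), weights=probs)[0])
    return out


loss = cross_entropy(18, 47)
print(round(loss, 3))   # equals ln(65) because the prediction is uniform
sample = generate(0, 10, random.Random(0))
print(sample)
```

An untrained table predicts every character with equal probability, so the loss starts at ln(65), roughly 4.17; training adjusts the rows so that likely successors get higher logits.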

[36:00]
Self-Attention Mechanism

Self-attention allows tokens to communicate with each other. It uses queries, keys, and values to compute weighted averages of past tokens.
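The following is a rough single-head sketch of that computation in plain Python, written here for illustration rather than taken from the video (which uses PyTorch tensors). The tiny sizes, random weights, and function names are assumptions.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (rows x cols) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(xs, Wq, Wk, Wv):
    """xs: list of T token vectors. Returns (outputs, attention weights)."""
    head_size = len(Wq)
    qs = [matvec(Wq, x) for x in xs]   # what each token is looking for
    ks = [matvec(Wk, x) for x in xs]   # what each token offers
    vs = [matvec(Wv, x) for x in xs]   # what each token communicates
    outputs, weights = [], []
    for t, q in enumerate(qs):
        # Score only positions 0..t, so future tokens are masked out.
        scores = [sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(head_size)
                  for s in range(t + 1)]
        w = softmax(scores)
        # Weighted average of the values of the current and past tokens.
        out = [sum(w[s] * vs[s][d] for s in range(t + 1))
               for d in range(head_size)]
        outputs.append(out)
        weights.append(w + [0.0] * (len(xs) - t - 1))
    return outputs, weights


rng = random.Random(0)
T, n_embd, head_size = 4, 6, 3   # tiny illustrative sizes


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


xs = rand_matrix(T, n_embd)
Wq, Wk, Wv = (rand_matrix(head_size, n_embd) for _ in range(3))
outputs, weights = causal_self_attention(xs, Wq, Wk, Wv)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is lower-triangular in effect: a token puts zero weight on later positions, and the first token can only attend to itself.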

[58:25]
Multi-Head Attention

Multiple self-attention heads run in parallel and their outputs are concatenated, allowing the model to attend to different types of information.

[85:00]
Feed-Forward Networks and Residual Connections

After self-attention, a feed-forward network (MLP) is applied per token. Residual connections and layer normalization help with training deep networks.

[99:34]
Scaling Up the Model

By increasing model size (embedding dimension 384, 6 heads, 6 layers, block size 256), the validation loss drops to 1.48, generating more coherent Shakespeare-like text.
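As a back-of-the-envelope check on these settings, the sketch below derives the per-head size and a rough parameter count. The `ModelConfig` class is hypothetical, and the estimate assumes the usual layout of four square attention projections plus an MLP with a 4x hidden layer, ignoring biases, layer norms, and embeddings.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Values reported in the summary for the scaled-up model.
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    block_size: int = 256
    vocab_size: int = 65

    @property
    def head_size(self):
        # Heads split the embedding evenly so concatenation restores n_embd.
        assert self.n_embd % self.n_head == 0
        return self.n_embd // self.n_head

    def approx_block_params(self):
        # Per block: 4*d^2 for attention (q, k, v, output projection)
        # plus 8*d^2 for an MLP with a 4x hidden layer (assumed layout).
        d = self.n_embd
        return self.n_layer * (4 * d * d + 8 * d * d)


cfg = ModelConfig()
print(cfg.head_size)               # 64
print(cfg.approx_block_params())   # 10616832
```

The estimate of about 10.6 million weights is in line with the roughly 10 million parameters quoted for the final model.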

[102:24]
Decoder-Only vs Encoder-Decoder

The implemented model is a decoder-only Transformer (like GPT), suitable for unconditional text generation. The original Transformer paper used an encoder-decoder for translation.

[108:55]
From Pre-training to ChatGPT

Pre-training trains a language model on internet text. Fine-tuning (e.g., with reinforcement learning from human feedback) aligns the model to be an assistant like ChatGPT.

This tutorial successfully builds a decoder-only Transformer from scratch, demonstrating the core components of GPT. The final model, trained on Tiny Shakespeare, generates plausible Shakespeare-like text, illustrating the power of the Transformer architecture.

Clickbait Check

95% Legit

"Title accurately describes the content: building GPT from scratch with code explanations."


Tutorial Checklist

1 07:53 Set up a Google Colab notebook and download the Tiny Shakespeare dataset.
2 08:37 Create a character-level tokenizer: build vocabulary of unique characters, create encoder/decoder mappings.
3 13:41 Split the dataset into training (90%) and validation (10%) sets.
4 14:28 Implement data batching: sample random chunks of size block_size from the training set, create input-target pairs.
5 22:19 Implement a bigram language model using nn.Embedding: map tokens to logits directly.
6 28:53 Add generation function: sample from the model iteratively to produce new text.
7 34:53 Train the bigram model using AdamW optimizer and cross-entropy loss.
8 42:13 Implement self-attention: compute queries, keys, values; apply masked softmax and weighted aggregation.
9 58:25 Implement multi-head attention: run multiple self-attention heads in parallel and concatenate outputs.
10 85:00 Add feed-forward network (MLP) with residual connections and layer normalization.
11 99:34 Scale up the model: increase embedding dimension, number of heads, layers, and block size; train on GPU.

Study Flashcards (10)

What does GPT stand for?

Difficulty: easy

Generative Pre-trained Transformer (spoken in the video as 'generatively pre-trained Transformer')

02:31

What is the name of the 2017 paper that introduced the Transformer architecture?

Difficulty: easy

Attention is All You Need

02:17

What is the vocabulary size of the character-level tokenizer used in the tutorial?

Difficulty: easy

65

09:04

What is the purpose of the 'block_size' hyperparameter?

Difficulty: medium

It defines the maximum context length for predictions.

15:04

How does a bigram language model predict the next token?

Difficulty: medium

It predicts based solely on the current token, using an embedding table.

24:12

What are the three vectors computed in self-attention?

Difficulty: easy

Query, key, and value.

64:00

Why is the attention score divided by the square root of the head size?

Difficulty: hard

To control variance and prevent softmax from becoming too peaky at initialization.

77:13
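A short derivation of why the square root appears, under the simplifying assumption (not stated in the summary) that the components of the query q and key k are independent with zero mean and unit variance, and writing d for the head size:

```latex
\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\!\Big(\sum_{i=1}^{d} q_i k_i\Big)
  = \sum_{i=1}^{d} \operatorname{Var}(q_i k_i)
  = d,
\qquad
\operatorname{Var}\!\Big(\frac{q \cdot k}{\sqrt{d}}\Big) = \frac{d}{d} = 1 .
```

Each product q_i k_i has variance 1 under that assumption, so the unscaled score has standard deviation sqrt(d). Dividing by sqrt(d) keeps the scores at unit scale, which stops softmax from collapsing onto a single position at initialization.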

What is the difference between self-attention and cross-attention?

Difficulty: hard

In self-attention, keys, queries, and values come from the same source. In cross-attention, queries come from one source and keys/values from another.

75:49

What is the purpose of residual connections in deep Transformers?

Difficulty: medium

They allow gradients to flow directly from the loss to the input, improving optimization of deep networks.

88:31

What is the approximate number of parameters in the final model trained in the tutorial?

Difficulty: medium

10 million

109:32

💡 Key Takeaways

📊

Transformer Origin

The 2017 paper 'Attention is All You Need' introduced the Transformer, which became the foundation for GPT.

02:17
🔧

Self-Attention Communication

Self-attention allows tokens to communicate with each other via queries, keys, and values, enabling context-aware predictions.

36:00
🔧

Multi-Head Attention

Multiple attention heads run in parallel, allowing the model to attend to different types of information simultaneously.

58:25
⚖️

Residual Connections for Deep Networks

Residual connections enable training of deep Transformers by providing a gradient superhighway.

88:31
💡

Pre-training vs Fine-tuning

Pre-training on internet text produces a document completer; fine-tuning aligns it to be an assistant like ChatGPT.

108:55

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

What is ChatGPT?

45s

Opens with a relatable demo of ChatGPT writing a haiku, instantly hooking viewers curious about AI.


The Transformer Paper That Changed AI

60s

Reveals the surprising origin of the Transformer architecture from a 'random machine translation paper' that took over AI.


Building GPT from Scratch in Code

60s

Promises to build a GPT-like model from scratch, appealing to developers who want to understand the magic under the hood.


The Math Trick Behind Self-Attention

60s

Explains the efficient matrix multiplication trick for averaging tokens, a key insight that makes Transformers work.


From Random to Shakespeare: Training a Language Model

60s

Shows the dramatic improvement from garbage output to recognizable text, demonstrating the power of training.


[00:00] hi everyone so by now you have probably

[00:02] heard of ChatGPT it has taken the world

[00:04] and AI community by storm and it is a

[00:07] system that allows you to interact with

[00:09] an AI and give it text based tasks so

[00:12] for example we can ask ChatGPT to write

[00:15] us a small haiku about how important it is

[00:16] that people understand AI and then they

[00:18] can use it to improve the world and make

[00:20] it more prosperous so when we run this

[00:23] AI knowledge brings prosperity for all

[00:25] to see embrace its

[00:27] power okay not bad and so you could see

[00:29] that ChatGPT went from left to right and

[00:32] generated all these words sort of

[00:35] sequentially now I asked it already the

[00:37] exact same prompt a little bit earlier

[00:39] and it generated a slightly different

[00:41] outcome AI's power to grow ignorance

[00:44] holds us back learn prosperity awaits

[00:47] so uh pretty good in both cases and

[00:49] slightly different so you can see that

[00:50] ChatGPT is a probabilistic system and

[00:52] for any one prompt it can give us

[00:54] multiple answers sort of uh replying to

[00:57] it now this is just one example of a

[00:59] prompt people have come up with many

[01:01] many examples and there are entire

[01:03] websites that index interactions with

[01:06] ChatGPT and so many of them are quite

[01:08] humorous explain HTML to me like I'm a

[01:10] dog uh write release notes for chess 2

[01:14] write a note about Elon Musk buying a

[01:16] Twitter and so on so as an example uh

[01:20] please write a breaking news article

[01:21] about a leaf falling from a

[01:23] tree uh in a shocking turn of events a

[01:26] leaf has fallen from a tree in the local

[01:28] park Witnesses report that the leaf

[01:30] which was previously attached to a

[01:31] branch of a tree detached itself and

[01:33] fell to the ground very dramatic so you

[01:36] can see that this is a pretty remarkable

[01:37] system and it is what we call a language

[01:40] model uh because it um it models the

[01:43] sequence of words or characters or

[01:46] tokens more generally and it knows how

[01:49] sort of words follow each other in

[01:50] English language and so from its

[01:52] perspective what it is doing is it is

[01:55] completing the sequence so I give it the

[01:57] start of a sequence and it completes the

[02:00] sequence with the outcome and so it's a

[02:02] language model in that sense now I would

[02:05] like to focus on the under the hood of

[02:07] um under the hood components of what

[02:09] makes ChatGPT work so what is the neural

[02:12] network under the hood that models the

[02:14] sequence of these words and that comes

[02:17] from this paper called attention is all

[02:19] you need in 2017 a landmark paper a

[02:23] landmark paper in AI that produced and

[02:25] proposed the Transformer

[02:27] architecture so GPT is uh short for

[02:31] generatively pre-trained

[02:33] Transformer so Transformer is the neural

[02:35] net that actually does all the heavy

[02:36] lifting under the hood it comes from

[02:39] this paper in 2017 now if you read this

[02:41] paper this uh reads like a pretty random

[02:44] machine translation paper and that's

[02:46] because I think the authors didn't fully

[02:47] anticipate the impact that the

[02:49] Transformer would have on the field and

[02:51] this architecture that they produced in

[02:52] the context of machine translation in

[02:54] their case actually ended up taking over

[02:57] uh the rest of AI in the next 5 years

[03:00] after and so this architecture with

[03:02] minor changes was copy pasted into a

[03:05] huge amount of applications in AI in

[03:07] more recent years and that includes at

[03:10] the core of ChatGPT now we are not

[03:13] going to what I'd like to do now is I'd

[03:15] like to build out something like ChatGPT

[03:17] but uh we're not going to be able to

[03:19] of course reproduce ChatGPT this is a

[03:21] very serious production grade system it

[03:23] is trained on uh a good chunk of

[03:26] internet and then there's a lot of uh

[03:29] pre-training and fine-tuning stages to

[03:31] it and so it's very complicated what I'd

[03:33] like to focus on is just to train a

[03:36] Transformer based language model and in

[03:38] our case it's going to be a character

[03:40] level language model I still think that

[03:43] is uh very educational with respect to

[03:45] how these systems work so I don't want

[03:47] to train on the chunk of Internet we

[03:48] need a smaller data set in this case I

[03:51] propose that we work with uh my favorite

[03:53] toy data set it's called tiny

[03:55] Shakespeare and um what it is is

[03:57] basically it's a concatenation of all of

[03:59] the works of Shakespeare in my

[04:00] understanding and so this is all of

[04:02] Shakespeare in a single file uh this

[04:05] file is about 1 megabyte and it's just all

[04:07] of

[04:08] Shakespeare and what we are going to do

[04:10] now is we're going to basically model

[04:12] how these characters uh follow each

[04:14] other so for example given a chunk of

[04:16] these characters like this uh given some

[04:19] context of characters in the past the

[04:22] Transformer neural network will look at

[04:24] the characters that I've highlighted and

[04:26] is going to predict that g is likely to

[04:28] come next in the sequence and it's going

[04:30] to do that because we're going to train

[04:31] that Transformer on Shakespeare and it's

[04:34] just going to try to produce uh

[04:36] character sequences that look like this

[04:39] and in that process is going to model

[04:40] all the patterns inside this data so

[04:43] once we've trained the system I'd just

[04:45] like to give you a preview we can

[04:47] generate infinite Shakespeare and of

[04:49] course it's a fake thing that looks kind

[04:51] of like

[04:53] Shakespeare

[04:55] um apologies for there's some Jank that

[04:59] I'm not able to resolve in in here but

[05:02] um you can see how this is going

[05:05] character by character and it's kind of

[05:07] like predicting Shakespeare like

[05:09] language so verily my Lord the sites

[05:12] have left the again the king coming with

[05:15] my curses with precious pale and then

[05:19] tranos say something else Etc and this

[05:21] is just coming out of the Transformer in

[05:23] a very similar manner as it would come

[05:25] out in ChatGPT in our case character by

[05:27] character in ChatGPT uh it's coming out

[05:31] on the token by token level and tokens

[05:33] are these sort of like little subword

[05:35] pieces so they're not Word level they're

[05:36] kind of like word chunk

[05:38] level um and now I've already written

[05:43] this entire code uh to train these

[05:45] Transformers um and it is in a GitHub

[05:48] repository that you can find and it's

[05:50] called nanoGPT

[05:51] so nanoGPT is a repository that

[05:54] you can find in my GitHub and it's a

[05:56] repository for training Transformers um

[05:59] on any given text and what I think is

[06:02] interesting about it because there's

[06:03] many ways to train Transformers but this

[06:05] is a very simple implementation so it's

[06:06] just two files of 300 lines of code each

[06:10] one file defines the GPT model the

[06:12] Transformer and one file trains it on

[06:14] some given Text data set and here I'm

[06:17] showing that if you train it on the

[06:18] OpenWebText dataset which is a fairly

[06:20] large dataset of web pages then I

[06:22] reproduce the performance of

[06:25] GPT-2 so GPT-2 is an early version of OpenAI's

[06:29] GPT uh from 2017 [GPT-2 was released in 2019] if I recall

[06:32] correctly and I've only so far

[06:34] reproduced the smallest 124 million

[06:36] parameter model uh but basically this is

[06:38] just proving that the codebase is

[06:39] correctly arranged and I'm able to load

[06:42] the uh neural network weights that OpenAI

[06:45] has released later so you can take a

[06:48] look at the finished code here in nanoGPT

[06:50] but what I would like to do in this

[06:51] lecture is I would like to basically uh

[06:55] write this repository from scratch so

[06:57] we're going to begin with an empty file

[06:59] and we're we're going to define a

[07:00] Transformer piece by piece we're going

[07:03] to train it on the tiny Shakespeare data

[07:05] set and we'll see how we can then uh

[07:08] generate infinite Shakespeare and of

[07:10] course this can copy paste to any

[07:12] arbitrary Text data set uh that you like

[07:14] uh but my goal really here is to just

[07:16] make you understand and appreciate uh

[07:18] how under the hood chat GPT works and um

[07:22] really all that's required is a

[07:24] Proficiency in Python and uh some basic

[07:27] understanding of um calculus and

[07:29] statistics

[07:30] and it would help if you also see my

[07:32] previous videos on the same YouTube

[07:34] channel in particular my makemore

[07:35] series where I um Define smaller and

[07:40] simpler neural network language models

[07:42] uh so multi-layer perceptrons and so on it

[07:45] really introduces the language modeling

[07:46] framework and then uh here in this video

[07:49] we're going to focus on the Transformer

[07:50] neural network itself okay so I created

[07:53] a new Google Colab uh Jupyter notebook here

[07:57] and this will allow me to later easily

[07:58] share this code that we're going to

[08:00] develop together uh with you so you can

[08:01] follow along so this will be in a video

[08:03] description uh later now here I've just

[08:07] done some preliminaries I downloaded the

[08:09] data set the tiny Shakespeare data set

[08:10] at this URL and you can see that it's

[08:12] about a 1 Megabyte file then here I open

[08:15] the input.txt file and just read in all

[08:17] the text as a string and we see that

[08:20] we are working with 1 million characters

[08:22] roughly and the first 1,000 characters

[08:24] if we just print them out are basically

[08:26] what you would expect this is the first

[08:28] 1,000 characters of the tiny Shakespeare

[08:30] data set roughly up to here so so far so

[08:34] good next we're going to take this text

[08:37] and the text is a sequence of characters

[08:39] in Python so when I call the set

[08:41] Constructor on it I'm just going to get

[08:44] the set of all the characters that occur

[08:46] in this text and then I call list on

[08:49] that to create a list of those

[08:51] characters instead of just a set so that

[08:53] I have an ordering an arbitrary ordering

[08:56] and then I sort that so basically we get

[08:59] just all the characters that occur in

[09:00] the entire data set and they're sorted

[09:02] now the number of them is going to be

[09:04] our vocabulary size these are the

[09:06] possible elements of our sequences and

[09:09] we see that when I print here the

[09:11] characters there's 65 of them in total

[09:14] there's a space character and then all

[09:16] kinds of special characters and then

[09:19] capitals and lowercase letters so that's

[09:21] our vocabulary and that's the sort of

[09:23] like possible uh characters that the

[09:25] model can see or emit okay so next we

[09:29] would like to develop some strategy

[09:31] to tokenize the input text now when

[09:35] people say tokenize they mean convert

[09:36] the raw text as a string to some

[09:39] sequence of integers according to some

[09:41] uh according to some vocabulary

[09:43] of possible elements so as an example

[09:46] here we are going to be building a

[09:48] character level language model so we're

[09:49] simply going to be translating

[09:50] individual characters into integers so

[09:53] let me show you uh a chunk of code that

[09:55] sort of does that for us so we're

[09:57] building both the encoder and the

[09:58] decoder

[10:00] and let me just talk through what's

[10:01] happening

[10:02] here when we encode an arbitrary text

[10:05] like hi there we're going to receive a

[10:08] list of integers that represents that

[10:10] string so for example 46 47 Etc and then

[10:14] we also have the reverse mapping so we

[10:17] can take this list and decode it to get

[10:20] back the exact same string so it's

[10:22] really just like a translation to

[10:24] integers and back for arbitrary string

[10:26] and for us it is done on a character

[10:28] level

[10:30] now the way this was achieved is we just

[10:31] iterate over all the characters here and

[10:34] create a lookup table from the character

[10:35] to the integer and vice versa and then

[10:38] to encode some string we simply

[10:40] translate all the characters

[10:41] individually and to decode it back we

[10:44] use the reverse mapping and concatenate

[10:46] all of it now this is only one of many

[10:49] possible encodings or many possible sort

[10:51] of tokenizers and it's a very simple one

[10:54] but there's many other schemas that

[10:55] people have come up with in practice so

[10:57] for example Google uses SentencePiece

[10:59] uh so SentencePiece will also

[11:02] encode text into um integers but in a

[11:05] different schema and using a different

[11:08] vocabulary and SentencePiece is a

[11:10] subword uh sort of tokenizer and what

[11:13] that means is that um you're not

[11:15] encoding entire words but you're not

[11:17] also encoding individual characters it's

[11:19] it's a subword unit level and that's

[11:22] usually what's adopted in practice for

[11:24] example also OpenAI has this library

[11:26] called tiktoken that uses a byte pair

[11:28] encoding

[11:29] tokenizer um and that's what GPT uses

[11:33] and you can also just encode words into

[11:35] like hello world into a list of integers

[11:38] so as an example I'm using the tiktoken

[11:40] library here I'm getting the encoding

[11:43] for GPT-2 or that was used for GPT-2

[11:46] instead of just having 65 possible

[11:48] characters or tokens they have 50,000

[11:51] tokens and so when they encode the exact

[11:54] same string hi there we only get a

[11:57] list of three integers but those

[11:59] integers are not between 0 and 64 they

[12:01] are between 0 and

[12:05] 50,256 so basically you can trade off the

[12:09] code book size and the sequence lengths

[12:12] so you can have very long sequences of

[12:13] integers with very small vocabularies or

[12:16] we can have short um sequences of

[12:20] integers with very large vocabularies

[12:23] and so typically people use in practice

[12:25] these subword encodings but I'd like to

[12:28] keep our tokenizer very simple so we're

[12:30] using character level tokenizer and that

[12:33] means that we have very small code books

[12:35] we have very simple encode and decode

[12:37] functions uh but we do get very long

[12:40] sequences as a result but that's the

[12:42] level at which we're going to stick with

[12:43] this lecture because it's the simplest

[12:45] thing okay so now that we have an

[12:46] encoder and a decoder effectively a

[12:49] tokenizer we can tokenize the entire

[12:51] training set of Shakespeare so here's a

[12:53] chunk of code that does that and I'm

[12:55] going to start to use the PyTorch

[12:56] library and specifically the torch.tensor

[12:58] from the PyTorch library so we're

[13:01] going to take all of the text in tiny

[13:03] Shakespeare encode it and then wrap it

[13:05] into a torch.tensor to get the data

[13:08] tensor so here's what the data tensor

[13:10] looks like when I look at just the first

[13:12] 1,000 characters or the 1,000 elements

[13:14] of it so we see that we have a massive

[13:16] sequence of integers and this sequence

[13:18] of integers here is basically an

[13:20] identical translation of the first

[13:22] 1,000 characters

[13:24] here so I believe for example that zero

[13:27] is a new line character and maybe one

[13:29] is a space not 100% sure but from

[13:32] now on the entire data set of text is

[13:34] re-represented as just it's just

[13:35] stretched out as a single very large uh

[13:38] sequence of

[13:39] integers let me do one more thing before

[13:41] we move on here I'd like to separate out

[13:43] our data set into a train and a

[13:45] validation split so in particular we're

[13:48] going to take the first 90% of the data

[13:51] set and consider that to be the training

[13:52] data for the Transformer and we're going

[13:54] to withhold the last 10% at the end of

[13:56] it to be the validation data and this

[13:59] will help us understand to what extent

[14:01] our model is overfitting so we're going

[14:03] to basically hide and keep the

[14:04] validation data on the side because we

[14:06] don't want just a perfect memorization

[14:08] of this exact Shakespeare we want a

[14:11] neural network that sort of creates

[14:12] Shakespeare like uh text and so it

[14:15] should be fairly likely for it to

[14:17] produce the actual like stowed away uh

[14:21] true Shakespeare text um and so we're

[14:24] going to use this to uh get a sense of

[14:26] the overfitting okay so now we would

[14:28] like to start plugging these text

[14:30] sequences or integer sequences into the

[14:32] Transformer so that it can train and

[14:34] learn those patterns now the important

[14:36] thing to realize is we're never going to

[14:38] actually feed entire text into a

[14:40] Transformer all at once that would be

[14:42] computationally very expensive and

[14:44] prohibitive so when we actually train a

[14:46] Transformer on a lot of these data sets

[14:48] we only work with chunks of the data set

[14:50] and when we train the Transformer we

[14:52] basically sample random little chunks

[14:53] out of the training set and train on

[14:55] just chunks at a time and these chunks

[14:58] have basically some kind of a length and

[15:01] some maximum length now the maximum

[15:04] length typically at least in the code I

[15:06] usually write is called block size you

[15:08] can you can uh find it under different

[15:10] names like context length or something

[15:12] like that let's start with the block

[15:14] size of just eight and let me look at

[15:16] the first train data characters the

[15:18] first block size plus one characters

[15:20] I'll explain why plus one in a

[15:22] second so this is the first nine

[15:24] characters in the sequence in the

[15:27] training set now what I'd like to point

[15:30] out is that when you sample a chunk of

[15:31] data like this so say the these nine

[15:34] characters out of the training set this

[15:36] actually has multiple examples packed

[15:38] into it and uh that's because all of

[15:41] these characters follow each other and

[15:43] so what this thing is going to say when

[15:47] we plug it into a Transformer is we're

[15:49] going to actually simultaneously train

[15:50] it to make prediction at every one of

[15:52] these

[15:53] positions now in the in a chunk of nine

[15:56] characters there's actually eight indiv

[15:58] ual examples packed in there so there's

[16:01] the example that when 18 when in the

[16:04] context of 18 47 likely comes next in a

[16:08] context of 18 and 47 56 comes next in a

[16:12] context of 18 47 56 57 can come next and

[16:16] so on so that's the eight individual

[16:18] examples let me actually spell it out

[16:20] with

[16:21] code so here's a chunk of code to

[16:24] illustrate X are the inputs to the

[16:26] Transformer it will just be the first

[16:28] block size characters y will be the uh

[16:32] next block size characters so it's

[16:34] offset by one and that's because y are

[16:37] the targets for each position in the

[16:40] input and then here I'm iterating over

[16:42] all the block size of eight and the

[16:45] context is always all the characters in

[16:47] x uh up to T and including T and the

[16:51] target is always the t-th character but

[16:53] in the targets array y so let me just

[16:56] run

[16:57] this and basically it spells out what I

[16:59] said in words uh these are the eight

[17:02] examples hidden in a chunk of nine

[17:04] characters that we uh sampled from the

[17:08] training set I want to mention one more

[17:11] thing we train on all the eight examples

[17:14] here with context between one all the

[17:16] way up to context of block size and we

[17:19] train on that not just for computational

[17:20] reasons because we happen to have the

[17:22] sequence already or something like that

[17:23] it's not just done for efficiency it's

[17:26] also done um to make the Transformer

[17:28] Network be used to seeing contexts all

[17:32] the way from as little as one all the

[17:33] way to block size and we'd like the

[17:36] Transformer to be used to seeing

[17:38] everything in between and that's going

[17:39] to be useful later during inference

[17:41] because while we're sampling we can

[17:43] start the sampling generation with as

[17:45] little as one character of context and

[17:47] the Transformer knows how to predict the

[17:49] next character with all the way up to

[17:51] just context of one and so then it can

[17:53] predict everything up to block size and

[17:55] after block size we have to start

[17:56] truncating because the Transformer will

[17:58] will never um receive more than block

[18:01] size inputs when it's predicting the

[18:03] next

[18:03] character Okay so we've looked at the

[18:06] time dimension of the tensors that are

[18:07] going to be feeding into the Transformer

[18:09] there's one more Dimension to care about

[18:11] and that is the batch Dimension and so

[18:13] as we're sampling these chunks of text

[18:15] we're going to be actually every time

[18:17] we're going to feed them into a

[18:18] Transformer we're going to have many

[18:20] batches of multiple chunks of text that

[18:22] are all like stacked up in a single

[18:23] tensor and that's just done for

[18:25] efficiency just so that we can keep the

[18:27] gpus busy uh because they are very good

[18:29] at parallel processing of um of data and

[18:33] so we just want to process multiple

[18:35] chunks all at the same time but those

[18:37] chunks are processed completely

[18:38] independently they don't talk to each

[18:39] other and so on so let me basically just

[18:42] generalize this and introduce a batch

[18:44] Dimension here's a chunk of

[18:46] code let me just run it and then I'm

[18:48] going to explain what it

[18:50] does so here because we're going to

[18:52] start sampling random locations in the

[18:54] data set to pull chunks from I am

[18:57] setting the seed so that um in the

[19:00] random number generator so that the

[19:01] numbers I see here are going to be the

[19:02] same numbers you see later if you try to

[19:04] reproduce this now the batch size here

[19:07] is how many independent sequences we are

[19:09] processing every forward backward pass

[19:11] of the

[19:12] Transformer the block size as I

[19:14] explained is the maximum context length

[19:16] to make those predictions so let's say batch

[19:19] size four block size eight and then

[19:21] here's how we get batch for any

[19:23] arbitrary split if the split is a

[19:25] training split then we're going to look

[19:26] at train_data otherwise at val_data

[19:30] that gives us the data array and then

[19:33] when I generate random positions to grab

[19:35] a chunk out of I actually grab I

[19:38] actually generate batch size number of

[19:41] random offsets so because this is four

[19:44] ix is going to be a uh four

[19:47] numbers that are randomly generated

[19:49] between zero and Len of data minus block

[19:51] size so it's just random offsets into

[19:53] the training

[19:54] set. And then the x's, as I explained, are the

[19:58] first block_size characters

[20:00] starting at i; the y's are the offset by

[20:05] one of that, so just add plus one. And

[20:08] then we're going to get those chunks for

[20:10] every one of the integers i in ix and use

[20:14] torch.stack to take all those

[20:17] one-dimensional tensors, as we saw here,

[20:20] and we're going to stack them up as

[20:24] rows, and so they all become a row in a

[20:27] 4x8 tensor
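
The batching steps described above can be sketched as follows. This is a minimal sketch, not the exact notebook cell: the `data` tensor is a random stand-in for the encoded Tiny Shakespeare text, and the 90/10 split into `train_data` and `val_data` is an assumption made so the block runs on its own.

```python
import torch

torch.manual_seed(1337)  # fixed seed so the sampled offsets are reproducible

batch_size = 4  # how many independent sequences are processed in parallel
block_size = 8  # maximum context length for predictions

# Stand-in for the encoded dataset (assumption): 1,000 random token ids in [0, 65).
data = torch.randint(0, 65, (1000,))
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    # Pick the split, then draw batch_size random start offsets into it.
    source = train_data if split == "train" else val_data
    ix = torch.randint(len(source) - block_size, (batch_size,))
    # Inputs are block_size tokens starting at each offset; targets are shifted by one.
    x = torch.stack([source[i:i + block_size] for i in ix])
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch("train")
print(xb.shape, yb.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

Because the targets are the inputs shifted by one position, `xb[:, 1:]` and `yb[:, :-1]` hold the same values.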

[20:29] So here's where I'm printing them. When I

[20:32] sample a batch xb and yb, the input to

[20:35] the Transformer now, the input x, is

[20:39] the 4x8 tensor, four rows of eight

[20:44] columns and each one of these is a chunk

[20:47] of the training

[20:48] set and then the targets here are in the

[20:52] associated array y, and they will come in

[20:54] to the Transformer all the way at the

[20:55] end, to create the loss function,

[20:59] so they will give us the correct

[21:01] answer for every single position inside

[21:03] X and then these are the four

[21:06] independent

[21:07] rows so spelled out as we did

[21:11] before uh this 4x8 array contains a

[21:14] total of 32 examples and they're

[21:17] completely independent as far as the

[21:19] Transformer is

[21:20] concerned. So when the input is 24, the

[21:25] target is 43, or rather 43 here in the y

[21:28] array;

[21:29] when the input is 24, 43, the target is

[21:31] 58; when the input is 24, 43, 58, the

[21:34] target is 5, etc. Or, like, when it is 52,

[21:38] 58, 1, the target is

[21:40] 58 right so you can sort of see this

[21:43] spelled out these are the 32 independent

[21:45] examples packed in to a single batch of

[21:48] the input X and then the desired targets

[21:51] are in y. And so now this integer tensor

[21:57] x is going to feed into the

[22:00] Transformer, and that Transformer is

[22:02] going to simultaneously process all

[22:04] these examples and then look up the

[22:06] correct integers to predict in every

[22:08] one of these positions in the tensor y
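
To make the "one batch packs many examples" point concrete, here is a small sketch that spells out every (context, target) pair. It uses a hypothetical 2x4 batch rather than the 4x8 one on screen; the first row reuses the values read out above (24, 43, 58, 5), and the remaining numbers are made up for illustration.

```python
import torch

# Hypothetical 2x4 batch (the video uses 4x8); yb is xb shifted left by one token.
xb = torch.tensor([[24, 43, 58, 5], [44, 53, 56, 1]])
yb = torch.tensor([[43, 58, 5, 57], [53, 56, 1, 58]])

examples = []
for b in range(xb.shape[0]):        # batch dimension
    for t in range(xb.shape[1]):    # time dimension
        context = xb[b, :t + 1].tolist()   # everything up to and including position t
        target = yb[b, t].item()           # the token that should come next
        examples.append((context, target))
        print(f"when input is {context} the target is {target}")
```

A B x T batch therefore yields B * T training examples, which is 8 here and 32 for the 4x8 batch in the video.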

[22:11] okay so now that we have our batch of

[22:13] input that we'd like to feed into a

[22:15] Transformer let's start basically

[22:16] feeding this into neural networks now

[22:19] we're going to start off with the

[22:20] simplest possible neural network which

[22:22] in the case of language modeling in my

[22:23] opinion is the bigram language model. And

[22:25] we've covered the bigram language model

[22:26] in my makemore series in a lot of depth,

[22:29] and so here I'm going to sort of go

[22:31] faster, and let's just implement a PyTorch

[22:33] module directly that implements the bigram

[22:36] language

[22:36] model. So I'm importing the PyTorch nn

[22:41] module, for

[22:43] reproducibility, and then here I'm

[22:44] constructing a BigramLanguageModel,

[22:46] which is a subclass of nn.Module,

[22:48] and then I'm calling it and I'm

[22:51] passing it the inputs and the targets

[22:53] and I'm just printing. Now, when the

[22:55] inputs and targets come here, you see that

[22:57] I'm just taking the index, the inputs

[23:00] x here, which I rename to idx, and I'm

[23:03] just passing them into this token

[23:04] embedding table. So what's going on here is

[23:07] that here in the Constructor we are

[23:09] creating a token embedding table and it

[23:12] is of size vocab_size by vocab_size,

[23:15] and we're using nn.Embedding, which

[23:18] is a very thin wrapper around basically

[23:20] a tensor of shape vocab_size by vocab_size.

[23:23] And what's happening here is that

[23:25] when we pass idx here every single

[23:28] integer in our input is going to refer

[23:30] to this embedding table and it's going

[23:32] to pluck out a row of that embedding

[23:34] table corresponding to its index so 24

[23:37] here will go into the embedding table

[23:39] and will pluck out the 24th row, and

[23:42] then 43 will go here and pluck out the

[23:44] 43rd row, etc. And then PyTorch is going to

[23:47] arrange all of this into a batch by time

[23:50] by channel tensor. In this case batch

[23:53] is 4, time is 8, and C, which is the

[23:57] channels, is vocab_size, or 65. And so

[24:01] we're just going to pluck out all those

[24:02] rows, arrange them in a B by T by C, and

[24:05] now we're going to interpret this as the

[24:07] logits which are basically the scores

[24:10] for the next character in the sequence
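
The row lookup just described can be checked directly. This is a small sketch with random token ids standing in for a real batch; the variable names follow the ones used in the video.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
vocab_size = 65

# Each token id indexes a row of a (vocab_size, vocab_size) table; that row is
# read directly as the logits (scores) for the next character.
token_embedding_table = nn.Embedding(vocab_size, vocab_size)

idx = torch.randint(0, vocab_size, (4, 8))   # (B, T) batch of token ids
logits = token_embedding_table(idx)          # (B, T, C) with C = vocab_size
print(logits.shape)  # torch.Size([4, 8, 65])

# The logits at any position are exactly the table row for the token at that position.
row = token_embedding_table.weight[idx[0, 0]]
print(torch.equal(logits[0, 0], row))  # True
```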

[24:12] and so what's happening here is we are

[24:14] predicting what comes next based on just

[24:17] the individual identity of a single

[24:19] token and you can do that because um I

[24:22] mean currently the tokens are not

[24:23] talking to each other and they're not

[24:25] seeing any context except for they're

[24:26] just seeing themselves. So I'm a

[24:29] token number five, and then I can

[24:32] actually make pretty decent predictions

[24:33] about what comes next just by knowing

[24:35] that I'm token five, because some

[24:37] characters, you know, follow other

[24:39] characters in typical scenarios. So we

[24:42] saw a lot of this in a lot more depth in

[24:44] the makemore series. And here, if I just

[24:46] run this, then we currently get the

[24:49] predictions, the scores, the logits, for

[24:53] every one of the 4x8 positions now that

[24:55] we've made predictions about what comes

[24:57] next we'd like to evaluate the loss

[24:58] function. And so in the makemore series we

[25:00] saw that a good way to measure a loss, or

[25:03] like a quality of the predictions, is to

[25:05] use the negative log likelihood loss,

[25:07] which is also implemented in PyTorch

[25:09] under the name cross entropy. So what we'd

[25:12] like to do here is: loss is the cross

[25:15] entropy on the predictions and the

[25:17] targets and so this measures the quality

[25:20] of the logits with respect to the

[25:21] Targets in other words we have the

[25:24] identity of the next character so how

[25:26] well are we predicting the next

[25:28] character based on the logits? And

[25:30] intuitively, the correct

[25:33] dimension of logits, depending on

[25:36] whatever the target is, should have a

[25:38] very high number, and all the other

[25:39] dimensions should be a very low number,

[25:41] right? Now the issue is that this won't

[25:44] actually this is what we want we want to

[25:46] basically output the logits and the

[25:50] loss this is what we want but

[25:52] unfortunately uh this won't actually run

[25:55] we get an error message but intuitively

[25:57] we want to uh measure this now when we

[26:01] go to the PyTorch cross entropy

[26:04] documentation here, we're trying to

[26:08] call the cross entropy in its functional

[26:10] form, so that means we don't have to

[26:11] create like a module for it. But here,

[26:14] when we go to the documentation, you have

[26:16] to look into the details of how PyTorch

[26:18] expects these inputs, and basically the

[26:20] issue here is PyTorch expects, if you have

[26:24] multi-dimensional input, which we do

[26:25] because we have a B by T by C tensor, then

[26:28] it actually really wants the channels to

[26:31] be the second dimension here. So

[26:35] basically it wants a B by C

[26:38] by T instead of a B by T by C, and so it's

[26:42] just the details of how PyTorch treats

[26:45] um these kinds of inputs and so we don't

[26:49] actually want to deal with that so what

[26:51] we're going to do instead is we need to

[26:52] basically reshape our logits so here's

[26:54] what I like to do: I like to

[26:56] basically give names to the dimensions,

[26:58] so logits.shape is B by T by C, and unpack

[27:01] those numbers. And then let's say that

[27:04] logits equals logits.view, and we want it

[27:07] to be B * T by C, so just a

[27:11] two-dimensional

[27:12] array, right? So we're going to take all

[27:15] of these

[27:18] positions here, and we're going to

[27:20] stretch them out in a one-dimensional

[27:22] sequence and preserve the channel

[27:25] dimension as the second

[27:26] dimension. So we're just kind of like

[27:28] stretching out the array so it's

[27:29] two-dimensional, and in that case it's going

[27:31] to better conform to what PyTorch

[27:33] sort of expects in its dimensions. Now we

[27:36] have to do the same to targets because

[27:38] currently targets are of shape B by T,

[27:44] and we want it to be just B * T, so

[27:47] one-dimensional. Now alternatively you

[27:49] could always still just do -1,

[27:51] because PyTorch will guess what this

[27:53] should be if you want to lay it out,

[27:55] but let me just be explicit and say B *

[27:57] T. Once we've reshaped this, it will match

[28:00] the cross entropy case, and then we

[28:03] should be able to evaluate our

[28:06] loss. Okay, so that runs now, and we can do

[28:10] loss and So currently we see that the

[28:12] loss is

[28:13] 4.87. Now, because we have 65

[28:17] possible vocabulary elements, we can

[28:19] actually guess at what the loss should

[28:20] be, and in

[28:22] particular, we covered negative log

[28:24] likelihood in a lot of detail: we are

[28:26] expecting log, or ln, of 1 over 65,

[28:32] and negative of that. So we're expecting

[28:34] the loss to be about 4.17, but we're

[28:37] getting 4.87, and so that's telling us

[28:40] that the initial predictions are not

[28:42] super diffuse; they've got a little bit

[28:43] of entropy, and so we're guessing wrong.

[28:47] So, yes, but actually we

[28:50] are able to evaluate the loss. Okay, so

[28:53] now that we can evaluate the quality of

[28:54] the model on some data we'd like to also

[28:57] be able to generate from the model so

[28:59] let's do the generation now I'm going to

[29:01] go again a little bit faster here

[29:03] because I covered all this already in

[29:04] previous

[29:05] videos
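
Before moving on to generation, the loss evaluation covered above can be sketched in a few lines. The logits and targets here are random stand-ins (an assumption, so the printed loss will not be the 4.87 seen in the video). The block shows the `view` reshaping, an equivalent variant that moves the channels to the second dimension instead, and the uniform-guess baseline of -ln(1/65).

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 65
logits = torch.randn(B, T, C)            # stand-in for the model output
targets = torch.randint(0, C, (B, T))    # stand-in for the target token ids

# F.cross_entropy accepts (N, C) logits with (N,) targets, so flatten B and T together.
loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Equivalent alternative: keep three dimensions but put channels second, giving (B, C, T).
loss_alt = F.cross_entropy(logits.transpose(1, 2), targets)

# Baseline for a uniform guess over 65 characters: -ln(1/65).
uniform_loss = -math.log(1 / C)
print(loss.item(), loss_alt.item(), round(uniform_loss, 2))
```

Both calls average over the same B * T positions, so the two losses agree up to floating-point error, and the baseline rounds to 4.17.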

[29:07] so here's a generate function for the

[29:11] model so we take some uh we take the the

[29:15] same kind of input idx here and

[29:18] basically this is the current uh context

[29:22] of some characters in a batch in some

[29:24] batch, so it's also B by T. And the job of

[29:28] generate is to basically take this B by T

[29:30] and extend it to be B by T+1, T+2,

[29:32] T+3, and so it just basically

[29:34] continues the generation in all the

[29:36] batch dimensions, in the time dimension.

[29:39] So that's its job, and it will do that

[29:41] for max_new_tokens. So you can see here

[29:43] on the bottom, there's going to be some

[29:45] stuff here, but on the bottom whatever is

[29:47] predicted is concatenated on top of the

[29:50] previous idx along the first dimension,

[29:53] which is the time dimension, to create a

[29:54] B by T+1. So that becomes a new idx. So

[29:58] the job of generate is to take a B by T

[30:00] and make it a B by T+1, T+2,

[30:02] T+3, as many as we want, max_new_tokens.

[30:05] so this is the generation from the model

[30:08] now inside the generation what what are

[30:10] we doing we're taking the current

[30:11] indices we're getting the predictions so

[30:15] those are in the logits, and

[30:18] then the loss here is going to be

[30:19] ignored because um we're not we're not

[30:21] using that and we have no targets that

[30:23] are sort of ground truth targets that

[30:25] we're going to be comparing with

[30:28] then once we get the logits we are only

[30:30] focusing on the last step so instead of

[30:33] a b by T by C we're going to pluck out

[30:36] the -1, the last element in the

[30:38] time Dimension because those are the

[30:40] predictions for what comes next so that

[30:42] gives us the logits which we then

[30:44] convert to probabilities via softmax and

[30:47] then we use torch.multinomial to sample

[30:49] from those probabilities, and we ask

[30:51] PyTorch to give us one sample, and so

[30:54] idx_next will become a B by 1, because in

[30:57] each one of the batch dimensions

[31:00] we're going to have a single prediction

[31:01] for what comes next. So this num_samples

[31:03] equals one will make this be a

[31:06] one. And then we're going to take those

[31:08] integers that come from the sampling

[31:10] process according to the probability

[31:11] distribution given here and those

[31:13] integers just get concatenated on top of

[31:15] the current sort of like running stream

[31:17] of integers, and this gives us a B by T +

[31:20] 1, and then we can return that. Now one

[31:24] thing here is you see how I'm calling

[31:26] self(idx), which will end up going to

[31:29] the forward function. I'm not providing

[31:31] any targets, so currently this would give

[31:33] an error because targets is

[31:36] sort of like not given. So targets has to

[31:39] be optional, so targets is None by

[31:41] default, and then if targets is None, then

[31:44] there's no loss to create, so it's just

[31:47] loss is None, but else all of this

[31:50] happens and we can create a loss. So this

[31:53] will make it so, if we have the

[31:56] targets, we provide them and get a loss;

[31:57] if we have no targets, we'll just

[31:59] get the

[32:00] logits. So this here will generate from

[32:02] the model um and let's take that for a

[32:06] ride
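
Putting the pieces described so far together, a bigram model with an optional `targets` argument and a `generate` method might look like the sketch below. It follows the steps narrated above, but it is a reconstruction rather than the exact notebook cell; in particular, this version returns the logits in their original (B, T, C) shape, which is a choice made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
vocab_size = 65

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, C)
        if targets is None:
            loss = None  # nothing to compare against during generation
        else:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of token ids holding the current context.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)                 # the loss is ignored here
            logits = logits[:, -1, :]             # keep only the last time step: (B, C)
            probs = F.softmax(logits, dim=-1)     # scores -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)             # (B, T + 1)
        return idx

model = BigramLanguageModel(vocab_size)
out = model.generate(torch.zeros((2, 1), dtype=torch.long), max_new_tokens=5)
print(out.shape)  # torch.Size([2, 6])
```

Each pass through the loop grows the time dimension by one, so a (2, 1) context with five new tokens comes back as (2, 6).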

[32:08] now oops so I have another code chunk

[32:11] here which will generate for the model

[32:13] from the model and okay this is kind of

[32:15] crazy so maybe let me let me break this

[32:18] down so these are the idx

[32:23] right I'm creating a batch will be just

[32:26] one time will be just one so I'm

[32:30] creating a little one by one tensor and

[32:32] it's holding a zero, and the dtype, the

[32:35] data type, is integer. So zero is going

[32:38] to be how we kick off the generation, and

[32:40] remember that zero is the element

[32:44] standing for a newline character, so

[32:45] it's kind of like a reasonable thing to

[32:47] to feed in as the very first character

[32:49] in a sequence to be the new

[32:51] line um so it's going to be idx which

[32:54] we're going to feed in here then we're

[32:56] going to ask for 100 tokens

[32:58] and then .generate will continue that.

[33:01] Now because generate works on the

[33:05] level of batches, we then have to

[33:07] index into the zeroth row to basically

[33:09] pluck out the single batch dimension

[33:13] that exists, and then that gives us the

[33:18] time steps, just a one-dimensional array

[33:20] of all the indices, which we will convert

[33:23] to a simple Python list from a PyTorch

[33:26] tensor, so that that can feed into our

[33:28] decode function and convert those

[33:32] integers into text so let me bring this

[33:34] back and we're generating 100 tokens

[33:37] let's

[33:37] run and uh here's the generation that we

[33:40] achieved so obviously it's garbage and

[33:43] the reason it's garbage is because this

[33:44] is a totally random model so next up

[33:47] we're going to want to train this model
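
The unpacking steps around the generate call (start from a 1x1 tensor holding 0, index row 0, convert to a list, decode) can be sketched without a model. The five-character vocabulary and the "generated" ids below are hypothetical stand-ins; only the shape handling mirrors what is described above.

```python
import torch

# Hypothetical tiny vocabulary; index 0 is the newline character, as in the video.
chars = ["\n", " ", "a", "b", "c"]
itos = {i: ch for i, ch in enumerate(chars)}

def decode(ids):
    # Map a list of integer ids back to a string.
    return "".join(itos[i] for i in ids)

# The starting context: batch of 1, time of 1, holding token 0.
start = torch.zeros((1, 1), dtype=torch.long)

# Stand-in for the (1, T) tensor that model.generate(start, ...) would return.
generated = torch.cat((start, torch.tensor([[2, 3, 1, 4]])), dim=1)

ids = generated[0].tolist()   # drop the batch dimension, convert to a Python list
text = decode(ids)
print(repr(text))  # '\nab c'
```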

[33:49] now one more thing I wanted to point out

[33:50] here is this function is written to be

[33:53] General but it's kind of like ridiculous

[33:55] right now because

[33:58] we're feeding in all this we're building

[33:59] out this context and we're concatenating

[34:02] it all and we're always feeding it all

[34:05] into the model but that's kind of

[34:07] ridiculous because this is just a simple

[34:09] bigram model. So to make, for example, this

[34:11] prediction about K we only needed this W

[34:14] but actually what we fed into the model

[34:15] is we fed the entire sequence and then

[34:18] we only looked at the very last piece

[34:20] and predicted K so the only reason I'm

[34:23] writing it in this way is because right

[34:25] now this is a bigram model, but I'd like to

[34:27] keep this function fixed, and I'd

[34:29] like it to work later, when our

[34:32] characters actually basically look

[34:35] further in the history and so right now

[34:37] the history is not used so this looks

[34:39] silly uh but eventually the history will

[34:42] be used and so that's why we want to uh

[34:44] do it this way so just a quick comment

[34:46] on that so now we see that this is um

[34:49] random so let's train the model so it

[34:51] becomes a bit less random okay let's Now

[34:53] train the model so first what I'm going

[34:55] to do is I'm going to create a PyTorch

[34:57] optimization object. So here we are using

[35:00] the optimizer AdamW. Now, in the makemore

[35:05] series we've only ever used stochastic

[35:06] gradient descent, the simplest possible

[35:08] optimizer, which you can get using

[35:10] SGD instead, but I want to use Adam, which

[35:12] is a much more advanced and popular

[35:14] optimizer, and it works extremely well.

[35:16] A typical good setting for the

[35:19] learning rate is roughly 3e-4, but for

[35:22] very, very small networks, like is the

[35:23] case here, you can get away with much,

[35:25] much higher learning rates, 1e-3 or even

[35:28] higher probably. But let me create the

[35:30] optimizer object which will basically

[35:33] take the gradients and uh update the

[35:35] parameters using the

[35:36] gradients and then here our batch size

[35:40] up above was only four so let me

[35:41] actually use something bigger let's say

[35:43] 32 and then for some number of steps um

[35:46] we are sampling a new batch of data

[35:48] we're evaluating the loss uh we're

[35:51] zeroing out all the gradients from the

[35:52] previous step getting the gradients for

[35:54] all the parameters and then using those

[35:56] gradients to update our parameters. So a

[35:58] typical training loop, as we saw in the

[36:00] makemore series. So let me now run

[36:04] this for say 100 iterations and let's

[36:07] see what kind of losses we're going to

[36:09] get so we started around

[36:12] 4.7, and now we're getting down to

[36:14] like 4.6, 4.5, etc. So the optimization is

[36:18] definitely happening but um let's uh

[36:22] sort of try to increase number of

[36:23] iterations and only print at the

[36:25] end, because we probably want to train for

[36:29] longer okay so we're down to 3.6

[36:34] roughly roughly down to

[36:40] three this is the most janky

[36:46] optimization okay it's working let's

[36:48] just do

[36:50] 10,000 and then from here we want to

[36:53] copy this, and hopefully we're going

[36:56] to get something reasonable. And of course

[36:58] it's not going to be Shakespeare from a

[37:00] bigram model, but at least we see that the

[37:01] loss is improving and uh hopefully we're

[37:05] expecting something a bit more

[37:06] reasonable okay so we're down at about

[37:08] 2.5-ish. Let's see what we get. Okay,

[37:12] dramatic improvements, certainly, on what

[37:14] we had here. So let me just increase the

[37:17] number of tokens. Okay, so we see that

[37:19] we're starting to get something at least

[37:21] like reasonable-ish,

[37:25] certainly not Shakespeare, but the

[37:29] model is making progress so that is the

[37:31] simplest possible

[37:33] model so now what I'd like to do

[37:36] is obviously this is a very simple model

[37:39] because the tokens are not talking to

[37:41] each other so given the previous context

[37:43] of whatever was generated we're only

[37:45] looking at the very last character to

[37:46] make the predictions about what comes

[37:48] next so now these uh now these tokens

[37:50] have to start talking to each other and

[37:53] figuring out what is in the context so

[37:55] that they can make better predictions

[37:56] for what comes next and this is how

[37:57] we're going to kick off the uh

[37:59] Transformer okay so next I took the code

[38:02] that we developed in this Jupyter notebook

[38:03] and I converted it to be a script and

[38:05] I'm doing this because I just want to

[38:08] simplify our intermediate work into just

[38:10] the final product that we have at this

[38:12] point so in the top here I put all the

[38:15] hyperparameters that we defined. I

[38:16] introduced a few, and I'm going to speak

[38:18] to that in a little bit. Otherwise a lot

[38:20] of this should be recognizable:

[38:23] reproducibility, read data, get the

[38:25] encoder and the decoder, create the train

[38:27] and test splits, use the kind of like

[38:30] data loader that gets a batch of the

[38:34] inputs and targets. This is new, and I'll

[38:36] talk about it in a second. Now this is

[38:39] the bigram language model that we

[38:40] developed and it can forward and give us

[38:43] a logits and loss and it can

[38:45] generate and then here we are creating

[38:48] the optimizer and this is the training

[38:51] Loop so everything here should look

[38:53] pretty familiar now some of the small

[38:55] things that I added number one I added

[38:57] the ability to run on a GPU if you have

[39:00] it. So if you have a GPU, then

[39:02] this will use CUDA instead of just CPU,

[39:04] and everything will be a lot faster.

[39:07] now when device becomes Cuda then we

[39:09] need to make sure that when we load the

[39:11] data we move it to

[39:13] device when we create the model we want

[39:15] to move uh the model parameters to

[39:18] device. So as an example, here we have the

[39:21] nn.Embedding table, and it's got a

[39:23] .weight inside it, which stores the

[39:26] sort of lookup table, so that would be

[39:27] moved to the GPU so that all the

[39:29] calculations here happen on the GPU and

[39:32] they can be a lot faster and then

[39:34] finally here when I'm creating the

[39:35] context that feeds in to generate I have

[39:37] to make sure that I create it on the

[39:39] device number two what I introduced is

[39:43] the fact that here in the training

[39:46] loop I was just printing the

[39:50] loss.item() inside the training loop, but this

[39:53] is a very noisy measurement of the

[39:54] current loss because every batch will be

[39:56] more or less lucky and so what I want to

[39:59] do usually is I have an

[40:02] estimate_loss function, and estimate_loss

[40:05] basically then goes up here, and it

[40:10] averages up the loss over multiple

[40:12] batches. So in particular we're going to

[40:15] iterate eval_iters times, and we're going

[40:17] to basically get our loss and then we're

[40:19] going to get the average loss for both

[40:21] splits and so this will be a lot less

[40:24] noisy. So here, when we call

[40:26] estimate_loss, we're going to report the

[40:28] pretty accurate train and validation

[40:31] loss now when we come back up you'll

[40:33] notice a few things here I'm setting the

[40:35] model to evaluation phase and down here

[40:38] I'm resetting it back to training phase

[40:40] now right now for our model as is this

[40:42] doesn't actually do anything because the

[40:44] only thing inside this model is this

[40:46] nn.Embedding, and this

[40:51] network would behave

[40:53] the same in both evaluation mode and

[40:55] training mode. We have no dropout layers,

[40:57] we have no batch norm layers, etc. But it is a

[41:00] good practice to think through what mode

[41:02] your neural network is in, because some

[41:04] layers will have different behavior

[41:07] at inference time or training time. And

[41:11] there's also this context manager

[41:12] torch.no_grad, and this is just telling

[41:14] PyTorch that everything that happens

[41:16] inside this function, we will not call

[41:18] .backward() on, and so PyTorch can be a lot

[41:21] more efficient with its memory use

[41:23] because it doesn't have to store all the

[41:25] intermediate variables uh because we're

[41:27] never going to call backward and so it

[41:29] can be a lot more memory

[41:30] efficient in that way. So also a good

[41:32] practice to tell PyTorch when we don't

[41:35] intend to do

[41:36] backpropagation. So right now this script is

[41:39] about 120 lines of code, and that's

[41:43] kind of our starter code. I'm calling it

[41:45] bigram.py, and I'm going to release it later.
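
A condensed sketch of how such a script might fit together is shown below: device selection, a batch loader that moves data to the device, an `estimate_loss` helper wrapped in `torch.no_grad`, and an AdamW training loop. It is not the released file. The random `data` tensor stands in for the encoded text, and the small values for `max_iters`, `eval_interval`, and `eval_iters` are assumptions chosen so the block runs in a few seconds.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
device = "cuda" if torch.cuda.is_available() else "cpu"
vocab_size, batch_size, block_size = 65, 32, 8
max_iters, eval_interval, eval_iters = 200, 100, 20   # kept small for a quick run

# Stand-in for the encoded dataset (assumption): random token ids.
data = torch.randint(0, vocab_size, (5000,))
n = int(0.9 * len(data))
splits = {"train": data[:n], "val": data[n:]}

def get_batch(split):
    source = splits[split]
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i:i + block_size] for i in ix])
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)   # move each batch to the device

class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        logits = self.token_embedding_table(idx)
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

model = BigramLanguageModel().to(device)   # move the parameters to the device

@torch.no_grad()   # backward() is never called in here, so skip gradient bookkeeping
def estimate_loss():
    out = {}
    model.eval()   # evaluation mode (no effect for this model, but a good habit)
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean().item()   # average over several batches
    model.train()  # back to training mode
    return out

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(max_iters):
    if step % eval_interval == 0:
        est = estimate_loss()
        print(f"step {step}: train loss {est['train']:.4f}, val loss {est['val']:.4f}")
    xb, yb = get_batch("train")
    _, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)   # clear gradients from the previous step
    loss.backward()                         # compute gradients for all parameters
    optimizer.step()                        # update the parameters

final = estimate_loss()
```

On random stand-in data the loss cannot drop far below the uniform baseline; on the real text the video reports it converging to around 2.5.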

[41:48] now running this

[41:50] script gives us output in the terminal

[41:52] and it looks something like this it

[41:54] basically as I ran this code uh it was

[41:57] giving me the train loss and Val loss

[41:59] and we see that we converge to somewhere

[42:01] around

[42:01] 2.5 with the bigram model, and then here's

[42:04] the sample that we produced at the

[42:07] end and so we have everything packaged

[42:09] up in the script and we're in a good

[42:11] position now to iterate on this okay so

[42:13] we are almost ready to start writing our

[42:15] very first self attention block for

[42:18] processing these uh tokens now before we

[42:22] actually get there I want to get you

[42:24] used to a mathematical trick that is

[42:26] used in the self attention inside a

[42:28] Transformer and is really just like at

[42:30] the heart of an efficient

[42:32] implementation of self attention and so

[42:34] I want to work with this toy example to

[42:36] just get you used to this operation and

[42:38] then it's going to make it much more

[42:39] clear once we actually get to um to it

[42:43] uh in the script

[42:44] again. So let's create a B by T by C, where

[42:47] B, T, and C are just 4, 8, and 2 in the toy

[42:50] example. And these are basically channels,

[42:53] and we have batches, and we have the

[42:55] time component, and we have information

[42:58] at each point in the sequence, so

[43:01] C. Now what we would like to do is, we

[43:03] would like these um tokens so we have up

[43:06] to eight tokens here in a batch and

[43:08] these eight tokens are currently not

[43:10] talking to each other and we would like

[43:11] them to talk to each other we'd like to

[43:13] couple them, and in particular

[43:17] we want to couple them in a very

[43:18] specific way so the token for example at

[43:21] the fifth location it should not

[43:23] communicate with tokens in the sixth

[43:25] seventh and eighth location

[43:27] because uh those are future tokens in

[43:29] the sequence the token on the fifth

[43:31] location should only talk to the one in

[43:33] the fourth third second and first so

[43:36] it's only so information only flows from

[43:38] previous context to the current time

[43:40] step and we cannot get any information

[43:42] from the future because we are about to

[43:44] try to predict the

[43:45] future so what is the easiest way for

[43:49] tokens to communicate okay the easiest

[43:52] way, I would say, is: okay,

[43:54] if I'm the fifth token and I'd like to

[43:56] communicate with my past, the simplest

[43:58] way we can do that is

[44:00] to just do an average of all

[44:03] the preceding elements. So

[44:06] for example, if I'm the fifth token, I would

[44:08] like to take the channels that make

[44:10] up, that are information at my step, but

[44:13] then also the channels from the fourth

[44:15] step third step second step and the

[44:17] first step I'd like to average those up

[44:19] and then that would become sort of like

[44:21] a feature Vector that summarizes me in

[44:23] the context of my history now of course

[44:26] just doing a sum or like an average is

[44:28] an extremely weak form of interaction

[44:30] like this communication is uh extremely

[44:32] lossy we've lost a ton of information

[44:34] about the spatial Arrangements of all

[44:35] those tokens uh but that's okay for now

[44:38] we'll see how we can bring that

[44:39] information back later for now what we

[44:41] would like to do is for every single

[44:43] batch element independently for every

[44:46] t-th token in that sequence, we'd like

[44:49] to now calculate the average of all the

[44:53] vectors in all the previous tokens and

[44:55] also at this token so let's write that

[44:58] out um I have a small snippet here and

[45:01] instead of just fumbling around let me

[45:03] just copy paste it and talk to

[45:05] it. So in other words, we're going to

[45:08] create xbow, and bow is short for bag of words,

[45:12] because bag of words is um is kind of

[45:15] like um a term that people use when you

[45:17] are just averaging up things so this is

[45:19] just a bag of words basically there's a

[45:21] word stored on every one of these eight

[45:23] locations and we're doing a bag of words

[45:25] we're just averaging

[45:27] so in the beginning we're going to say

[45:28] that it's just initialized at Zero and

[45:30] then I'm doing a for Loop here so we're

[45:32] not being efficient yet that's coming

[45:34] but for now we're just iterating over

[45:36] all the batch Dimensions independently

[45:38] iterating over time and then the

[45:40] previous uh tokens are at this uh batch

[45:45] Dimension and then everything up to and

[45:47] including the t-th token. Okay, so when

[45:51] we slice out x in this way, xprev

[45:54] becomes of shape how many t elements

[45:58] there were in the past and then of

[46:00] course C so all the two-dimensional

[46:02] information from these little tokens so

[46:05] that's the previous uh sort of chunk of

[46:08] um tokens from my current sequence and

[46:12] then I'm just doing the average or the

[46:13] mean over the zero Dimension so I'm

[46:15] averaging out the time here and I'm just

[46:19] going to get a little C, one-dimensional

[46:21] vector, which I'm going to store in xbow.

[46:23] So I can run this, and

[46:27] this is not going to be very informative,

[46:30] because, let's see, so this is x[0], so

[46:32] this is the zeroth batch element, and

[46:35] then xbow at zero. Now you see how at

[46:40] the first location here, you see that the

[46:42] two are equal, and that's because

[46:45] we're just doing an average of this one

[46:46] token. But here this one is now an

[46:49] average of these two and now this one is

[46:53] an average of these

[46:54] three and so on
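
The explicit-loop version of this running average can be sketched as below, with random values standing in for real token features. The names `x`, `xprev`, and `xbow` follow the ones used in the video.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2   # batch, time, channels
x = torch.randn(B, T, C)

# xbow[b, t] = mean of x[b, 0..t], i.e. the current token and everything before it.
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t + 1]            # (t + 1, C): tokens up to and including t
        xbow[b, t] = xprev.mean(dim=0)  # average over time -> (C,)

print(x[0])
print(xbow[0])
```

The first row of `xbow[0]` equals the first row of `x[0]` (an average of one token), and the last row is the average of all eight tokens.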

[46:57] so uh and this last one is the average

[47:01] of all of these elements so vertical

[47:03] average just averaging up all the tokens

[47:05] now gives this outcome

[47:07] here so this is all well and good uh but

[47:10] this is very inefficient now the trick

[47:12] is that we can be very very efficient

[47:14] about doing this using matrix

[47:16] multiplication so that's the

[47:18] mathematical trick and let me show you

[47:19] what I mean let's work with the toy

[47:21] example here let me run it and I'll

[47:24] explain I have a simple Matrix here that

[47:27] is a 3X3 of all ones a matrix B of just

[47:31] random numbers and it's a 3x2 and a

[47:33] matrix C which will be 3x3 multip 3x2

[47:36] which will give out a 3x2 so here we're

[47:39] just using um matrix multiplication so a

[47:43] multiply B gives us

[47:46] C okay so how are these numbers in C um

[47:51] achieved right so this number in the top

[47:54] left is the first row of a dot product

[47:57] with the First Column of B and since all

[48:00] the the row of a right now is all just

[48:02] ones then the do product here with with

[48:05] this column of B is just going to do a

[48:07] sum of these of this column so 2 + 6 + 6

[48:11] is

[48:12] 14 the element here in the output of C

[48:15] is again the first

[48:17] row of a multiplied now with the second

[48:20] column of B so 7 + 4 + 5 is 16 now you

[48:25] see that there's repeating elements here

[48:26] so this 14 again is because this row is

[48:28] again all ones and it's multiplying the

[48:30] First Column of B so we get 14 and this

[48:33] one is and so on so this last number

[48:35] here is the last row dot product last

[48:39] column now the trick here is uh the

[48:42] following this is just a boring number

[48:44] of um it's just a boring array of all

[48:48] ones but torch has this function called

[48:50] tril which is short for

[48:54] triangular, or something like that, and

[48:56] you can wrap torch.ones in it and it

[48:58] will just return the lower triangular

[49:00] portion of this

[49:03] okay so now it will basically zero out

[49:06] uh these guys here so we just get the

[49:08] lower triangular part well what happens

[49:10] if we do

[49:14] that so now we'll have a like this and B

[49:17] like this and now what are we getting

[49:18] here in C well what is this number well

[49:22] this is the first row times the First

[49:24] Column and because this is zeros

[49:28] uh these elements here are now ignored

[49:30] so we just get a two and then this

[49:32] number here is the first row times the

[49:35] second column and because these are

[49:37] zeros they get ignored and it's just

[49:39] seven this seven multiplies this one but

[49:42] look what happened here because this is

[49:43] one and then zeros we what ended up

[49:46] happening is we're just plucking out the

[49:48] row of this row of B and that's what we

[49:51] got now here we have 1, 1, 0 so here 1, 1, 0

[49:57] do product with these two columns will

[49:59] now give us 2 + 6 which is 8 and 7 + 4

[50:02] which is 11 and because this is 1, 1, 1 we

[50:05] ended up with the addition of all of

[50:07] them and so basically depending on how

[50:10] many ones and zeros we have here we are

[50:12] basically doing a sum currently of a

[50:16] variable number of these rows and that

[50:18] gets deposited into

[50:20] C So currently we're doing sums because

[50:23] these are ones but we can also do

[50:25] average right and you can start to see

[50:27] how we could do average uh of the rows

[50:29] of B uh sort of in an incremental

[50:32] fashion because we don't have to we can

[50:35] basically normalize these rows so that

[50:37] they sum to one and then we're going to

[50:39] get an average so if we took a and then

[50:41] we did a equals

[50:43] a divided by torch.sum of a in

[50:51] dimension 1 and then let's set keepdim

[50:55] to True so that the broadcasting

[50:57] will work out so if I rerun this you see

[51:00] now that these rows now sum to one so

[51:04] this row is 1, this row is 0.5, 0.5, 0 and

[51:07] here we get 1/3 and now when we do a

[51:09] multiply B what are we getting here we

[51:12] are just getting the first row first row

[51:15] here now we are getting the average of

[51:18] the first two

[51:20] rows okay so 2 and six average is four

[51:23] and four and seven average is

[51:25] 5.5 and on the bottom here we are now

[51:27] getting the average of these three rows

[51:31] so the average of all of elements of B

[51:33] are now deposited here and so you can

[51:36] see that by manipulating these uh

[51:40] elements of this multiplying Matrix and

[51:42] then multiplying it with any given

[51:44] Matrix we can do these averages in this

[51:47] incremental fashion because we just get

[51:50] um and we can manipulate that based on

[51:53] the elements of a okay so that's very

[51:55] convenient so let's let's swing back up

[51:57] here and see how we can vectorize this

[51:59] and make it much more efficient using

[52:00] what we've learned so in

[52:03] particular we are going to produce an

[52:05] array a but here I'm going to call it wei

[52:08] short for weights but this is our

[52:11] a and this is how much of every row we

[52:14] want to average up and it's going to be

[52:17] an average because you can see that

[52:18] these rows sum to

[52:20] one so this is our a and then our B in

[52:23] this example of course is X
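For reference, the toy version of this averaging (with `a` playing the role of the weights and `b` the role of x) might look like the sketch below. The values of `b` are the ones read off in the walkthrough; the names `sums` and `avgs` are introduced here for illustration.

```python
import torch

b = torch.tensor([[2.0, 7.0],
                  [6.0, 4.0],
                  [6.0, 5.0]])

a = torch.tril(torch.ones(3, 3))
sums = a @ b                             # running sums: [[2, 7], [8, 11], [14, 16]]

a = a / torch.sum(a, 1, keepdim=True)    # normalize so each row sums to 1
avgs = a @ b                             # running averages: [[2, 7], [4, 5.5], [14/3, 16/3]]
```

Passing `keepdim=True` keeps the row sums as a 3x1 column, so the division broadcasts across each row rather than across columns.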

[52:27] so what's going to happen here now is

[52:29] that we are going to have an xbow2

[52:31] and this xbow2 is going to be wei

[52:36] multiplying

[52:38] our x so let's think this through: wei is T by T

[52:42] and this is matrix multiplying in

[52:44] PyTorch a B by T by

[52:47] C and it's giving us what

[52:50] shape so pytorch will come here and it

[52:52] will see that these shapes are not the

[52:54] same so it will create a batch Dimension

[52:57] here and this is a batched matrix

[53:00] multiply and so it will apply this

[53:02] matrix multiplication in all the batch

[53:04] elements um in parallel and individually

[53:08] and then for each batch element there

[53:09] will be a T by T multiplying T by C

[53:12] exactly as we had

[53:15] below so this will now create B by T by

[53:20] C and xbow2 will now become identical

[53:24] to xbow

[53:28] so we can see that torch.allclose of

[53:32] xbow and xbow2 should be True

[53:36] now so this kind of like convinces us

[53:38] that uh these are in fact um the same so

[53:43] xbow and xbow2 if I just print

[53:47] them uh okay we're not going to be able

[53:49] to

[53:51] just stare it down but

[53:54] um well let me try xbow basically just

[53:56] at the zeroth element and xbow2 at

[53:58] the zeroth element so just the first

[53:59] batch and we should see that this and

[54:02] that should be identical which they

[54:04] are right so what happened here the

[54:07] trick is we were able to use batched

[54:09] Matrix multiply to do this uh

[54:12] aggregation really and it's a weighted

[54:15] aggregation and the weights are

[54:17] specified in this um T by T array and

[54:21] we're basically doing weighted sums and

[54:24] uh these weighted sums are

[54:26] according to the weights inside here

[54:28] they take on sort of this triangular

[54:31] form and so that means that a token at

[54:33] the t-th position will only get uh sort

[54:36] of um information from the um tokens

[54:39] preceding it so that's exactly what we

[54:41] want and finally I would like to rewrite

[54:43] it in one more way and we're going to

[54:46] see why that's useful so this is the

[54:48] third version and it's also identical to

[54:50] the first and second but let me talk

[54:53] through it it uses

[54:54] softmax so tril here is this matrix

[55:00] lower triangular

[55:01] ones wei begins as all

[55:05] zero okay so if I just print wei in the

[55:07] beginning it's all zero then I

[55:11] used masked_fill so what this is doing

[55:15] is wei.masked_fill it's all zeros and

[55:18] I'm saying for all the elements where

[55:20] tril == 0 make them be

[55:23] negative infinity so all the elements

[55:26] where tril is zero will become negative

[55:28] Infinity now so this is what we get and

[55:32] then the final line here is

[55:36] softmax so if I take a softmax along

[55:38] every single so dim is negative one so

[55:40] along every single row if I do softmax

[55:44] what is that going to

[55:46] do well softmax is um is also like a

[55:51] normalization operation right and so

[55:54] spoiler alert you get the exact same

[55:58] matrix let me bring back the

[56:00] softmax and recall that in softmax we're

[56:02] going to exponentiate every single one

[56:04] of these and then we're going to divide

[56:06] by the sum and so if we exponentiate

[56:10] every single element here we're going to

[56:11] get a one and here we're going to get uh

[56:14] basically zero, zero, zero everywhere else

[56:17] and then when we normalize we just get

[56:19] one here we're going to get one one and

[56:21] then zeros and then softmax will again

[56:24] divide and this will give us 0.5, 0.5 and so

[56:27] on and so this is also the uh the same

[56:30] way to produce uh this mask now the

[56:33] reason that this is a bit more

[56:34] interesting and the reason we're going

[56:36] to end up using it in self

[56:37] attention is that these weights here

[56:41] begin uh with zero and you can think of

[56:44] this as like an interaction strength or

[56:46] like an affinity so basically it's

[56:49] telling us how much of each uh token

[56:52] from the past do we want to Aggregate

[56:54] and average up
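The three versions discussed so far can be compared side by side. This is a sketch under the assumed toy shapes B=4, T=8, C=2; the names `wei2`, `wei3`, and `xbow3` are used here only to keep the versions apart, and the loop version is written compactly as a reference.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)   # illustrative seed
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1 (reference): average of tokens 0..t for every t, done one step at a time.
xbow = torch.stack([x[:, : t + 1].mean(dim=1) for t in range(T)], dim=1)  # (B, T, C)

# Version 2: normalized lower-triangular weights and a batched matrix multiply.
tril = torch.tril(torch.ones(T, T))
wei2 = tril / tril.sum(1, keepdim=True)
xbow2 = wei2 @ x   # (T, T) @ (B, T, C) broadcasts to (B, T, C)

# Version 3: start from zero affinities, mask the future with -inf, then softmax.
wei3 = torch.zeros((T, T))
wei3 = wei3.masked_fill(tril == 0, float("-inf"))
wei3 = F.softmax(wei3, dim=-1)
xbow3 = wei3 @ x
```

Every row of the masked matrix keeps at least its diagonal entry, so the softmax never sees a row made entirely of negative infinity.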

[56:57] and then this line is saying tokens from

[56:59] the future cannot communicate by setting

[57:02] them to negative Infinity we're saying

[57:04] that we will not aggregate anything from

[57:06] those

[57:07] tokens and so basically this then goes

[57:09] through softmax and through the weighted

[57:11] and this is the aggregation through

[57:12] matrix

[57:14] multiplication and so what this is now

[57:16] is you can think of these as um these

[57:19] zeros are currently just set by us to be

[57:21] zero but a quick preview is that these

[57:25] affinities between the tokens are not

[57:27] going to be just constant at zero

[57:29] they're going to be data dependent these

[57:31] tokens are going to start looking at

[57:32] each other and some tokens will find

[57:34] other tokens more or less interesting

[57:37] and depending on what their values are

[57:39] they're going to find each other

[57:41] interesting to different amounts and I'm

[57:42] going to call those affinities I think

[57:45] and then here we are saying the future

[57:47] cannot communicate with the past we're

[57:49] we're going to clamp them and then when

[57:51] we normalize and sum we're going to

[57:53] aggregate uh sort of their values

[57:56] depending on how interesting they find

[57:57] each other and so that's the preview for

[57:59] self attention and basically long story

[58:03] short from this entire section is that

[58:05] you can do weighted aggregations of your

[58:07] past

[58:08] Elements by having by using matrix

[58:12] multiplication of a lower triangular

[58:14] fashion and then the elements here in

[58:17] the lower triangular part are telling

[58:18] you how much of each element uh fuses

[58:21] into this position so we're going to use

[58:24] this trick now to develop the self

[58:25] attention block block so first let's get

[58:27] some quick preliminaries out of the way

[58:30] first the thing I'm kind of bothered by

[58:31] is that you see how we're passing in

[58:33] vocab_size into the constructor there's

[58:35] no need to do that because vocab_size is

[58:36] already defined uh up top as a global

[58:38] variable so there's no need to pass this

[58:40] stuff

[58:41] around next what I want to do is I don't

[58:44] want to actually create I want to create

[58:46] like a level of indirection here where

[58:47] we don't directly go to the embedding

[58:49] for the um logits but instead we go

[58:52] through this intermediate phase because

[58:54] we're going to start making that bigger

[58:57] so let me introduce a new variable

[58:59] n_embd, short for number of embedding

[59:02] dimensions so

[59:04] n_embd here will be say 32 that was a

[59:09] suggestion from GitHub Copilot by the

[59:11] way um it also suggested 32 which is a good

[59:14] number so this is an embedding table and

[59:16] only 32 dimensional

[59:18] embeddings so then here this is not

[59:21] going to give us logits directly instead

[59:23] this is going to give us token

[59:24] embeddings that's what I'm going to call it

[59:27] and then to go from the token embeddings to

[59:29] the logits we're going to need a linear

[59:30] layer so self.lm_head let's call it

[59:34] short for language modeling head is

[59:36] nn.Linear from n_embd up to vocab_size

[59:39] and then when we swing over here we're

[59:41] actually going to get the logits by

[59:43] exactly what Copilot says now we

[59:46] have to be careful here because this C

[59:48] and this C are not equal um this is n_embd

[59:52] C and this is vocab_size so let's just

[59:55] say that n_embd is equal to

[59:57] C and then this just creates one spurious

[1:00:01] layer of interaction through a linear

[1:00:02] layer but uh this should basically

[1:00:11] run so we see that this runs and uh this

[1:00:15] currently looks kind of spurious but uh

[1:00:17] we're going to build on top of this now

[1:00:19] next up so far we've taken these indices

[1:00:22] and we've encoded them based on the

[1:00:23] identity of the uh tokens inside idx

[1:00:28] the next thing that people very often do

[1:00:30] is that we're not just encoding the

[1:00:31] identity of these tokens but also their

[1:00:33] position so we're going to have a second

[1:00:35] position uh embedding table here so

[1:00:38] self.position_embedding_table is an

[1:00:41] nn.Embedding of block_size by n_embd and

[1:00:44] so each position from zero to block size

[1:00:46] minus one will also get its own

[1:00:47] embedding vector and then here first let

[1:00:50] me decode B by T from idx.shape

[1:00:54] and then here we're also going to

[1:00:56] have a pos_emb which is the

[1:00:58] positional embedding and this

[1:01:00] is torch.arange so this will be basically

[1:01:03] just integers from Z to T minus one and

[1:01:06] all of those integers from 0 to T minus

[1:01:08] one get embedded through the table to

[1:01:09] create a t by

[1:01:11] C and then here this gets renamed to

[1:01:14] just say x and x will be the addition of

[1:01:18] the token embeddings with the positional

[1:01:20] embeddings and here the broadcasting

[1:01:22] note will work out so B by T by C plus T

[1:01:25] by C

[1:01:26] this gets right aligned a new dimension

[1:01:28] of one gets added and it gets

[1:01:30] broadcasted across

[1:01:31] batch so at this point x holds not just

[1:01:34] the token identities but the positions

[1:01:37] at which these tokens occur and this is

[1:01:39] currently not that useful because of

[1:01:41] course we just have a simple bigram model

[1:01:43] so it doesn't matter if you're in the

[1:01:44] fifth position the second position or

[1:01:46] wherever it's all translation invariant

[1:01:48] at this stage uh so this information

[1:01:50] currently wouldn't help uh but as we

[1:01:52] work on the self attention block we'll

[1:01:54] see that this starts to matter
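Putting the pieces from this passage together, the model at this stage might look roughly like the sketch below. It is a simplified reconstruction: the loss computation and generation method are omitted, the class name and the sizes (vocab_size of 65, block_size of 8, a batch of 4) are assumptions taken from earlier in the walkthrough, and `device=idx.device` is a choice made here for self-containment.

```python
import torch
import torch.nn as nn

vocab_size = 65   # character vocabulary size mentioned earlier
block_size = 8    # maximum context length (assumed toy value)
n_embd = 32       # number of embedding dimensions

class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Token identity and token position each get their own embedding table.
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # Linear layer mapping embeddings to logits over the vocabulary.
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)            # (B, T, n_embd)
        pos = torch.arange(T, device=idx.device)             # integers 0 .. T-1
        pos_emb = self.position_embedding_table(pos)         # (T, n_embd)
        x = tok_emb + pos_emb                                # broadcast over batch: (B, T, n_embd)
        logits = self.lm_head(x)                             # (B, T, vocab_size)
        return logits

model = BigramLanguageModel()
idx = torch.randint(0, vocab_size, (4, block_size))
logits = model(idx)
```

Because the position table has only `block_size` rows, T must not exceed `block_size` when calling the model.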

[1:01:59] okay so now we get to the crux of self

[1:02:01] attention so this is probably the most

[1:02:03] important part of this video to

[1:02:05] understand we're going to implement a

[1:02:07] small self attention for a single

[1:02:08] individual head as they're called so we

[1:02:11] start off with where we were so all of

[1:02:13] this code is familiar so right now I'm

[1:02:16] working with an example where I changed

[1:02:17] the number of channels from 2 to 32 so

[1:02:20] we have a 4x8 arrangement of tokens and

[1:02:24] the information at each token

[1:02:26] is currently 32 dimensional but we just

[1:02:28] are working with random

[1:02:30] numbers now we saw here that the code as

[1:02:34] we had it before does a uh

[1:02:37] simple average of all the past tokens

[1:02:41] and the current token so it's just the

[1:02:43] previous information and current

[1:02:44] information is just being mixed together

[1:02:45] in an average and that's what this code

[1:02:48] currently achieves and it does so by

[1:02:50] creating this lower triangular structure

[1:02:52] which allows us to mask out this uh wei

[1:02:55] matrix that we create so we mask it

[1:02:59] out and then we normalize it and

[1:03:01] currently when we initialize the

[1:03:03] affinities between all the different

[1:03:05] sort of tokens or nodes I'm going to use

[1:03:08] those terms

[1:03:09] interchangeably so when we initialize

[1:03:11] the affinities between all the different

[1:03:13] tokens to be zero then we see that wei

[1:03:16] gives us this um structure where every

[1:03:18] single row has these um uniform numbers

[1:03:22] and so that's what that's what then uh

[1:03:25] in this Matrix multiply makes it so that

[1:03:27] we're doing a simple

[1:03:28] average now we don't actually want this

[1:03:32] to be all uniform because different uh

[1:03:36] tokens will find different other tokens

[1:03:38] more or less interesting and we want

[1:03:40] that to be data dependent so for example

[1:03:42] if I'm a vowel then maybe I'm looking

[1:03:44] for consonants in my past and maybe I

[1:03:46] want to know what those consonants are

[1:03:48] and I want that information to flow to

[1:03:50] me and so I want to now gather

[1:03:52] information from the past but I want to

[1:03:54] do it in the data dependent way and this

[1:03:56] is the problem that self attention

[1:03:58] solves now the way self attention solves

[1:04:00] this is the following every single node

[1:04:03] or every single token at each position

[1:04:06] will emit two vectors it will emit a

[1:04:09] query and it will emit a

[1:04:12] key now the query Vector roughly

[1:04:15] speaking is what am I looking for and

[1:04:18] the key Vector roughly speaking is what

[1:04:20] do I

[1:04:21] contain and then the way we get

[1:04:24] affinities between these uh tokens now

[1:04:27] in a sequence is we basically just do a

[1:04:29] dot product between the keys and the

[1:04:31] queries so my query dot products with

[1:04:35] all the keys of all the other tokens and

[1:04:37] that dot product now becomes

[1:04:41] wei and so um if the key and the query

[1:04:45] are sort of aligned they will interact

[1:04:47] to a very high amount and then I will

[1:04:50] get to learn more about that specific

[1:04:52] token as opposed to any other token in

[1:04:55] the sequence

[1:04:56] so let's implement this

[1:05:00] now we're going to implement a

[1:05:03] single what's called head of self

[1:05:07] attention so this is just one head

[1:05:09] there's a hyper parameter involved with

[1:05:10] these heads which is the head size and

[1:05:13] then here I'm initializing linear

[1:05:15] modules and I'm using bias equals false

[1:05:18] so these are just going to apply a

[1:05:19] matrix multiply with some fixed

[1:05:21] weights and now let me produce a key and

[1:05:26] q k and Q by forwarding these modules on

[1:05:29] X so the size of this will now

[1:05:32] become B by T by 16 because that is the

[1:05:36] head size and the same here B by T by

[1:05:44] 16 so this being the head size so you

[1:05:47] see here that when I forward this linear

[1:05:49] on top of my X all the tokens in all the

[1:05:52] positions in the B BYT Arrangement all

[1:05:55] of them them in parallel and

[1:05:57] independently produce a key and a query

[1:05:59] so no communication has happened

[1:06:01] yet but the communication comes now all

[1:06:04] the queries will dot product with all the

[1:06:07] keys so basically what we want is we

[1:06:09] want wei now or the affinities between

[1:06:12] these to be query multiplying key but we

[1:06:16] have to be careful with uh we can't

[1:06:18] Matrix multiply this we actually need to

[1:06:20] transpose uh K but we have to be also

[1:06:23] careful because we have

[1:06:25] the batch dimension so in particular we

[1:06:27] want to transpose uh the last two

[1:06:30] dimensions, dimension -1 and dimension -2,

[1:06:33] so

[1:06:36] -2, -1 and so this matrix multiply now will

[1:06:40] basically do the following B by T by

[1:06:44] 16 Matrix multiplies B by 16 by T to

[1:06:49] give us B by T by

[1:06:53] T right
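The key and query computation described above can be sketched as follows, assuming the toy shapes B=4, T=8, C=32 and a head size of 16; the random input stands in for real token embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)   # illustrative seed
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)

k = key(x)     # (B, T, 16): what each token contains
q = query(x)   # (B, T, 16): what each token is looking for

# Transpose only the last two dimensions of k so the batch dimension is preserved.
wei = q @ k.transpose(-2, -1)   # (B, T, 16) @ (B, 16, T) -> (B, T, T)
```

Each `wei[b]` is a T by T matrix of raw affinities for one batch element, before any masking or softmax.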

[1:06:56] so for every batch element we're now going to

[1:06:58] have a t Square Matrix giving us the

[1:07:04] affinities and these are now the wei so

[1:07:04] they're not zeros they are now coming

[1:07:06] from this dot product between the keys

[1:07:08] and the queries so this can now run I

[1:07:11] can I can run this and the weighted

[1:07:13] aggregation now is a function in a data

[1:07:16] dependent manner between the keys and

[1:07:18] queries of these nodes so just

[1:07:20] inspecting what happened

[1:07:22] here wei takes on this form

[1:07:26] and you see that before wei was uh just

[1:07:29] a constant so it was applied in the same

[1:07:31] way to all the batch elements but now

[1:07:33] every single batch element will have

[1:07:34] a different sort of wei because uh every

[1:07:37] single batch element contains different

[1:07:39] uh tokens at different positions and so

[1:07:41] this is now data dependent so when we

[1:07:44] look at just the zeroth uh Row for

[1:07:47] example in the input these are the

[1:07:49] weights that came out and so you can see

[1:07:51] now that they're not just exactly

[1:07:53] uniform um and in particular as an

[1:07:55] example here for the last row this was

[1:07:58] the eighth token and the eighth token

[1:08:00] knows what content it has and it knows

[1:08:02] at what position it's in and now the eighth

[1:08:04] token based on that uh creates a query

[1:08:08] hey I'm looking for this kind of stuff

[1:08:10] um I'm a vowel I'm in the eighth position I'm

[1:08:12] looking for any consonant at positions

[1:08:14] up to four and then all the nodes get to

[1:08:18] emit keys and maybe one of the channels

[1:08:20] could be I am a I am a consonant and I

[1:08:23] am in a position up to four and that

[1:08:25] that key would have a high number in

[1:08:27] that specific Channel and that's how the

[1:08:29] query and the key when they dot product

[1:08:31] they can find each other and create a

[1:08:33] high affinity and when they have a high

[1:08:35] Affinity like say uh this token was

[1:08:38] pretty interesting to uh to this eighth

[1:08:41] token when they have a high Affinity

[1:08:43] then through the softmax I will end up

[1:08:45] aggregating a lot of its information

[1:08:47] into my position and so I'll get to

[1:08:49] learn a lot about

[1:08:51] it now we're looking at wei

[1:08:55] after this has already happened um let

[1:08:59] me erase this operation as well so let

[1:09:01] me erase the masking and the softmax

[1:09:03] just to show you the under the hood

[1:09:04] internals and how that works so without

[1:09:07] the masking and the softmax wei comes

[1:09:09] out like this right this is the output

[1:09:11] of the dot products um and these are the

[1:09:14] raw outputs and they take on values from

[1:09:15] negative you know two to positive two

[1:09:18] Etc so that's the raw interactions and

[1:09:21] raw affinities between all the nodes but

[1:09:24] now if I'm a fifth node I

[1:09:26] will not want to aggregate anything from

[1:09:28] the sixth node seventh node and the

[1:09:30] eighth node so actually we use the upper

[1:09:32] triangular masking so those are not

[1:09:35] allowed to

[1:09:37] communicate and now we actually want to

[1:09:40] have a nice uh distribution uh so we

[1:09:42] don't want to aggregate negative .11 of

[1:09:45] this node that's crazy so instead we

[1:09:47] exponentiate and normalize and now we

[1:09:49] get a nice distribution that sums to one

[1:09:51] and this is telling us now in the data

[1:09:52] dependent manner how much of information

[1:09:54] to aggregate from any of these tokens in

[1:09:56] the

[1:09:58] past so that's wei and it's not zeros

[1:10:01] anymore but it's calculated in this

[1:10:04] way now there's one more uh part to a

[1:10:08] single self attention head and that is

[1:10:10] that when we do the aggregation we don't

[1:10:12] actually aggregate the tokens exactly we

[1:10:15] aggregate we produce one more value here

[1:10:17] and we call that the

[1:10:20] value so in the same way that we

[1:10:22] produced key and query we're also going to

[1:10:23] create a value

[1:10:26] and

[1:10:26] then here we don't

[1:10:30] aggregate X we calculate a v which is

[1:10:34] just achieved by uh propagating this

[1:10:37] linear on top of X again and then we

[1:10:40] output wei multiplied by V so V is the

[1:10:44] elements that we aggregate or the the

[1:10:46] vectors that we aggregate instead of the

[1:10:47] raw

[1:10:48] X and now of course uh this will make it

[1:10:51] so that the output here of this single

[1:10:53] head will be 16 dimensional because that

[1:10:55] is the head

[1:10:57] size so you can think of X as kind of

[1:10:59] like private information to this token

[1:11:01] if you if you think about it that way so

[1:11:03] X is kind of private to this token so

[1:11:06] I'm a fifth token and I have

[1:11:08] some identity and uh my information is

[1:11:11] kept in Vector X and now for the

[1:11:14] purposes of the single head here's what

[1:11:16] I'm interested in here's what I have and

[1:11:20] if you find me interesting here's what I

[1:11:21] will communicate to you and that's

[1:11:23] stored in v and so V is the thing that

[1:11:26] gets aggregated for the purposes of this

[1:11:28] single head between the different

[1:11:30] nodes and that's uh basically the self

[1:11:34] attention mechanism this is this is what

[1:11:36] it does there are a few notes that I

[1:11:39] would like to make about attention

[1:11:41] number one attention is a communication

[1:11:44] mechanism you can really think about it

[1:11:46] as a communication mechanism where you

[1:11:48] have a number of nodes in a directed

[1:11:50] graph where basically you have edges

[1:11:52] pointed between nodes like

[1:11:53] this and what happens is every node has

[1:11:56] some Vector of information and it gets

[1:11:58] to aggregate information via a weighted

[1:12:01] sum from all of the nodes that point to

[1:12:03] it and this is done in a data dependent

[1:12:06] manner so depending on whatever data is

[1:12:08] actually stored at each node at

[1:12:09] any point in time now our graph doesn't

[1:12:13] look like this our graph has a different

[1:12:15] structure we have eight nodes because

[1:12:17] the block size is eight and there's

[1:12:18] always eight

[1:12:20] tokens and uh the first node is only

[1:12:23] pointed to by itself the second node is

[1:12:25] pointed to by the first node and itself

[1:12:27] all the way up to the eighth node which

[1:12:29] is pointed to by all the previous nodes

[1:12:32] and itself and so that's the structure

[1:12:34] that our directed graph has or happens

[1:12:37] to have in an autoregressive sort

[1:12:38] of scenario like language modeling but

[1:12:41] in principle attention can be applied to

[1:12:42] any arbitrary directed graph and it's

[1:12:44] just a communication mechanism between

[1:12:46] the nodes the second note is that notice

[1:12:48] that there is no notion of space so

[1:12:51] attention simply acts over like a set of

[1:12:53] vectors in this graph and so by default

[1:12:56] these nodes have no idea where they are

[1:12:58] positioned in the space and that's why

[1:12:59] we need to encode them positionally and

[1:13:02] sort of give them some information that

[1:13:03] is anchored to a specific position so

[1:13:05] that they sort of know where they are

[1:13:08] and this is different than for example

[1:13:09] from convolution because if you run

[1:13:11] for example a convolution operation over

[1:13:13] some input there's a very specific sort

[1:13:15] of layout of the information in space

[1:13:18] and the convolutional filters sort of

[1:13:20] act in space and so it's not like

[1:13:23] attention; in attention there is just a set

[1:13:26] of vectors out there in space they

[1:13:27] communicate and if you want them to have

[1:13:29] a notion of space you need to

[1:13:31] specifically add it which is what we've

[1:13:33] done when we calculated the

[1:13:36] positional encodings and

[1:13:38] added that information to the vectors

[1:13:40] the next thing that I hope is very clear

[1:13:41] is that the elements across the batch

[1:13:43] Dimension which are independent examples

[1:13:45] never talk to each other they're always

[1:13:47] processed independently and this is a

[1:13:49] batched matrix multiply that applies

[1:13:51] basically a matrix multiplication uh

[1:13:53] kind of in parallel across the batch

[1:13:54] dimension so maybe it would be more

[1:13:56] accurate to say that in this analogy of

[1:13:58] a directed graph we really have because

[1:14:00] the batch size is four we really have

[1:14:03] four separate pools of eight nodes and

[1:14:05] those eight nodes only talk to each

[1:14:07] other but in total there's like 32 nodes

[1:14:08] that are being processed uh but there's

[1:14:11] um sort of four separate pools of eight

[1:14:13] you can look at it that way the next

[1:14:15] note is that here in the case of

[1:14:18] language modeling uh we have this

[1:14:20] specific uh structure of directed graph

[1:14:22] where the future tokens will not

[1:14:24] communicate to the Past tokens but this

[1:14:27] doesn't necessarily have to be the

[1:14:28] constraint in the general case and in

[1:14:30] fact in many cases you may want to have

[1:14:32] all of the uh nodes talk to each other uh

[1:14:35] fully so as an example if you're doing

[1:14:37] sentiment analysis or something like

[1:14:38] that with a Transformer you might have a

[1:14:40] number of tokens and you may want to

[1:14:42] have them all talk to each other fully

[1:14:45] because later you are predicting for

[1:14:46] example the sentiment of the sentence

[1:14:49] and so it's okay for these nodes to talk

[1:14:50] to each other and so in those cases you

[1:14:53] will use an encoder block of self

[1:14:55] attention and uh all it means for it to be

[1:14:58] an encoder block is that you will delete

[1:15:00] this line of code allowing all the nodes

[1:15:02] to completely talk to each other what

[1:15:04] we're implementing here is sometimes

[1:15:06] called a decoder block and it's called a

[1:15:09] decoder because it is sort of like a

[1:15:12] decoding language and it's got this

[1:15:15] autoregressive format where you have

[1:15:17] to mask with the Triangular Matrix so

[1:15:19] that uh nodes from the future never talk

[1:15:22] to the Past because they would give away

[1:15:24] the answer

[1:15:25] and so basically in encoder blocks you

[1:15:27] would delete this allow all the nodes to

[1:15:29] talk in decoder blocks this will always

[1:15:31] be present so that you have this

[1:15:33] triangular structure uh but both are

[1:15:35] allowed and attention doesn't care

[1:15:36] attention supports arbitrary

[1:15:38] connectivity between nodes the next

[1:15:40] thing I wanted to comment on is you keep

[1:15:41] hearing me say attention

[1:15:43] self attention Etc there's actually also

[1:15:45] something called cross attention what is

[1:15:47] the

[1:15:47] difference

[1:15:49] so basically the reason this attention

[1:15:52] is self attention is because because the

[1:15:55] keys queries and the values are all

[1:15:57] coming from the same Source from X so

[1:16:01] the same Source X produces Keys queries

[1:16:03] and values so these nodes are self

[1:16:05] attending but in principle attention is

[1:16:08] much more General than that so for

[1:16:10] example in encoder-decoder Transformers

[1:16:12] uh you can have a case where the queries

[1:16:15] are produced from X but the keys and the

[1:16:17] values come from a whole separate

[1:16:18] external source and sometimes from uh

[1:16:21] encoder blocks that encode some context

[1:16:23] that we'd like to condition on
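
The distinction can be sketched as a single head whose queries come from `x` while keys and values come from a separate tensor. This is an illustrative PyTorch sketch, not code from the video; the class name `CrossAttentionHead`, the argument name `context`, and all sizes are assumptions. No causal mask is applied, since the side context is assumed to be fully visible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionHead(nn.Module):
    """One attention head: queries from x, keys/values from `context`."""

    def __init__(self, n_embd: int, head_size: int):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x, context):
        q = self.query(x)          # (B, T, hs) from our own nodes
        k = self.key(context)      # (B, S, hs) from the side nodes
        v = self.value(context)    # (B, S, hs) from the side nodes
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, S)
        wei = F.softmax(wei, dim=-1)
        return wei @ v             # (B, T, hs)

torch.manual_seed(0)
head = CrossAttentionHead(n_embd=32, head_size=16)
x = torch.randn(2, 8, 32)      # our nodes (e.g. decoder tokens)
side = torch.randn(2, 5, 32)   # separate source (e.g. encoder output)
out_cross = head(x, side)      # cross-attention: read information off the side
out_self = head(x, x)          # self-attention is the special case context == x
print(out_cross.shape, out_self.shape)
```

Note that the output always has one row per query node, regardless of how many side nodes there are.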

[1:16:25] So the keys and the values will

[1:16:26] actually come from a whole separate

[1:16:28] source. Those are nodes on the side, and

[1:16:31] here we're just producing queries and

[1:16:32] we're reading off information from the

[1:16:34] side. So cross-attention is used when

[1:16:37] there's a separate source of nodes we'd

[1:16:40] like to pull information from into our

[1:16:42] nodes, and it's self-attention if we just

[1:16:45] have nodes that would like to look at

[1:16:46] each other and talk to each other. So

[1:16:48] this attention here happens to be self-attention,

[1:16:51] but in principle attention

[1:16:55] is a lot more general. Okay, and the last

[1:16:57] note at this stage is, if we come to the

[1:16:59] "Attention Is All You Need" paper here, we've

[1:17:01] already implemented attention: given

[1:17:03] query, key, and value, we've multiplied

[1:17:06] the query and the key, we've softmaxed it,

[1:17:09] and then we are aggregating the values.

[1:17:11] There's one more thing that we're

[1:17:12] missing here, which is the scaling by

[1:17:13] one over the square root of the head size. The

[1:17:16] d_k here is the head size. Why are they

[1:17:18] doing this? Why is this important? They

[1:17:21] call it scaled attention, and it's

[1:17:24] kind of like an important normalization

[1:17:25] to basically

[1:17:26] have. The problem is, if you have unit Gaussian

[1:17:29] inputs, so zero mean, unit variance, k

[1:17:32] and q are unit Gaussian, then if you just

[1:17:34] compute wei naively, you see that the

[1:17:37] variance of wei will be

[1:17:38] on the order of head size, which in our

[1:17:40] case is 16. But if you multiply by one

[1:17:43] over the square root of head size (so this is

[1:17:45] square root, and this is one

[1:17:47] over), then the variance of wei will be one,

[1:17:50] so it will be

[1:17:52] preserved. Now why is this important?

[1:17:54] You'll notice that wei

[1:17:56] here will feed into

[1:17:58] softmax, and so it's really important,

[1:18:00] especially at initialization, that wei be

[1:18:03] fairly diffuse. In our case here we

[1:18:06] sort of lucked out and had

[1:18:10] fairly diffuse numbers here, like

[1:18:13] this. Now the problem is that, because of

[1:18:15] softmax, if wei takes on very positive

[1:18:18] and very negative numbers inside it,

[1:18:20] softmax will actually converge towards

[1:18:22] one-hot vectors. I can illustrate

[1:18:25] that here. Say we are applying softmax

[1:18:29] to a tensor of values that are very

[1:18:31] close to zero; then we're going to get a

[1:18:33] diffuse thing out of

[1:18:34] softmax. But the moment I take the exact

[1:18:36] same thing and I start sharpening it,

[1:18:38] making it bigger by multiplying these

[1:18:40] numbers by eight, for example, you'll see

[1:18:42] that the softmax will start to sharpen,

[1:18:44] and in fact it will sharpen towards the

[1:18:46] max: it will sharpen towards whatever

[1:18:48] number here is the highest. So

[1:18:51] basically we don't want these values to

[1:18:52] be too extreme, especially at

[1:18:53] initialization; otherwise softmax will be

[1:18:55] way too peaky, and you're basically

[1:18:58] aggregating information from a

[1:19:01] single node: every node just aggregates

[1:19:03] information from a single other node.
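
Both effects can be checked numerically. The sketch below assumes PyTorch; the batch size, sequence length, and the example logits are illustrative choices rather than values confirmed by the transcript. It first compares the variance of the raw and scaled affinities, then shows softmax sharpening as its inputs grow.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 32, 8, 16
k = torch.randn(B, T, head_size)   # unit Gaussian keys
q = torch.randn(B, T, head_size)   # unit Gaussian queries

wei_raw = q @ k.transpose(-2, -1)          # variance on the order of head_size
wei_scaled = wei_raw * head_size ** -0.5   # variance on the order of 1
print(wei_raw.var().item(), wei_scaled.var().item())

# Softmax sharpens toward the max as its inputs grow in magnitude.
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
diffuse = torch.softmax(logits, dim=-1)
peaky = torch.softmax(logits * 8, dim=-1)
print(diffuse)  # fairly even weights
print(peaky)    # most of the mass lands on the largest entry
```

Since each affinity is a sum of `head_size` products of independent unit Gaussians, its variance is about `head_size`, and multiplying by `head_size ** -0.5` divides that variance by `head_size`.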

[1:19:04] That's not what we want, especially at

[1:19:06] initialization, and so the scaling is

[1:19:08] used just to control the variance at

[1:19:11] initialization. Okay, so having said all

[1:19:13] that, let's now take our self-attention

[1:19:15] knowledge and take it for a

[1:19:17] spin. Here in the code I created this

[1:19:19] Head module, and it implements a single

[1:19:22] head of self-attention. You give it a

[1:19:24] head size, and then here it creates the

[1:19:26] key, query, and value linear layers.

[1:19:29] Typically people don't use biases in

[1:19:31] these. So those are the linear

[1:19:33] projections that we're going to apply to

[1:19:34] all of our nodes. Now here I'm creating

[1:19:37] this tril variable. tril is not a

[1:19:40] parameter of the module, so in

[1:19:41] PyTorch naming conventions this is

[1:19:43] called a buffer. It's not a parameter, and

[1:19:46] you have to assign

[1:19:47] it to the module using register_buffer.

[1:19:49] So that creates the tril, the

[1:19:52] lower triangular matrix. Then we're given

[1:19:55] the input x, and this should look very

[1:19:56] familiar now: we calculate the keys and the

[1:19:58] queries, we calculate the attention

[1:20:00] scores inside wei, we normalize it, so

[1:20:03] we're using scaled attention here, then

[1:20:06] we make sure that the future doesn't

[1:20:08] communicate with the past, so this makes

[1:20:10] it a decoder block, and then softmax, and

[1:20:13] then aggregate the values and

[1:20:15] output. Then here in the language model

[1:20:17] I'm creating a head in the constructor

[1:20:20] and I'm calling it self-attention head,

[1:20:22] and the head size I'm going to keep

[1:20:24] the same as n_embd, just for

[1:20:27] now. Then here, once we've encoded the

[1:20:31] information with the token embeddings

[1:20:32] and the position embeddings, we're simply

[1:20:34] going to feed it into the self-attention

[1:20:36] head, and then the output of that is

[1:20:38] going to go into the decoder language

[1:20:42] modeling head and create the logits. So

[1:20:44] this is sort of the simplest way to

[1:20:46] plug a self-attention component

[1:20:49] into our network right now. I had to make

[1:20:51] one more change, which is that here in

[1:20:55] generate we have to make sure

[1:20:57] that the idx that we feed into the model,

[1:21:01] because now we're using positional

[1:21:02] embeddings, never has more than

[1:21:04] block_size tokens coming in, because if idx is

[1:21:07] more than block_size then our position

[1:21:09] embedding table is going to run out of

[1:21:11] scope, because it only has embeddings for

[1:21:12] up to block_size. So therefore I

[1:21:15] added some code here to crop the

[1:21:17] context that we're going to feed into

[1:21:20] self, so that we never pass in more

[1:21:23] than block_size elements.
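
A minimal sketch of such a head, assuming PyTorch and following the steps just described (bias-free key, query, and value projections, a `tril` buffer, scaled scores, causal mask, softmax, aggregation). The sizes `n_embd = 32` and `block_size = 8` and the vocabulary size 65 used for the dummy `idx` are illustrative; the last lines show one way the context could be cropped during generation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8  # small illustrative values

class Head(nn.Module):
    """One head of causal (decoder-style) self-attention."""

    def __init__(self, head_size: int):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Buffer, not a parameter: stored on the module but never trained.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # scaled scores (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # hide the future
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                         # (B, T, head_size)

torch.manual_seed(0)
head = Head(head_size=n_embd)
out = head(torch.randn(4, block_size, n_embd))

# Cropping for generation: keep only the last block_size tokens of the context.
idx = torch.randint(0, 65, (1, 20))   # a running context longer than block_size
idx_cond = idx[:, -block_size:]
print(out.shape, idx_cond.shape)
```

Slicing `self.tril[:T, :T]` lets the same head handle contexts shorter than `block_size`, which is what happens early in generation.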

[1:21:25] So those are the changes, and let's now

[1:21:27] train the network. Okay, so I also came up

[1:21:29] to the script here and I decreased the

[1:21:30] learning rate, because

[1:21:32] self-attention can't tolerate very, very high

[1:21:34] learning rates. Then I also increased the

[1:21:36] number of iterations, because the

[1:21:37] learning rate is lower, and then I

[1:21:39] trained it. Previously we were only

[1:21:41] able to get down to 2.5, and now we are

[1:21:43] down to 2.4, so we definitely see a

[1:21:46] little bit of an improvement from 2.5 to

[1:21:48] 2.4 roughly, but the text is still not

[1:21:51] amazing. So clearly the self-attention

[1:21:53] head is doing some useful communication,

[1:21:56] but we still have a long way to go.

[1:21:59] Okay, so now we've implemented scaled

[1:22:01] dot-product attention. Next up, in the

[1:22:02] "Attention Is All You Need" paper, there's

[1:22:05] something called multi-head attention.

[1:22:07] And what is multi-head attention? It's

[1:22:09] just applying multiple attentions in

[1:22:11] parallel and concatenating their results.

[1:22:13] They have a little bit of a diagram

[1:22:15] here; I don't know if this is super clear.

[1:22:18] It's really just multiple attentions in

[1:22:20] parallel. So let's implement that; fairly

[1:22:23] straightforward.
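
A minimal sketch of multi-head attention in this sense, assuming PyTorch: several independent causal heads whose outputs are concatenated over the channel dimension. The compact `Head` is repeated here so the block runs on its own, and the sizes (`n_embd = 32`, `block_size = 8`, four heads of size 8) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8  # illustrative sizes

class Head(nn.Module):
    """One head of causal self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Several independent heads; outputs concatenated over channels."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # Each head returns (B, T, head_size); concatenate on the last dim.
        return torch.cat([h(x) for h in self.heads], dim=-1)

torch.manual_seed(0)
sa_heads = MultiHeadAttention(num_heads=4, head_size=n_embd // 4)  # 4 heads of size 8
out = sa_heads(torch.randn(2, block_size, n_embd))
print(out.shape)  # channels: 4 * 8 = 32, matching n_embd
```

The list comprehension evaluates the heads one after another; they are "parallel" in the sense that none depends on another's output.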

[1:22:25] If we want multi-head attention, then

[1:22:27] we want multiple heads of self-attention

[1:22:28] running in parallel. In PyTorch we can

[1:22:32] do this by simply creating multiple

[1:22:35] heads: however many

[1:22:38] heads you want, and then what is the head

[1:22:39] size of each. Then we run all of them

[1:22:43] in parallel into a list and simply

[1:22:46] concatenate all of the outputs, and we're

[1:22:48] concatenating over the channel

[1:22:50] dimension. So the way this looks now is

[1:22:53] we don't have just a single attention

[1:22:56] that has a head size of 32 (because

[1:22:59] remember n_embd is

[1:23:00] 32). Instead of having one communication

[1:23:03] channel, we now have four communication

[1:23:06] channels in parallel, and each one of

[1:23:08] these communication channels typically

[1:23:10] will be correspondingly smaller. So

[1:23:14] because we have four communication

[1:23:15] channels, we want 8-dimensional

[1:23:18] self-attention, and so from each communication

[1:23:20] channel we're going to gather

[1:23:22] 8-dimensional vectors, and then we have

[1:23:23] four of them, and that concatenates to

[1:23:25] give us 32, which is the original

[1:23:28] n_embd. This is kind of similar to,

[1:23:30] if you're familiar with convolutions,

[1:23:32] a group convolution,

[1:23:34] because basically instead of having

[1:23:36] one large convolution we do convolution

[1:23:38] in groups, and that's multi-headed

[1:23:40] self-attention.

[1:23:41] And so then here we just use

[1:23:44] sa_heads, self-attention heads, instead.

[1:23:47] Now I actually ran it, and scrolling

[1:23:51] down, I ran the same thing and we

[1:23:53] now get this down to 2.28 roughly.

[1:23:57] The generation is

[1:23:58] still not amazing, but clearly the

[1:24:00] validation loss is improving, because we

[1:24:02] were at 2.4 just now. So it helps to

[1:24:05] have multiple communication channels,

[1:24:07] because obviously these tokens have a

[1:24:09] lot to talk about: they want to find the

[1:24:11] consonants, the vowels, they want to find

[1:24:13] the vowels just from certain positions,

[1:24:15] they want to find all kinds of

[1:24:17] different things. So it helps to

[1:24:19] create multiple independent channels of

[1:24:20] communication, gather lots of different

[1:24:22] types of data, and then decode the

[1:24:25] output. Now going back to the paper for a

[1:24:27] second: of course I didn't explain this

[1:24:28] figure in full detail, but we are

[1:24:30] starting to see some components of what

[1:24:32] we've already implemented. We have the

[1:24:33] positional encodings and the token encodings

[1:24:35] that add, and we have the masked multi-headed

[1:24:37] attention implemented. Now here's another

[1:24:41] multi-headed attention, which is a

[1:24:42] cross-attention to an encoder, which

[1:24:45] we're not going to implement in this

[1:24:46] case. I'm going to come back to that

[1:24:48] later. But I want you to notice that

[1:24:50] there's a feed forward part here, and

[1:24:52] then this is grouped into a block that

[1:24:53] gets repeated again and again. Now the

[1:24:56] feed forward part here is just a simple

[1:24:57] multi-layer perceptron.
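
A minimal sketch of a per-token feed forward layer of the kind introduced in this part of the video (a linear layer followed by a ReLU), assuming PyTorch. The layer width and tensor sizes are illustrative. The final check shows that each position is processed independently of the others.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """A linear layer followed by a ReLU, applied to each token on its own."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU())

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
ffwd = FeedForward(32)
x = torch.randn(2, 8, 32)            # (batch, time, channels)
out = ffwd(x)

# Per-token: processing one position alone gives the same answer as
# processing it as part of the full sequence.
alone = ffwd(x[:, 3:4, :])
print(torch.allclose(out[:, 3:4, :], alone, atol=1e-5))
```

This is the contrast with attention: attention mixes information across positions, while this layer only transforms the channels of each position.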

[1:25:00] So here, the

[1:25:04] position-wise feed forward network is just a

[1:25:06] simple little MLP. I want to start,

[1:25:08] in a similar fashion,

[1:25:10] adding computation into the network, and

[1:25:13] this computation is on a per-node level.

[1:25:16] I've already implemented it, and you

[1:25:18] can see the diff highlighted on the left

[1:25:20] here where I've added or changed things.

[1:25:22] Before, we had the multi-headed

[1:25:25] self-attention that did the

[1:25:26] communication, but we went way too fast

[1:25:28] to calculate the logits: the tokens

[1:25:31] looked at each other but didn't really

[1:25:32] have a lot of time to think on what they

[1:25:35] found from the other tokens. So what

[1:25:38] I've implemented here is a little

[1:25:40] feed forward single layer, and this little

[1:25:42] layer is just a linear followed by a ReLU

[1:25:45] nonlinearity, and that's it. So

[1:25:48] it's just a little layer, and then I call

[1:25:50] it feed

[1:25:52] forward, taking n_embd,

[1:25:54] and then this feed forward is just

[1:25:56] called sequentially right after the

[1:25:58] self-attention. So we self-attend, then we feed

[1:26:01] forward. You'll notice that in the feed

[1:26:02] forward here, when it's applying linear,

[1:26:04] this is on a per-token level: all the

[1:26:06] tokens do this independently. So the

[1:26:09] self-attention is the communication, and then

[1:26:11] once they've gathered all the data, now

[1:26:13] they need to think on that data

[1:26:15] individually, and that's what feed

[1:26:16] forward is doing, and that's why I've

[1:26:18] added it here. Now when I train this, the

[1:26:21] validation loss actually continues to go

[1:26:23] down, now to 2.24, which is down from

[1:26:26] 2.28. The outputs still look kind of

[1:26:28] terrible, but at least we've improved the

[1:26:31] situation. As a preview, we're

[1:26:34] going to now start to intersperse the

[1:26:37] communication with the computation, and

[1:26:39] that's also what the Transformer does

[1:26:42] when it has blocks that communicate and

[1:26:44] then compute, and it groups them and

[1:26:46] replicates them. Okay, so let me show you

[1:26:49] what we'd like to do. We'd like to do

[1:26:51] something like this: we have a block, and

[1:26:53] this block is basically this part

[1:26:55] here, except for the cross-attention.

[1:26:57] Now the block basically

[1:26:59] intersperses communication and then

[1:27:01] computation. The

[1:27:03] communication is done using multi-headed

[1:27:05] self-attention, and then the

[1:27:07] computation is done using a feed forward

[1:27:08] network on all the tokens

[1:27:11] independently. Now what I've added here

[1:27:14] also, you'll

[1:27:16] notice, is that this takes the

[1:27:18] embedding dimension

[1:27:19] and the number of heads that we would like,

[1:27:21] which is kind of like group size in

[1:27:22] group convolution. I'm saying

[1:27:24] that the number of heads we'd like is four,

[1:27:26] and so because this is 32, we calculate

[1:27:29] that, with the number of

[1:27:31] heads being four, the head size

[1:27:34] should be eight, so that everything sort

[1:27:36] of works out channel-wise. This is

[1:27:39] how the Transformer structures

[1:27:41] the sizes typically. So the

[1:27:44] head size will become eight, and then

[1:27:45] this is how we want to intersperse them.

[1:27:47] Then here I'm trying to create

[1:27:49] blocks, which is just a sequential

[1:27:51] application of block, block, block, so that

[1:27:53] we're interspersing communication and feed

[1:27:55] forward many, many times, and then finally

[1:27:57] we decode. Now I actually tried to run

[1:28:01] this, and the problem is this doesn't

[1:28:02] actually give a

[1:28:05] very good result, and the reason for that

[1:28:07] is we're starting to actually get

[1:28:09] a pretty deep neural net, and deep

[1:28:11] neural nets suffer from optimization

[1:28:13] issues, and I think that's what we're

[1:28:14] slightly starting to run

[1:28:16] into. So we need one more idea that we

[1:28:18] can borrow from the Transformer paper

[1:28:21] to resolve those difficulties. Now there

[1:28:23] are two optimizations that dramatically

[1:28:25] help with the depth of these networks

[1:28:27] and make sure that the networks remain

[1:28:29] optimizable. Let's talk about the first

[1:28:31] one. The first one in this diagram is, you

[1:28:33] see this arrow here, and then this arrow

[1:28:36] and this arrow: those are skip

[1:28:38] connections, or sometimes called residual

[1:28:40] connections. They come from this paper,

[1:28:43] "Deep Residual Learning for Image

[1:28:44] Recognition", from about

[1:28:46] 2015, that introduced the concept. Now

[1:28:51] basically what it means is you

[1:28:53] transform data, but then you have a skip

[1:28:55] connection with addition from the

[1:28:57] previous features. Now the way I like to

[1:29:00] visualize it is the

[1:29:03] following. Here the computation happens

[1:29:05] from the top to bottom, and basically you

[1:29:08] have this residual pathway, and you

[1:29:11] are free to fork off from the residual

[1:29:13] pathway, perform some computation, and

[1:29:15] then project back to the residual

[1:29:16] pathway via addition. So you go from

[1:29:19] the inputs to the targets only

[1:29:22] via plus and plus and plus. The reason

[1:29:25] this is useful is because during

[1:29:27] backpropagation, remember from our micrograd

[1:29:29] video earlier, addition distributes

[1:29:32] gradients equally to both of its

[1:29:34] branches that fed in as the input. And

[1:29:37] so the supervision, or the gradients from

[1:29:40] the loss, basically hop through every

[1:29:43] addition node all the way to the input,

[1:29:46] and then also fork off into the residual

[1:29:50] blocks. But basically you have this

[1:29:52] gradient superhighway that goes

[1:29:53] directly from the supervision all the

[1:29:55] way to the input, unimpeded. And then

[1:29:58] these residual blocks are usually

[1:29:59] initialized in the beginning so that they

[1:30:01] contribute very, very little, if anything,

[1:30:03] to the residual pathway. They are

[1:30:05] initialized that way, so in the beginning

[1:30:07] they are sort of almost not

[1:30:09] there, but then during the optimization

[1:30:11] they come online over time and

[1:30:14] start to contribute. But at least at

[1:30:17] initialization you can go directly from

[1:30:19] supervision to the input; the gradient is

[1:30:21] unimpeded and just flows, and then the

[1:30:23] blocks over time

[1:30:24] kick in. So that dramatically helps

[1:30:27] with the optimization. So let's implement

[1:30:29] this. Coming back to our block here,

[1:30:31] basically what we want to do is we want

[1:30:33] to do x =

[1:30:35] x + self-attention and x = x + feed

[1:30:39] forward. So this is x, and then we fork

[1:30:43] off and do some communication and come

[1:30:45] back, and we fork off and we do some

[1:30:46] computation and come back. So those are

[1:30:49] residual connections. Then swinging

[1:30:51] back up here, we also have to introduce

[1:30:54] this projection, so nn.Linear,

[1:30:57] and this is going to be

[1:31:00] from, after we concatenate this, this is

[1:31:03] the size n_embd. So this is the output

[1:31:05] of the self-attention itself, but then we

[1:31:08] actually want to apply the

[1:31:11] projection, and that's the

[1:31:13] result. So the projection is just a

[1:31:15] linear transformation of the outcome of

[1:31:16] this

[1:31:17] layer. So that's the projection back into

[1:31:20] the residual pathway. Then here in

[1:31:22] feed forward it's going to be the

[1:31:23] same thing. I could have a self.proj

[1:31:26] here as well, but let me just

[1:31:28] simplify it and couple it

[1:31:32] inside the same sequential container. And

[1:31:34] so this is the projection layer going

[1:31:36] back into the residual

[1:31:38] pathway.
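
The fork-off, compute, project-back, add pattern can be sketched as a small wrapper, assuming PyTorch. The class name `ResidualBranch` is hypothetical, and zero-initializing the projection is an assumption made here to get a branch that contributes nothing at initialization; the transcript does not say how the video's layers are initialized. With that choice, a stack of such layers starts out as the identity and the gradient reaches the input unimpeded.

```python
import torch
import torch.nn as nn

class ResidualBranch(nn.Module):
    """x + proj(inner(x)): fork off, compute, project back, add."""

    def __init__(self, n_embd, inner):
        super().__init__()
        self.inner = inner
        self.proj = nn.Linear(n_embd, n_embd)
        # Assumption for illustration: zero-init the projection so the
        # branch contributes nothing to the residual pathway at first.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        return x + self.proj(self.inner(x))

torch.manual_seed(0)
n_embd = 32
layers = nn.Sequential(*[
    ResidualBranch(n_embd, nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU()))
    for _ in range(6)
])
x = torch.randn(4, n_embd, requires_grad=True)
y = layers(x)
y.sum().backward()
print(torch.equal(y, x))   # at init the whole stack is the identity
print(x.grad.unique())     # a gradient of 1 reaches every input element
```

During training the projection weights receive gradients and move away from zero, which is the "coming online over time" behavior described above.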

[1:31:40] So that's it. Now we

[1:31:43] can train this. I implemented one more

[1:31:44] small change: when you look into the

[1:31:47] paper again, you see that the

[1:31:49] dimensionality of input and output is

[1:31:51] 512 for them, and they're saying that the

[1:31:53] inner layer here in the feed forward has

[1:31:55] dimensionality of 2048, so there's a

[1:31:57] multiplier of four. So the inner

[1:32:00] layer of the feed forward network should

[1:32:02] be multiplied by four in terms of

[1:32:04] channel sizes. So I came here and

[1:32:06] multiplied n_embd by four here for the

[1:32:08] feed forward, and then from four times

[1:32:10] n_embd coming back down to n_embd when we go

[1:32:13] back to the projection. So

[1:32:15] adding a bit of computation here and

[1:32:17] growing that layer that is in the

[1:32:19] residual block on the side of the

[1:32:21] residual

[1:32:22] pathway. Then I train this and we

[1:32:24] actually get down all the way to 2.08

[1:32:27] validation loss, and we also see that the

[1:32:29] network is starting to get big enough

[1:32:30] that our train loss is getting ahead of

[1:32:32] validation loss, so we're starting to see

[1:32:33] a little bit of

[1:32:35] overfitting. And our

[1:32:38] generations here are still not

[1:32:41] amazing, but at least you see that we can

[1:32:42] see like "is here this now grief syn", like

[1:32:46] this starts to almost look like English.

[1:32:48] So yeah, we're starting to really get

[1:32:50] there. Okay, and the second innovation

[1:32:52] that is very helpful for optimizing very

[1:32:54] deep neural networks is right here. So we

[1:32:57] have this addition now, that's the

[1:32:58] residual part, but this "Norm" is referring

[1:33:00] to something called layer norm. Layer

[1:33:03] norm is implemented in PyTorch; it's a

[1:33:04] paper that came out a while back here.

[1:33:09] Layer norm is very, very similar

[1:33:11] to batch norm. So remember back to our

[1:33:14] makemore series, part three: we

[1:33:16] implemented batch

[1:33:17] normalization, and batch normalization

[1:33:19] basically just made sure that, across

[1:33:22] the batch dimension, any individual neuron

[1:33:25] had a unit Gaussian distribution, so it

[1:33:30] was zero mean and unit standard

[1:33:32] deviation output.

[1:33:35] So what I did here is I'm copy-pasting

[1:33:37] the BatchNorm1d that we developed in our

[1:33:39] makemore series, and see here we can

[1:33:42] initialize, for example, this module, and

[1:33:44] we can have a batch of 32

[1:33:47] 100-dimensional vectors feeding through the

[1:33:48] batch norm layer. What this does is it

[1:33:52] guarantees that when we look at just the

[1:33:54] zeroth column, it's zero mean, one

[1:33:58] standard deviation. So it's normalizing

[1:34:00] every single column of this input. Now

[1:34:04] the rows are not going to be

[1:34:06] normalized by default, because we're just

[1:34:08] normalizing columns. So let's now

[1:34:10] implement layer norm. It's very

[1:34:12] complicated. Look: we come here, we change

[1:34:15] this from zero to one, so we don't

[1:34:18] normalize the columns, we normalize the

[1:34:20] rows, and now we've implemented layer

[1:34:23] norm.
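
A minimal sketch of a from-scratch layer norm along these lines, assuming PyTorch. The class name `LayerNorm1d`, the `eps` default, and the seed are illustrative. It normalizes over dimension 1 (each row) rather than dimension 0 (each column), and it keeps no running statistics, so it behaves the same in training and at test time.

```python
import torch

class LayerNorm1d:
    """Normalizes each row (example) over its features; no running buffers."""

    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)   # scale (trainable in a real model)
        self.beta = torch.zeros(dim)   # shift (trainable in a real model)

    def __call__(self, x):
        xmean = x.mean(1, keepdim=True)   # dim 1 = per row; batch norm would use dim 0
        xvar = x.var(1, keepdim=True)
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        return self.gamma * xhat + self.beta

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100)   # batch of 32 vectors, 100 features each
out = module(x)
print(out[0, :].mean().item(), out[0, :].std().item())  # one row: ~0 mean, ~1 std
print(out[:, 0].mean().item(), out[:, 0].std().item())  # one column: not normalized
```

PyTorch's built-in `nn.LayerNorm` implements the same idea (it uses the biased variance estimate, so its numbers differ very slightly from this sketch).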

[1:34:25] So now the columns are not going to be

[1:34:28] normalized, but the rows are going to

[1:34:31] be normalized: for every individual

[1:34:33] example, its 100-dimensional vector is

[1:34:35] normalized in this way. And because

[1:34:38] our computation now does not span across

[1:34:40] examples, we can delete all of this

[1:34:43] buffers stuff, because we can

[1:34:45] always apply this operation and don't

[1:34:48] need to maintain any running buffers. So

[1:34:50] we don't need the

[1:34:52] buffers.

[1:34:54] There's no distinction between

[1:34:56] training and test

[1:34:58] time, and we don't need these running

[1:35:00] buffers. We do keep gamma and beta. We

[1:35:03] don't need the momentum, we don't care if

[1:35:05] it's training or not, and this is now a

[1:35:08] layer

[1:35:09] norm. It normalizes the rows instead

[1:35:12] of the columns, and this here is

[1:35:15] identical to basically this here. So

[1:35:19] let's now implement layer norm in our

[1:35:21] Transformer. Before I incorporate the

[1:35:23] layer norm, I just wanted to note that, as

[1:35:25] I said, very few details about the

[1:35:27] Transformer have changed in the last five

[1:35:28] years, but this is actually something

[1:35:30] that slightly departs from the original

[1:35:31] paper. You see that the Add & Norm is

[1:35:34] applied after the

[1:35:36] transformation, but now it is a bit

[1:35:40] more common to apply the

[1:35:42] layer norm before the transformation, so

[1:35:44] there's a reshuffling of the layer norms.

[1:35:46] This is called the pre-norm

[1:35:48] formulation, and that's the one that

[1:35:49] we're going to implement as well. So a

[1:35:50] slight deviation from the original paper.
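
A minimal sketch of a pre-norm block, assuming PyTorch. To keep it short, PyTorch's built-in `nn.MultiheadAttention` with a causal mask stands in for the hand-written multi-head attention from the video; it includes biases and its own output projection, so it is not identical to the video's module. The attribute names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block: communication, then computation."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        # Stand-in for the hand-written multi-head self-attention.
        self.sa = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # inner layer is 4x wider
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # projection back to the residual pathway
        )

    def forward(self, x):
        T = x.shape[1]
        # True above the diagonal means "may not attend" (future positions).
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)                                  # norm BEFORE attention
        attn_out, _ = self.sa(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                                 # residual connection
        x = x + self.ffwd(self.ln2(x))                   # norm BEFORE feed forward
        return x

torch.manual_seed(0)
block = Block(n_embd=32, n_head=4)
x = torch.randn(2, 8, 32)
out = block(x)
print(out.shape)
```

In the original post-norm arrangement the layer norm would instead wrap the sum, roughly `x = ln(x + sublayer(x))`; here the residual pathway itself is never normalized inside the block.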

[1:35:53] Basically we need two layer norms. Layer

[1:35:55] norm one is nn.LayerNorm, and we

[1:35:59] tell it what the

[1:36:01] embedding dimension is, and we need the

[1:36:03] second layer norm. Then here the

[1:36:06] layer norms are applied immediately on x:

[1:36:09] self.ln1 applied on x, and

[1:36:13] self.ln2 applied on x, before

[1:36:15] it goes into self-attention and feed

[1:36:18] forward. The size of the layer

[1:36:20] norm here is n_embd, so 32. So when the

[1:36:23] layer norm is normalizing our features,

[1:36:26] the normalization here

[1:36:30] happens, the mean and the variance are

[1:36:32] taken, over 32 numbers. So the batch and

[1:36:34] the time act as batch dimensions, both of

[1:36:37] them. So this is kind of like a per-token

[1:36:40] transformation that just normalizes

[1:36:42] the features and makes them

[1:36:46] unit Gaussian at

[1:36:48] initialization. But of course, because

[1:36:50] these layer norms inside have these

[1:36:52] gamma and beta trainable

[1:36:54] parameters, the layer norm will

[1:36:57] eventually create outputs that might not

[1:36:59] be unit Gaussian, but the optimization will

[1:37:01] determine that. So for now this is

[1:37:05] incorporating the layer norms,

[1:37:06] and let's train them. Okay, so I let it

[1:37:09] run and we see that we get down to 2.06,

[1:37:12] which is better than the previous 2.08,

[1:37:14] so a slight improvement by adding the

[1:37:15] layer norms, and I'd expect that they

[1:37:17] help even more if we had a bigger and

[1:37:19] deeper network. One more thing I forgot

[1:37:21] to add is that there should be a layer

[1:37:23] norm here also, typically at the end

[1:37:26] of the Transformer and right before the

[1:37:28] final linear layer that decodes into

[1:37:31] vocabulary, so I added that as well. At

[1:37:35] this stage we actually have a pretty

[1:37:36] complete Transformer according to the

[1:37:38] original paper, and it's a decoder-only

[1:37:40] Transformer. I'll talk about that in

[1:37:42] a second, but at this stage the

[1:37:44] major pieces are in place, so we can try

[1:37:46] to scale this up and see how well we can

[1:37:47] push this number. Now in order to scale

[1:37:50] up the model I had to perform some

[1:37:51] cosmetic changes here to make it nicer.

[1:37:54] I introduced this variable called

[1:37:56] n_layer, which just specifies how many

[1:37:57] layers of the blocks we're going to have.

[1:38:01] I created a bunch of blocks, and we have

[1:38:02] a new variable, number of heads, as well. I

[1:38:05] pulled out the layer norm here, and so

[1:38:07] this is identical. Now one thing that I

[1:38:10] did briefly change is I added dropout.

[1:38:13] Dropout is something that you can add

[1:38:15] right before the residual connection

[1:38:17] back, right before the connection back

[1:38:19] into the residual pathway. So we can drop

[1:38:22] out that as the last layer here, we can drop out

[1:38:26] here at the end of the multi-headed

[1:38:27] attention as well, and we can also drop

[1:38:30] out here when we calculate the

[1:38:34] affinities: after the

[1:38:36] softmax we can drop out some of those, so

[1:38:38] we can randomly prevent some of the

[1:38:40] nodes from

[1:38:41] communicating. Dropout comes

[1:38:43] from this paper from 2014 or so, and

[1:38:49] basically it takes your neural

[1:38:50] net and randomly, every forward

[1:38:53] backward pass, shuts off some subset of

[1:38:56] neurons, so randomly drops them to

[1:38:59] zero and trains without them. What

[1:39:02] this does effectively is, because the

[1:39:04] mask of what's being dropped out

[1:39:06] changes every single forward

[1:39:07] backward pass, it ends up kind of training an

[1:39:11] ensemble of sub-networks, and then at

[1:39:13] test time everything is fully enabled

[1:39:15] and all of those sub-networks

[1:39:16] are merged into a single ensemble, if you

[1:39:18] want to think about it that

[1:39:20] way. I would read the paper to get the

[1:39:22] full detail; for now we're just going to

[1:39:24] stay on the level of: this is a

[1:39:25] regularization technique, and I added it

[1:39:28] because I'm about to scale up the model

[1:39:30] quite a bit and I was concerned about

[1:39:32] overfitting. So now when we scroll up to

[1:39:34] the top, we'll see that I changed a

[1:39:36] number of hyperparameters here about

[1:39:38] our neural net. I made the batch size

[1:39:40] much larger, now it's 64. I changed the

[1:39:43] block size to be 256, so previously it

[1:39:46] was just eight, eight characters of

[1:39:47] context; now it is 256 characters of

[1:39:50] context to predict the 257th.

[1:39:54] I brought down the learning rate a

[1:39:55] little bit because the neural net is now

[1:39:57] much bigger, so I brought down the

[1:39:58] learning rate. The embedding dimension is
