What is ChatGPT?
This video provides a step-by-step tutorial on building a GPT-like Transformer language model from scratch using Python and PyTorch. The presenter explains the core concepts of the Transformer architecture, including self-attention, multi-head attention, and feed-forward networks, and implements them in code. The tutorial culminates in training a character-level language model on the Tiny Shakespeare dataset to generate Shakespeare-like text.
ChatGPT is a probabilistic system that generates text based on prompts. It is a language model that models sequences of tokens.
The Transformer architecture was introduced in the 2017 paper 'Attention is All You Need'. GPT stands for 'Generative Pre-trained Transformer'.
The tutorial uses the Tiny Shakespeare dataset (1MB, concatenated works of Shakespeare) to train a character-level language model.
The code is available in the nanoGPT repository on GitHub, consisting of two files (model.py and train.py) of about 300 lines each.
Character-level tokenization is used: each character is mapped to an integer. The vocabulary size is 65 characters.
Data is processed in chunks (blocks) of size block_size. Each chunk contains multiple training examples (predicting next character given context).
A simple bigram model is implemented first: it predicts the next character based solely on the current character using an embedding table.
Self-attention allows tokens to communicate with each other. It uses queries, keys, and values to compute weighted averages of past tokens.
Multiple self-attention heads run in parallel and their outputs are concatenated, allowing the model to attend to different types of information.
After self-attention, a feed-forward network (MLP) is applied per token. Residual connections and layer normalization help with training deep networks.
By increasing model size (embedding dimension 384, 6 heads, 6 layers, block size 256), the validation loss drops to 1.48, generating more coherent Shakespeare-like text.
The implemented model is a decoder-only Transformer (like GPT), suitable for unconditional text generation. The original Transformer paper used an encoder-decoder for translation.
Pre-training trains a language model on internet text. Fine-tuning (e.g., with reinforcement learning from human feedback) aligns the model to be an assistant like ChatGPT.
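The self-attention key point above (queries, keys, and values producing weighted averages of past tokens) can be sketched as a single causal attention head. This is a minimal illustration under stated assumptions, not the code from the video: the sizes `B`, `T`, `C`, and `head_size`, the seed, and the random input are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)           # hypothetical seed, only for repeatability
B, T, C = 2, 8, 32             # batch, time, channels (illustrative sizes)
head_size = 16                 # illustrative head size

x = torch.randn(B, T, C)       # stand-in for token representations

# Each token emits a key, a query, and a value via linear projections.
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k, q, v = key(x), query(x), value(x)           # each (B, T, head_size)

# Affinities between tokens, scaled by 1/sqrt(head_size) to control variance.
wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # (B, T, T)

# Causal mask: a token may only attend to itself and earlier positions.
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))
wei = F.softmax(wei, dim=-1)                   # each row sums to 1

out = wei @ v                                  # weighted average of values
print(out.shape)
```

Each row of `wei` is a probability distribution over the current and earlier positions, so `out` at position t depends only on tokens up to t.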
This tutorial successfully builds a decoder-only Transformer from scratch, demonstrating the core components of GPT. The final model, trained on Tiny Shakespeare, generates plausible Shakespeare-like text, illustrating the power of the Transformer architecture.
What does GPT stand for?
Generative Pre-trained Transformer
02:31
What is the name of the 2017 paper that introduced the Transformer architecture?
Attention is All You Need
02:17
What is the vocabulary size of the character-level tokenizer used in the tutorial?
65
09:04
What is the purpose of the 'block_size' hyperparameter?
It defines the maximum context length for predictions.
15:04
How does a bigram language model predict the next token?
It predicts based solely on the current token, using an embedding table.
24:12
What are the three vectors computed in self-attention?
Query, key, and value.
64:00
Why is the attention score divided by the square root of the head size?
To control variance and prevent softmax from becoming too peaky at initialization.
77:13
What is the difference between self-attention and cross-attention?
In self-attention, keys, queries, and values come from the same source. In cross-attention, queries come from one source and keys/values from another.
75:49
What is the purpose of residual connections in deep Transformers?
They allow gradients to flow directly from the loss to the input, improving optimization of deep networks.
88:31
What is the approximate number of parameters in the final model trained in the tutorial?
10 million
109:32
Transformer Origin
The 2017 paper 'Attention is All You Need' introduced the Transformer, which became the foundation for GPT.
02:17
Self-Attention Communication
Self-attention allows tokens to communicate with each other via queries, keys, and values, enabling context-aware predictions.
36:00
Multi-Head Attention
Multiple attention heads run in parallel, allowing the model to attend to different types of information simultaneously.
58:25
Residual Connections for Deep Networks
Residual connections enable training of deep Transformers by providing a gradient superhighway.
88:31
Pre-training vs Fine-tuning
Pre-training on internet text produces a document completer; fine-tuning aligns it to be an assistant like ChatGPT.
108:55
[00:00] hi everyone so by now you have probably
[00:02] heard of ChatGPT it has taken the world
[00:04] and AI Community by storm and it is a
[00:07] system that allows you to interact with
[00:09] an AI and give it text based tasks so
[00:12] for example we can ask ChatGPT to write
[00:15] us a small haiku about how important it is
[00:16] that people understand AI and then they
[00:18] can use it to improve the world and make
[00:20] it more prosperous so when we run this
[00:23] AI knowledge brings prosperity for all
[00:25] to see Embrace its
[00:27] power okay not bad and so you could see
[00:29] that ChatGPT went from left to right and
[00:32] generated all these words sort of
[00:35] sequentially now I asked it already the
[00:37] exact same prompt a little bit earlier
[00:39] and it generated a slightly different
[00:41] outcome AI's power to grow ignorance
[00:44] holds us back learn prosperity awaits
[00:47] so uh pretty good in both cases and
[00:49] slightly different so you can see that
[00:50] ChatGPT is a probabilistic system and
[00:52] for any one prompt it can give us
[00:54] multiple answers sort of uh replying to
[00:57] it now this is just one example of a
[00:59] prompt people have come up with many
[01:01] many examples and there are entire
[01:03] websites that index interactions with
[01:06] ChatGPT and so many of them are quite
[01:08] humorous explain HTML to me like I'm a
[01:10] dog uh write release notes for chess 2
[01:14] write a note about Elon Musk buying a
[01:16] Twitter and so on so as an example uh
[01:20] please write a breaking news article
[01:21] about a leaf falling from a
[01:23] tree uh and a shocking turn of events a
[01:26] leaf has fallen from a tree in the local
[01:28] park Witnesses report that the leaf
[01:30] which was previously attached to a
[01:31] branch of a tree detached itself and
[01:33] fell to the ground very dramatic so you
[01:36] can see that this is a pretty remarkable
[01:37] system and it is what we call a language
[01:40] model uh because it um it models the
[01:43] sequence of words or characters or
[01:46] tokens more generally and it knows how
[01:49] sort of words follow each other in
[01:50] English language and so from its
[01:52] perspective what it is doing is it is
[01:55] completing the sequence so I give it the
[01:57] start of a sequence and it completes the
[02:00] sequence with the outcome and so it's a
[02:02] language model in that sense now I would
[02:05] like to focus on the under the hood of
[02:07] um under the hood components of what
[02:09] makes ChatGPT work so what is the neural
[02:12] network under the hood that models the
[02:14] sequence of these words and that comes
[02:17] from this paper called attention is all
[02:19] you need in 2017 a landmark paper a
[02:23] landmark paper in AI that produced and
[02:25] proposed the Transformer
[02:27] architecture so GPT is uh short for
[02:31] generally generatively pre-trained
[02:33] Transformer so Transformer is the neural
[02:35] net that actually does all the heavy
[02:36] lifting under the hood it comes from
[02:39] this paper in 2017 now if you read this
[02:41] paper this uh reads like a pretty random
[02:44] machine translation paper and that's
[02:46] because I think the authors didn't fully
[02:47] anticipate the impact that the
[02:49] Transformer would have on the field and
[02:51] this architecture that they produced in
[02:52] the context of machine translation in
[02:54] their case actually ended up taking over
[02:57] uh the rest of AI in the next 5 years
[03:00] after and so this architecture with
[03:02] minor changes was copy pasted into a
[03:05] huge amount of applications in AI in
[03:07] more recent years and that includes at
[03:10] the core of ChatGPT now we are not
[03:13] going to what I'd like to do now is I'd
[03:15] like to build out something like ChatGPT
[03:17] but uh we're not going to be able to
[03:19] of course reproduce ChatGPT this is a
[03:21] very serious production grade system it
[03:23] is trained on uh a good chunk of
[03:26] internet and then there's a lot of uh
[03:29] pre-training and fine-tuning stages to
[03:31] it and so it's very complicated what I'd
[03:33] like to focus on is just to train a
[03:36] Transformer based language model and in
[03:38] our case it's going to be a character
[03:40] level language model I still think that
[03:43] is uh very educational with respect to
[03:45] how these systems work so I don't want
[03:47] to train on the chunk of Internet we
[03:48] need a smaller data set in this case I
[03:51] propose that we work with uh my favorite
[03:53] toy data set it's called tiny
[03:55] Shakespeare and um what it is is
[03:57] basically it's a concatenation of all of
[03:59] the works of Shakespeare in my
[04:00] understanding and so this is all of
[04:02] Shakespeare in a single file uh this
[04:05] file is about 1 megabyte and it's just all
[04:07] of
[04:08] Shakespeare and what we are going to do
[04:10] now is we're going to basically model
[04:12] how these characters uh follow each
[04:14] other so for example given a chunk of
[04:16] these characters like this uh given some
[04:19] context of characters in the past the
[04:22] Transformer neural network will look at
[04:24] the characters that I've highlighted and
[04:26] is going to predict that g is likely to
[04:28] come next in the sequence and it's going
[04:30] to do that because we're going to train
[04:31] that Transformer on Shakespeare and it's
[04:34] just going to try to produce uh
[04:36] character sequences that look like this
[04:39] and in that process is going to model
[04:40] all the patterns inside this data so
[04:43] once we've trained the system I'd just
[04:45] like to give you a preview we can
[04:47] generate infinite Shakespeare and of
[04:49] course it's a fake thing that looks kind
[04:51] of like
[04:53] Shakespeare
[04:55] um apologies for there's some Jank that
[04:59] I'm not able to resolve in in here but
[05:02] um you can see how this is going
[05:05] character by character and it's kind of
[05:07] like predicting Shakespeare like
[05:09] language so verily my Lord the sites
[05:12] have left the again the king coming with
[05:15] my curses with precious pale and then
[05:19] Tranio says something else etc and this
[05:21] is just coming out of the Transformer in
[05:23] a very similar manner as it would come
[05:25] out in ChatGPT in our case character by
[05:27] character in ChatGPT uh it's coming out
[05:31] on the token by token level and tokens
[05:33] are these sort of like little subword
[05:35] pieces so they're not Word level they're
[05:36] kind of like word chunk
[05:38] level um and now I've already written
[05:43] this entire code uh to train these
[05:45] Transformers um and it is in a GitHub
[05:48] repository that you can find and it's
[05:50] called nanoGPT
[05:51] so nanoGPT is a repository that
[05:54] you can find in my GitHub and it's a
[05:56] repository for training Transformers um
[05:59] on any given text and what I think is
[06:02] interesting about it because there's
[06:03] many ways to train Transformers but this
[06:05] is a very simple implementation so it's
[06:06] just two files of 300 lines of code each
[06:10] one file defines the GPT model the
[06:12] Transformer and one file trains it on
[06:14] some given Text data set and here I'm
[06:17] showing that if you train it on the
[06:18] OpenWebText data set which is a fairly
[06:20] large data set of web pages then I
[06:22] reproduce the the performance of
[06:25] GPT-2 so GPT-2 is an early version of OpenAI's
[06:29] GPT uh from 2017 if I recall
[06:32] correctly and I've only so far
[06:34] reproduced the the smallest 124 million
[06:36] parameter model uh but basically this is
[06:38] just proving that the codebase is
[06:39] correctly arranged and I'm able to load
[06:42] the uh neural network weights that OpenAI
[06:45] has released later so you can take a
[06:48] look at the finished code here in nanoGPT
[06:50] but what I would like to do in this
[06:51] lecture is I would like to basically uh
[06:55] write this repository from scratch so
[06:57] we're going to begin with an empty file
[06:59] and we're we're going to define a
[07:00] Transformer piece by piece we're going
[07:03] to train it on the tiny Shakespeare data
[07:05] set and we'll see how we can then uh
[07:08] generate infinite Shakespeare and of
[07:10] course this can copy paste to any
[07:12] arbitrary Text data set uh that you like
[07:14] uh but my goal really here is to just
[07:16] make you understand and appreciate uh
[07:18] how under the hood ChatGPT works and um
[07:22] really all that's required is a
[07:24] Proficiency in Python and uh some basic
[07:27] understanding of um calculus and
[07:29] statistics
[07:30] and it would help if you also see my
[07:32] previous videos on the same YouTube
[07:34] channel in particular my makemore
[07:35] series where I um define smaller and
[07:40] simpler neural network language models
[07:42] uh so multi-layer perceptrons and so on it
[07:45] really introduces the language modeling
[07:46] framework and then uh here in this video
[07:49] we're going to focus on the Transformer
[07:50] neural network itself okay so I created
[07:53] a new Google Colab uh Jupyter notebook here
[07:57] and this will allow me to later easily
[07:58] share this code that we're going to
[08:00] develop together uh with you so you can
[08:01] follow along so this will be in a video
[08:03] description uh later now here I've just
[08:07] done some preliminaries I downloaded the
[08:09] data set the tiny Shakespeare data set
[08:10] at this URL and you can see that it's
[08:12] about a 1 Megabyte file then here I open
[08:15] the input.txt file and just read in all
[08:17] the text as a string and we see that
[08:20] we are working with 1 million characters
[08:22] roughly and the first 1,000 characters
[08:24] if we just print them out are basically
[08:26] what you would expect this is the first
[08:28] 1,000 characters of the tiny Shakespeare
[08:30] data set roughly up to here so so far so
[08:34] good next we're going to take this text
[08:37] and the text is a sequence of characters
[08:39] in Python so when I call the set
[08:41] Constructor on it I'm just going to get
[08:44] the set of all the characters that occur
[08:46] in this text and then I call list on
[08:49] that to create a list of those
[08:51] characters instead of just a set so that
[08:53] I have an ordering an arbitrary ordering
[08:56] and then I sort that so basically we get
[08:59] just all the characters that occur in
[09:00] the entire data set and they're sorted
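The set, list, and sort steps just described can be sketched as follows. This is a minimal sketch, not the notebook's exact cell: the short `text` string is a stand-in assumption for the contents of input.txt, so the character count it prints will be much smaller than the 65 the lecture reports for the full data set.

```python
# Stand-in for the full Tiny Shakespeare text (assumption: any string works here).
text = "First Citizen:\nBefore we proceed any further, hear me speak."

# set() keeps each character once, list() gives it an ordering, sorted() fixes that ordering.
chars = sorted(list(set(text)))
vocab_size = len(chars)

print(repr("".join(chars)))  # the distinct characters, in sorted order
print(vocab_size)            # number of distinct characters in this sample
```

Sorting makes the character-to-integer assignment deterministic from run to run, which a bare `set` does not guarantee.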
[09:02] now the number of them is going to be
[09:04] our vocabulary size these are the
[09:06] possible elements of our sequences and
[09:09] we see that when I print here the
[09:11] characters there's 65 of them in total
[09:14] there's a space character and then all
[09:16] kinds of special characters and then
[09:19] capitals and lowercase letters so that's
[09:21] our vocabulary and that's the sort of
[09:23] like possible uh characters that the
[09:25] model can see or emit okay so next we
[09:29] would like to develop some strategy
[09:31] to tokenize the input text now when
[09:35] people say tokenize they mean convert
[09:36] the raw text as a string to some
[09:39] sequence of integers according to some
[09:41] uh codebook according to some vocabulary
[09:43] of possible elements so as an example
[09:46] here we are going to be building a
[09:48] character level language model so we're
[09:49] simply going to be translating
[09:50] individual characters into integers so
[09:53] let me show you uh a chunk of code that
[09:55] sort of does that for us so we're
[09:57] building both the encoder and the
[09:58] decoder
[10:00] and let me just talk through what's
[10:01] happening
[10:02] here when we encode an arbitrary text
[10:05] like hi there we're going to receive a
[10:08] list of integers that represents that
[10:10] string so for example 46 47 Etc and then
[10:14] we also have the reverse mapping so we
[10:17] can take this list and decode it to get
[10:20] back the exact same string so it's
[10:22] really just like a translation to
[10:24] integers and back for arbitrary string
[10:26] and for us it is done on a character
[10:28] level
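The character-level round trip described above can be sketched like this. It is a minimal sketch under assumptions: the tiny `text` corpus is hypothetical, so the integers it produces differ from the 46, 47 values read out in the lecture, which come from the full Tiny Shakespeare vocabulary.

```python
# Hypothetical stand-in corpus; the lecture builds the vocabulary from Tiny Shakespeare.
text = "hi there, this is a tiny stand-in corpus"
chars = sorted(list(set(text)))

# Lookup tables: character -> integer and integer -> character.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Translate a string into a list of integers, one per character."""
    return [stoi[c] for c in s]

def decode(ids):
    """Translate a list of integers back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hi there")
print(ids)
print(decode(ids))
```

Encoding a character that never appeared in the corpus raises a `KeyError`, which is one practical limitation of such a simple tokenizer.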
[10:30] now the way this was achieved is we just
[10:31] iterate over all the characters here and
[10:34] create a lookup table from the character
[10:35] to the integer and vice versa and then
[10:38] to encode some string we simply
[10:40] translate all the characters
[10:41] individually and to decode it back we
[10:44] use the reverse mapping and concatenate
[10:46] all of it now this is only one of many
[10:49] possible encodings or many possible sort
[10:51] of tokenizers and it's a very simple one
[10:54] but there's many other schemas that
[10:55] people have come up with in practice so
[10:57] for example Google uses SentencePiece
[10:59] uh so SentencePiece will also
[11:02] encode text into um integers but in a
[11:05] different schema and using a different
[11:08] vocabulary and SentencePiece is a
[11:10] subword uh sort of tokenizer and what
[11:13] that means is that um you're not
[11:15] encoding entire words but you're not
[11:17] also encoding individual characters it's
[11:19] it's a subword unit level and that's
[11:22] usually what's adopted in practice for
[11:24] example also openai has this Library
[11:26] called tick token that uses a bite pair
[11:28] encode
[11:29] tokenizer um and that's what GPT uses
[11:33] and you can also just encode words into
[11:35] like hell world into a list of integers
[11:38] so as an example I'm using the Tik token
[11:40] Library here I'm getting the encoding
[11:43] for gpt2 or that was used for gpt2
[11:46] instead of just having 65 possible
[11:48] characters or tokens they have 50,000
[11:51] tokens and so when they encode the exact
[11:54] same string hi there we only get a
[11:57] list of three integers but those
[11:59] integers are not between 0 and 64 they
[12:01] are between 0 and
[12:05] 50,256 so basically you can trade off the
[12:09] code book size and the sequence lengths
[12:12] so you can have very long sequences of
[12:13] integers with very small vocabularies or
[12:16] we can have short um sequences of
[12:20] integers with very large vocabularies
[12:23] and so typically people use in practice
[12:25] these subword encodings but I'd like to
[12:28] keep our tokenizer very simple so we're
[12:30] using character level tokenizer and that
[12:33] means that we have very small code books
[12:35] we have very simple encode and decode
[12:37] functions uh but we do get very long
[12:40] sequences as a result but that's the
[12:42] level at which we're going to stick with
[12:43] this lecture because it's the simplest
[12:45] thing okay so now that we have an
[12:46] encoder and a decoder effectively a
[12:49] tokenizer we can tokenize the entire
[12:51] training set of Shakespeare so here's a
[12:53] chunk of code that does that and I'm
[12:55] going to start to use the PyTorch
[12:56] library and specifically the
[12:58] torch.tensor from the PyTorch library so we're
[13:01] going to take all of the text in tiny
[13:03] Shakespeare encode it and then wrap it
[13:05] into a torch.tensor to get the data
[13:08] tensor so here's what the data tensor
[13:10] looks like when I look at just the first
[13:12] 1,000 characters or the 1,000 elements
[13:14] of it so we see that we have a massive
[13:16] sequence of integers and this sequence
[13:18] of integers here is basically an
[13:20] identical translation of the first
[13:22] 1,000 characters
[13:24] here so I believe for example that zero
[13:27] is a new line character and maybe one
[13:29] one is a space not 100% sure but from
[13:32] now on the entire data set of text is
[13:34] re-represented as just it's just
[13:35] stretched out as a single very large uh
[13:38] sequence of
[13:39] integers let me do one more thing before
[13:41] we move on here I'd like to separate out
[13:43] our data set into a train and a
[13:45] validation split so in particular we're
[13:48] going to take the first 90% of the data
[13:51] set and consider that to be the training
[13:52] data for the Transformer and we're going
[13:54] to withhold the last 10% at the end of
[13:56] it to be the validation data and this
[13:59] will help us understand to what extent
[14:01] our model is overfitting so we're going
[14:03] to basically hide and keep the
[14:04] validation data on the side because we
[14:06] don't want just a perfect memorization
[14:08] of this exact Shakespeare we want a
[14:11] neural network that sort of creates
[14:12] Shakespeare like uh text and so it
[14:15] should be fairly likely for it to
[14:17] produce the actual like stowed away uh
[14:21] true Shakespeare text um and so we're
[14:24] going to use this to uh get a sense of
[14:26] the overfitting okay so now we would
[14:28] like to start plugging these text
[14:30] sequences or integer sequences into the
[14:32] Transformer so that it can train and
[14:34] learn those patterns now the important
[14:36] thing to realize is we're never going to
[14:38] actually feed entire text into a
[14:40] Transformer all at once that would be
[14:42] computationally very expensive and
[14:44] prohibitive so when we actually train a
[14:46] Transformer on a lot of these data sets
[14:48] we only work with chunks of the data set
[14:50] and when we train the Transformer we
[14:52] basically sample random little chunks
[14:53] out of the training set and train on
[14:55] just chunks at a time and these chunks
[14:58] have basically some kind of a length and
[15:01] some maximum length now the maximum
[15:04] length typically at least in the code I
[15:06] usually write is called block size you
[15:08] can you can uh find it under different
[15:10] names like context length or something
[15:12] like that let's start with the block
[15:14] size of just eight and let me look at
[15:16] the first train data characters the
[15:18] first block size plus one characters
[15:20] I'll explain why plus one in a
[15:22] second so this is the first nine
[15:24] characters in the sequence in the
[15:27] training set now what I'd like to point
[15:30] out is that when you sample a chunk of
[15:31] data like this so say the these nine
[15:34] characters out of the training set this
[15:36] actually has multiple examples packed
[15:38] into it and uh that's because all of
[15:41] these characters follow each other and
[15:43] so what this thing is going to say when
[15:47] we plug it into a Transformer is we're
[15:49] going to actually simultaneously train
[15:50] it to make prediction at every one of
[15:52] these
[15:53] positions now in the in a chunk of nine
[15:56] characters there's actually eight individual
[15:58] examples packed in there so there's
[16:01] the example that when 18 when in the
[16:04] context of 18 47 likely comes next in a
[16:08] context of 18 and 47 56 comes next in a
[16:12] context of 18 47 56 57 can come next and
[16:16] so on so that's the eight individual
[16:18] examples let me actually spell it out
[16:20] with
[16:21] code so here's a chunk of code to
[16:24] illustrate X are the inputs to the
[16:26] Transformer it will just be the first
[16:28] block size characters y will be the uh
[16:32] next block size characters so it's
[16:34] offset by one and that's because y are
[16:37] the targets for each position in the
[16:40] input and then here I'm iterating over
[16:42] all the block size of eight and the
[16:45] context is always all the characters in
[16:47] x uh up to T and including T and the
[16:51] target is always the t-th character but
[16:53] in the targets array y so let me just
[16:56] run
[16:57] this and basically it spells out what I
[16:59] said in words uh these are the eight
[17:02] examples hidden in a chunk of nine
[17:04] characters that we uh sampled from the
[17:08] training set I want to mention one more
[17:11] thing we train on all the eight examples
[17:14] here with context between one all the
[17:16] way up to context of block size and we
[17:19] train on that not just for computational
[17:20] reasons because we happen to have the
[17:22] sequence already or something like that
[17:23] it's not just done for efficiency it's
[17:26] also done um to make the Transformer
[17:28] Network be used to seeing contexts all
[17:32] the way from as little as one all the
[17:33] way to block size and we'd like the
[17:36] Transformer to be used to seeing
[17:38] everything in between and that's going
[17:39] to be useful later during inference
[17:41] because while we're sampling we can
[17:43] start the sampling generation with as
[17:45] little as one character of context and
[17:47] the Transformer knows how to predict the
[17:49] next character with all the way up to
[17:51] just context of one and so then it can
[17:53] predict everything up to block size and
[17:55] after block size we have to start
[17:56] truncating because the Transformer will
[17:58] will never um receive more than block
[18:01] size inputs when it's predicting the
[18:03] next
[18:03] character Okay so we've looked at the
[18:06] time dimension of the tensors that are
[18:07] going to be feeding into the Transformer
[18:09] there's one more Dimension to care about
[18:11] and that is the batch Dimension and so
[18:13] as we're sampling these chunks of text
[18:15] we're going to be actually every time
[18:17] we're going to feed them into a
[18:18] Transformer we're going to have many
[18:20] batches of multiple chunks of text that
[18:22] are all like stacked up in a single
[18:23] tensor and that's just done for
[18:25] efficiency just so that we can keep the
[18:27] gpus busy uh because they are very good
[18:29] at parallel processing of um of data and
[18:33] so we just want to process multiple
[18:35] chunks all at the same time but those
[18:37] chunks are processed completely
[18:38] independently they don't talk to each
[18:39] other and so on so let me basically just
[18:42] generalize this and introduce a batch
[18:44] Dimension here's a chunk of
[18:46] code let me just run it and then I'm
[18:48] going to explain what it
[18:50] does so here because we're going to
[18:52] start sampling random locations in the
[18:54] data set to pull chunks from I am
[18:57] setting the seed so that um in the
[19:00] random number generator so that the
[19:01] numbers I see here are going to be the
[19:02] same numbers you see later if you try to
[19:04] reproduce this now the batch size here
[19:07] is how many independent sequences we are
[19:09] processing every forward backward pass
[19:11] of the
[19:12] Transformer the block size as I
[19:14] explained is the maximum context length
[19:16] to make those predictions so let's say batch
[19:19] size four block size eight and then
[19:21] here's how we get batch for any
[19:23] arbitrary split if the split is a
[19:25] training split then we're going to look
[19:26] at train data otherwise at valid data
[19:30] that gives us the data array and then
[19:33] when I Generate random positions to grab
[19:35] a chunk out of I actually grab I
[19:38] actually generate batch size number of
[19:41] Random offsets so because this is four
[19:44] ix is going to be uh four
[19:47] numbers that are randomly generated
[19:49] between zero and Len of data minus block
[19:51] size so it's just random offsets into
[19:53] the training
[19:54] set and then X's as I explained are the
[19:58] first first block size characters
[20:00] starting at I the Y's are the offset by
[20:05] one of that so just add plus one and
[20:08] then we're going to get those chunks for
[20:10] every one of integers i in ix and use a
[20:14] torch.stack to take all those uh uh
[20:17] one-dimensional tensors as we saw here
[20:20] and we're going to um stack them up as
[20:24] rows and so they all become a row in a
[20:27] 4x8 tensor
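The train/validation split and the batch sampling just described can be sketched as below. This is a sketch under assumptions: `data` is a random stand-in for the encoded text (the real tensor comes from encoding Tiny Shakespeare), and the seed value 1337 is an assumption, since the transcript only says that a seed is set.

```python
import torch

torch.manual_seed(1337)  # assumed seed value, only for repeatability

# Stand-in for the encoded data set: 1,000 random token ids in [0, 65).
data = torch.randint(0, 65, (1000,), dtype=torch.long)

# First 90% is training data, the last 10% is held out for validation.
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

batch_size = 4  # independent sequences processed per forward/backward pass
block_size = 8  # maximum context length for predictions

def get_batch(split):
    """Sample batch_size random chunks and their one-step-shifted targets."""
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))      # random offsets
    x = torch.stack([d[i:i + block_size] for i in ix])          # inputs, (4, 8)
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])  # targets, (4, 8)
    return x, y

xb, yb = get_batch("train")
print(xb.shape, yb.shape)
```

Because `y` is `x` shifted by one position, `yb[:, :-1]` equals `xb[:, 1:]`; only the last target in each row is a token the inputs have not seen.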
[20:29] so here's where I'm printing then when I
[20:32] sample a batch XB and YB the inputs to
[20:35] the Transformer now are the input X is
[20:39] the 4x8 tensor four uh rows of eight
[20:44] columns and each one of these is a chunk
[20:47] of the training
[20:48] set and then the targets here are in the
[20:52] associated array Y and they will come in
[20:54] to the Transformer all the way at the
[20:55] end uh to um create the loss function
[20:59] uh so they will give us the correct
[21:01] answer for every single position inside
[21:03] X and then these are the four
[21:06] independent
[21:07] rows so spelled out as we did
[21:11] before uh this 4x8 array contains a
[21:14] total of 32 examples and they're
[21:17] completely independent as far as the
[21:19] Transformer is
[21:20] concerned uh so when the input is 24 the
[21:25] target is 43 or rather 43 here in the Y
[21:28] array
[21:29] when the input is 24 43 the target is
[21:31] 58 uh when the input is 24 43 58 the
[21:34] target is 5 etc or like when it is a 52
[21:38] 58 1 the target is
[21:40] 58 right so you can sort of see this
[21:43] spelled out these are the 32 independent
[21:45] examples packed in to a single batch of
[21:48] the input X and then the desired targets
[21:51] are in y and so now this integer tensor
[21:57] of um X is going to feed into the
[22:00] Transformer and that Transformer is
[22:02] going to simultaneously process all
[22:04] these examples and then look up the
[22:06] correct um integers to predict in every
[22:08] one of these positions in the tensor y
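The way a batch packs one training example into every position can be spelled out with a small loop. This is an illustrative sketch: the 2x4 tensors below are hypothetical and smaller than the 4x8 batch in the lecture, although the first values of each row follow the numbers read out above.

```python
import torch

# Hypothetical mini-batch: 2 rows of 4 tokens instead of the lecture's 4 rows of 8.
xb = torch.tensor([[24, 43, 58, 5],
                   [52, 58, 1, 58]])
# Targets are the inputs shifted by one position (last column is made up).
yb = torch.tensor([[43, 58, 5, 57],
                   [58, 1, 58, 46]])

B, T = xb.shape
examples = []
for b in range(B):          # batch dimension: rows are independent
    for t in range(T):      # time dimension: context grows from 1 to T tokens
        context = xb[b, :t + 1]
        target = yb[b, t]
        examples.append((context.tolist(), target.item()))
        print(f"when input is {context.tolist()} the target is {target.item()}")
```

A batch of shape (B, T) therefore contributes B * T examples, which is 8 here and 32 for the 4x8 batch in the lecture.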
[22:11] okay so now that we have our batch of
[22:13] input that we'd like to feed into a
[22:15] Transformer let's start basically
[22:16] feeding this into neural networks now
[22:19] we're going to start off with the
[22:20] simplest possible neural network which
[22:22] in the case of language modeling in my
[22:23] opinion is the bigram language model and
[22:25] we've covered the bigram language model
[22:26] in my makemore series in a lot of depth
[22:29] and so here I'm going to sort of go
[22:31] faster and let's just implement a PyTorch
[22:33] module directly that implements the bigram
[22:36] language
[22:36] model so I'm importing the PyTorch um nn
[22:41] module uh for
[22:43] reproducibility and then here I'm
[22:44] constructing a bigram language model
[22:46] which is a subclass of
[22:48] nn.Module and then I'm calling it and I'm
[22:51] passing it the inputs and the targets
[22:53] and I'm just printing now when the
[22:55] inputs and targets come here you see that
[22:57] I'm just taking the index uh the inputs
[23:00] X here which I rename to idx and I'm
[23:03] just passing them into this token
[23:04] embedding table so what's going on here is
[23:07] that here in the Constructor we are
[23:09] creating a token embedding table and it
[23:12] is of size vocab size by vocab
[23:15] size and we're using nn.Embedding which
[23:18] is a very thin wrapper around basically
[23:20] a tensor of shape vocab size by vocab
[23:23] size and what's happening here is that
[23:25] when we pass idx here every single
[23:28] integer in our input is going to refer
[23:30] to this embedding table and it's going
[23:32] to pluck out a row of that embedding
[23:34] table corresponding to its index so 24
[23:37] here will go into the embedding table
[23:39] and we'll pluck out the 24th row and
[23:42] then 43 will go here and pluck out the
[23:44] 43rd row etc and then PyTorch is going to
[23:47] arrange all of this into a batch by time
[23:50] by channel tensor in this case batch
[23:53] is four time is eight and C which is the
[23:57] channels is vocab size or 65 and so
[24:01] we're just going to pluck out all those
[24:02] rows arrange them in a B by T by C and
[24:05] now we're going to interpret this as the
[24:07] logits which are basically the scores
[24:10] for the next character in the sequence
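The embedding lookup described here can be sketched as follows. This is a minimal illustration, not the tutorial's exact code: the batch `idx` is filled with random indices as a stand-in for a batch sampled from the dataset, and the seed value is an arbitrary choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)  # arbitrary seed, only for repeatable output

vocab_size = 65   # character vocabulary size stated in the tutorial
B, T = 4, 8       # batch and time dimensions used at this point

# nn.Embedding wraps a (vocab_size, vocab_size) weight tensor
token_embedding_table = nn.Embedding(vocab_size, vocab_size)

# Stand-in batch of token indices (assumption: random instead of real data)
idx = torch.randint(0, vocab_size, (B, T))

# Every integer in idx plucks out its row of the table -> (B, T, C) logits
logits = token_embedding_table(idx)
print(logits.shape)  # torch.Size([4, 8, 65])

# The logits at one position are exactly the table row for that token
row = token_embedding_table.weight[idx[0, 0]]
print(torch.equal(logits[0, 0], row))  # True
```

Because each position only looks up its own row, the prediction at a position depends on that single token and nothing else, which is what makes this a bigram model.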
[24:12] and so what's happening here is we are
[24:14] predicting what comes next based on just
[24:17] the individual identity of a single
[24:19] token and you can do that because um I
[24:22] mean currently the tokens are not
[24:23] talking to each other and they're not
[24:25] seeing any context except for they're
[24:26] just seeing themselves so I'm a
[24:29] token number five and then I can
[24:32] actually make pretty decent predictions
[24:33] about what comes next just by knowing
[24:35] that I'm token five because some
[24:37] characters are known to follow other
[24:39] characters in typical scenarios so we
[24:42] saw a lot of this in a lot more depth in
[24:44] the makemore series and here if I just
[24:46] run this then we currently get the
[24:49] predictions the scores the logits for
[24:53] every one of the 4x8 positions now that
[24:55] we've made predictions about what comes
[24:57] next we'd like to evaluate the loss
[24:58] function and so in the makemore series we
[25:00] saw that a good way to measure a loss or
[25:03] like a quality of the predictions is to
[25:05] use the negative log likelihood loss
[25:07] which is also implemented in pytorch
[25:09] under the name cross entropy so what we'd
[25:12] like to do here is loss is the cross
[25:15] entropy on the predictions and the
[25:17] targets and so this measures the quality
[25:20] of the logits with respect to the
[25:21] targets in other words we have the
[25:24] identity of the next character so how
[25:26] well are we predicting the next
[25:28] character based on the logits and
[25:30] intuitively the correct
[25:33] dimension of logits depending on
[25:36] whatever the target is should have a
[25:38] very high number and all the other
[25:39] dimensions should be very low number
[25:41] right now the issue is that this won't
[25:44] actually this is what we want we want to
[25:46] basically output the logits and the
[25:50] loss this is what we want but
[25:52] unfortunately uh this won't actually run
[25:55] we get an error message but intuitively
[25:57] we want to uh measure this now when we
[26:01] go to the PyTorch cross entropy
[26:04] documentation here um we're trying to
[26:08] call the cross entropy in its functional
[26:10] form uh so that means we don't have to
[26:11] create like a module for it but here
[26:14] when we go to the documentation you have
[26:16] to look into the details of how PyTorch
[26:18] expects these inputs and basically the
[26:20] issue here is PyTorch expects if you have
[26:24] multi-dimensional input which we do
[26:25] because we have a B by T by C tensor then
[26:28] it actually really wants the channels to
[26:31] be the second dimension here so
[26:35] basically it wants a B by C
[26:38] by T instead of a B by T by C and so it's
[26:42] just the details of how PyTorch treats
[26:45] um these kinds of inputs and so we don't
[26:49] actually want to deal with that so what
[26:51] we're going to do instead is we need to
[26:52] basically reshape our logits so here's
[26:54] what I like to do I like to
[26:56] basically give names to the dimensions
[26:58] so logits.shape is B by T by C and unpack
[27:01] those numbers and then let's say that
[27:04] logits equals logits.view and we want it
[27:07] to be a B * T by C so just a
[27:11] two-dimensional
[27:12] array right so we're going to take all
[27:15] the we're going to take all of these um
[27:18] positions here and we're going to
[27:20] stretch them out in a one-dimensional
[27:22] sequence and preserve the channel
[27:25] dimension as the second
[27:26] dimension so we're just kind of like
[27:28] stretching out the array so it's
[27:29] two-dimensional and in that case it's going
[27:31] to better conform to what PyTorch
[27:33] sort of expects in its dimensions now we
[27:36] have to do the same to targets because
[27:38] currently targets are of shape B by T
[27:44] and we want it to be just B * T so
[27:47] one-dimensional now alternatively you
[27:49] could always still just do minus one
[27:51] because PyTorch will guess what this
[27:53] should be if you want to lay it out
[27:55] but let me just be explicit and say B *
[27:57] T once we've reshaped this it will match
[28:00] the cross entropy case and then we
[28:03] should be able to evaluate our
[28:06] loss okay so that runs now and we can do
[28:10] loss and so currently we see that the
[28:12] loss is
[28:13] 4.87 now because we have 65
[28:17] possible vocabulary elements we can
[28:19] actually guess at what the loss should
[28:20] be and in
[28:22] particular we covered negative log
[28:24] likelihood in a lot of detail we are
[28:26] expecting log or ln of 1 over 65
[28:32] and negative of that so we're expecting
[28:34] the loss to be about 4.17 but we're
[28:37] getting 4.87 and so that's telling us
[28:40] that the initial predictions are not uh
[28:42] super diffuse they've got a little bit
[28:43] of entropy and so we're guessing wrong
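The reshape for cross entropy and the expected starting loss can be checked with a short sketch. The logits and targets below are random stand-ins (an assumption, not the tutorial's data); the all-zero logits case is added here only to show what a perfectly uniform prediction over 65 characters would score.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed
B, T, C = 4, 8, 65

logits = torch.randn(B, T, C)              # stand-in scores, shape (B, T, C)
targets = torch.randint(0, C, (B, T))      # stand-in next-character indices

# Flatten so channels are the second dimension: (B*T, C) logits, (B*T,) targets
loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
print(loss.item())

# Equal logits mean a uniform distribution, so the loss is -ln(1/65)
uniform_loss = F.cross_entropy(torch.zeros(B * T, C), targets.view(B * T))
expected = -math.log(1 / C)
print(round(expected, 3))  # 4.174
print(abs(uniform_loss.item() - expected) < 1e-4)  # True
```

A freshly initialized embedding table is not uniform, which is why the measured loss sits above this baseline.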
[28:47] so yes but actually we
[28:50] are able to evaluate the loss okay so
[28:53] now that we can evaluate the quality of
[28:54] the model on some data we'd like to also
[28:57] be able to generate from the model so
[28:59] let's do the generation now I'm going to
[29:01] go again a little bit faster here
[29:03] because I covered all this already in
[29:04] previous
[29:05] videos
[29:07] so here's a generate function for the
[29:11] model so we take some uh we take the the
[29:15] same kind of input idx here and
[29:18] basically this is the current uh context
[29:22] of some characters in a batch in some
[29:24] batch so it's also B by T and the job of
[29:28] generate is to basically take this B by T
[29:30] and extend it to be B by T + 1 plus 2
[29:32] plus 3 and so it's just basically it
[29:34] continues the generation in all the
[29:36] batch dimensions in the time dimension
[29:39] so that's its job and it will do that
[29:41] for max_new_tokens so you can see here
[29:43] on the bottom there's going to be some
[29:45] stuff here but on the bottom whatever is
[29:47] predicted is concatenated on top of the
[29:50] previous idx along the first dimension
[29:53] which is the time dimension to create a
[29:54] B by T + 1 so that becomes a new idx so
[29:58] the job of generate is to take a B by T
[30:00] and make it a B by T plus 1 plus 2 plus
[30:02] three as many as we want max_new_tokens
[30:05] so this is the generation from the model
[30:08] now inside the generation what what are
[30:10] we doing we're taking the current
[30:11] indices we're getting the predictions so
[30:15] we get those are in the logits and
[30:18] then the loss here is going to be
[30:19] ignored because um we're not we're not
[30:21] using that and we have no targets that
[30:23] are sort of ground truth targets that
[30:25] we're going to be comparing with
[30:28] then once we get the logits we are only
[30:30] focusing on the last step so instead of
[30:33] a B by T by C we're going to pluck out
[30:36] the -1 the last element in the
[30:38] time dimension because those are the
[30:40] predictions for what comes next so that
[30:42] gives us the logits which we then
[30:44] convert to probabilities via softmax and
[30:47] then we use torch.multinomial to sample
[30:49] from those probabilities and we ask
[30:51] PyTorch to give us one sample and so
[30:54] idx_next will become a B by 1 because in
[30:57] each one of the batch dimensions
[31:00] we're going to have a single prediction
[31:01] for what comes next so this num_samples
[31:03] equals one will make this be a
[31:06] one and then we're going to take those
[31:08] integers that come from the sampling
[31:10] process according to the probability
[31:11] distribution given here and those
[31:13] integers get just concatenated on top of
[31:15] the current sort of like running stream
[31:17] of integers and this gives us a B by T +
[31:20] one and then we can return that now one
[31:24] thing here is you see how I'm calling
[31:26] self of idx which will end up going to
[31:29] the forward function I'm not providing
[31:31] any targets so currently this would give
[31:33] an error because targets is
[31:36] sort of like not given so targets has to
[31:39] be optional so targets is None by
[31:41] default and then if targets is None then
[31:44] there's no loss to create so it's just
[31:47] loss is None but else all of this
[31:50] happens and we can create a loss so this
[31:53] will make it so if we have the
[31:56] targets we provide them and get a loss
[31:57] if we have no targets we'll just
[31:59] get the
[32:00] logits so this here will generate from
[32:02] the model um and let's take that for a
[32:06] ride
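Putting the forward pass, the optional targets, and the generate loop together gives roughly the following. This is a sketch reconstructed from the walkthrough, not a verbatim copy of the tutorial's code: the class and argument names are chosen to match what is said aloud, and this version returns the logits unflattened, which is a simplification made here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly reads the next-token logits from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, C)
        if targets is None:
            loss = None  # nothing to compare against while generating
        else:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of indices for the current context
        for _ in range(max_new_tokens):
            logits, _ = self(idx)               # loss is ignored here
            logits = logits[:, -1, :]           # last time step only -> (B, C)
            probs = F.softmax(logits, dim=-1)   # scores -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)             # (B, T+1)
        return idx


torch.manual_seed(1337)  # arbitrary seed
model = BigramLanguageModel(vocab_size=65)
context = torch.zeros((1, 1), dtype=torch.long)  # a single token with index 0
out = model.generate(context, max_new_tokens=100)
print(out.shape)  # torch.Size([1, 101])
```

The whole running context is passed back in on every step even though a bigram model only needs the last token; keeping the interface this way lets the same loop serve models that do use the history.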
[32:08] now oops so I have another code chunk
[32:11] here which will generate for the model
[32:13] from the model and okay this is kind of
[32:15] crazy so maybe let me let me break this
[32:18] down so these are the idx
[32:23] right I'm creating a batch will be just
[32:26] one time will be just one so I'm
[32:30] creating a little one by one tensor and
[32:32] it's holding a zero and the dtype the
[32:35] data type is integer so zero is going
[32:38] to be how we kick off the generation and
[32:40] remember that zero is the element
[32:44] standing for a newline character so
[32:45] it's kind of like a reasonable thing to
[32:47] feed in as the very first character
[32:49] in a sequence to be the
[32:51] newline so it's going to be idx which
[32:54] we're going to feed in here then we're
[32:56] going to ask for 100 tokens
[32:58] and then generate will continue that
[33:01] now because generate works on the
[33:05] level of batches we then have to
[33:07] index into the zeroth row to basically
[33:09] pluck out the single batch dimension
[33:13] that exists and then that gives us a
[33:18] time steps just a one-dimensional array
[33:20] of all the indices which we will convert
[33:23] to a simple Python list from a PyTorch
[33:26] tensor so that that can feed into our
[33:28] decode function and uh convert those
[33:32] integers into text so let me bring this
[33:34] back and we're generating 100 tokens
[33:37] let's
[33:37] run and uh here's the generation that we
[33:40] achieved so obviously it's garbage and
[33:43] the reason it's garbage is because this
[33:44] is a totally random model so next up
[33:47] we're going to want to train this model
[33:49] now one more thing I wanted to point out
[33:50] here is this function is written to be
[33:53] General but it's kind of like ridiculous
[33:55] right now because
[33:58] we're feeding in all this we're building
[33:59] out this context and we're concatenating
[34:02] it all and we're always feeding it all
[34:05] into the model but that's kind of
[34:07] ridiculous because this is just a simple
[34:09] bigram model so to make for example this
[34:11] prediction about K we only needed this W
[34:14] but actually what we fed into the model
[34:15] is we fed the entire sequence and then
[34:18] we only looked at the very last piece
[34:20] and predicted K so the only reason I'm
[34:23] writing it in this way is because right
[34:25] now this is a bigram model but I'd like to
[34:27] keep this function fixed and I'd
[34:29] like it to work um later when our
[34:32] characters actually um basically look
[34:35] further in the history and so right now
[34:37] the history is not used so this looks
[34:39] silly uh but eventually the history will
[34:42] be used and so that's why we want to uh
[34:44] do it this way so just a quick comment
[34:46] on that so now we see that this is um
[34:49] random so let's train the model so it
[34:51] becomes a bit less random okay let's Now
[34:53] train the model so first what I'm going
[34:55] to do is I'm going to create a PyTorch
[34:57] optimization object so here we are using
[35:00] the optimizer AdamW now in the makemore
[35:05] series we've only ever used stochastic
[35:06] gradient descent the simplest possible
[35:08] optimizer which you can get using the
[35:10] SGD instead but I want to use Adam which
[35:12] is a much more advanced and popular
[35:14] optimizer and it works extremely well
[35:16] a typical good setting for the
[35:19] learning rate is roughly 3e-4 but for
[35:22] very very small networks like is the
[35:23] case here you can get away with much
[35:25] much higher learning rates 1e-3 or even
[35:28] higher probably but let me create the
[35:30] optimizer object which will basically
[35:33] take the gradients and uh update the
[35:35] parameters using the
[35:36] gradients and then here our batch size
[35:40] up above was only four so let me
[35:41] actually use something bigger let's say
[35:43] 32 and then for some number of steps um
[35:46] we are sampling a new batch of data
[35:48] we're evaluating the loss uh we're
[35:51] zeroing out all the gradients from the
[35:52] previous step getting the gradients for
[35:54] all the parameters and then using those
[35:56] gradients to update our parameters so
[35:58] typical training loop as we saw in the
[36:00] makemore series so let me now run
[36:04] this for say 100 iterations and let's
[36:07] see what kind of losses we're going to
[36:09] get so we started around
[36:12] 4.7 and now we're getting down to
[36:14] like 4.6 4.5 etc so the optimization is
[36:18] definitely happening but let's
[36:22] sort of try to increase the number of
[36:23] iterations and only print at the
[36:25] end because we probably want to train for
[36:29] longer okay so we're down to 3.6
[36:34] roughly roughly down to
[36:40] three this is the most janky
[36:46] optimization okay it's working let's
[36:48] just do
[36:50] 10,000 and then from here we want to
[36:53] copy this and hopefully we're going
[36:56] to get something reasonable and of course
[36:58] it's not going to be Shakespeare from a
[37:00] bigram model but at least we see that the
[37:01] loss is improving and hopefully we're
[37:05] expecting something a bit more
[37:06] reasonable okay so we're down at about
[37:08] 2.5-ish let's see what we get okay
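The training loop just described (sample a batch, evaluate the loss, zero the gradients, backpropagate, step the optimizer) can be sketched in a self-contained form. Several things here are assumptions made so the example runs on its own: the tiny repeated sentence stands in for Tiny Shakespeare, the `get_batch` helper is a simplified version with no train/validation split, and the learning rate of 1e-2 and 300 steps are picked only to make progress visible quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)  # arbitrary seed

# Stand-in corpus (assumption); the tutorial uses the Tiny Shakespeare text
text = "to be or not to be that is the question\n" * 50
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

vocab_size = len(chars)
block_size, batch_size = 8, 32


def get_batch():
    # Random offsets; targets are the inputs shifted one position to the right
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y


# The bigram model is just the embedding table of next-token logits
table = nn.Embedding(vocab_size, vocab_size)
optimizer = torch.optim.AdamW(table.parameters(), lr=1e-2)

losses = []
for step in range(300):
    xb, yb = get_batch()                       # sample a new batch
    logits = table(xb)                         # (B, T, C)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)      # clear old gradients
    loss.backward()                            # compute new gradients
    optimizer.step()                           # update the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The per-batch loss printed this way is noisy, since each batch is a different random sample.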
[37:12] dramatic improvements certainly on what
[37:14] we had here so let me just increase the
[37:17] number of tokens okay so we see that
[37:19] we're starting to get something at least
[37:21] like reasonable-ish
[37:25] certainly not Shakespeare but the
[37:29] model is making progress so that is the
[37:31] simplest possible
[37:33] model so now what I'd like to do
[37:36] is obviously this is a very simple model
[37:39] because the tokens are not talking to
[37:41] each other so given the previous context
[37:43] of whatever was generated we're only
[37:45] looking at the very last character to
[37:46] make the predictions about what comes
[37:48] next so now these uh now these tokens
[37:50] have to start talking to each other and
[37:53] figuring out what is in the context so
[37:55] that they can make better predictions
[37:56] for what comes next and this is how
[37:57] we're going to kick off the uh
[37:59] Transformer okay so next I took the code
[38:02] that we developed in this Jupyter notebook
[38:03] and I converted it to be a script and
[38:05] I'm doing this because I just want to
[38:08] simplify our intermediate work into just
[38:10] the final product that we have at this
[38:12] point so in the top here I put all the
[38:15] hyperparameters that we defined I
[38:16] introduced a few and I'm going to speak
[38:18] to that in a little bit otherwise a lot
[38:20] of this should be recognizable
[38:23] reproducibility read data get the
[38:25] encoder and the decoder create the train
[38:27] and test splits use the kind of like
[38:30] data loader that gets a batch of the
[38:34] inputs and targets this is new and I'll
[38:36] talk about it in a second now this is
[38:39] the bigram language model that we
[38:40] developed and it can forward and give us
[38:43] a logits and loss and it can
[38:45] generate and then here we are creating
[38:48] the optimizer and this is the training
[38:51] Loop so everything here should look
[38:53] pretty familiar now some of the small
[38:55] things that I added number one I added
[38:57] the ability to run on a GPU if you have
[39:00] it so if you have a GPU then
[39:02] this will use CUDA instead of just CPU
[39:04] and everything will be a lot faster
[39:07] now when device becomes CUDA then we
[39:09] need to make sure that when we load the
[39:11] data we move it to
[39:13] device when we create the model we want
[39:15] to move uh the model parameters to
[39:18] device so as an example here we have the
[39:21] nn.Embedding table and it's got a
[39:23] weight inside it which stores the uh
[39:26] sort of lookup table so so that would be
[39:27] moved to the GPU so that all the
[39:29] calculations here happen on the GPU and
[39:32] they can be a lot faster and then
[39:34] finally here when I'm creating the
[39:35] context that feeds in to generate I have
[39:37] to make sure that I create it on the
[39:39] device number two what I introduced is
[39:43] uh the fact that here in the training
[39:46] loop here I was just printing the
[39:50] loss.item() inside the training loop but this
[39:53] is a very noisy measurement of the
[39:54] current loss because every batch will be
[39:56] more or less lucky and so what I want to
[39:59] do usually is I have an estimate_loss
[40:02] function and estimate_loss
[40:05] basically then goes up here and it
[40:10] averages up the loss over multiple
[40:12] batches so in particular we're going to
[40:15] iterate eval_iters times and we're going
[40:17] to basically get our loss and then we're
[40:19] going to get the average loss for both
[40:21] splits and so this will be a lot less
[40:24] noisy so here when we call estimate_loss
[40:26] we're going to report the
[40:28] pretty accurate train and validation
[40:31] loss now when we come back up you'll
[40:33] notice a few things here I'm setting the
[40:35] model to evaluation phase and down here
[40:38] I'm resetting it back to training phase
[40:40] now right now for our model as is this
[40:42] doesn't actually do anything because the
[40:44] only thing inside this model is this uh
[40:46] nn.Embedding and this
[40:51] network would behave
[40:53] the same in both evaluation mode and
[40:55] training mode we have no dropout layers
[40:57] we have no batchnorm layers etc but it is a
[41:00] good practice to think through what mode
[41:02] your neural network is in because some
[41:04] layers will have different behavior
[41:07] at inference time or training time and
[41:11] there's also this context manager
[41:12] torch.no_grad and this is just telling
[41:14] PyTorch that everything that happens
[41:16] inside this function we will not call
[41:18] .backward on and so PyTorch can be a lot
[41:18] backward on and so pytorch can be a lot
[41:21] more efficient with its memory use
[41:23] because it doesn't have to store all the
[41:25] intermediate variables uh because we're
[41:27] never going to call backward and so it
[41:29] can it can be a lot more memory
[41:30] efficient in that way so also a good
[41:32] practice to tell PyTorch when we don't
[41:35] intend to do back
[41:36] propagation so right now this script is
[41:39] about 120 lines of code and that's
[41:43] kind of our starter code I'm calling it
[41:45] bigram.py and I'm going to release it later
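The loss-averaging helper described here can be sketched as follows. The names `estimate_loss` and `eval_iters` follow what is said in the walkthrough; the function signature, the random stand-in data, and the simplified `get_batch(split)` helper are assumptions made so that the example is self-contained, and the model is reduced to a bare embedding table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()  # no backward call will follow, so skip storing intermediates
def estimate_loss(model, get_batch, eval_iters=20):
    out = {}
    model.eval()   # evaluation mode (matters once dropout or batchnorm exist)
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = model(xb)  # (B, T, C)
            losses[k] = F.cross_entropy(
                logits.view(-1, logits.size(-1)), yb.view(-1)
            )
        out[split] = losses.mean().item()  # average over many batches
    model.train()  # switch back to training mode
    return out


torch.manual_seed(0)  # arbitrary seed
vocab_size, block_size, batch_size = 10, 8, 4

# Random stand-in token streams for the two splits (assumption)
data = {
    "train": torch.randint(0, vocab_size, (500,)),
    "val": torch.randint(0, vocab_size, (100,)),
}


def get_batch(split):
    d = data[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y


model = nn.Embedding(vocab_size, vocab_size)  # bare bigram table as the model
result = estimate_loss(model, get_batch)
print(result)
```

Averaging over `eval_iters` batches gives a much steadier number than printing the loss of whichever single batch happened to be sampled.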
[41:48] now running this
[41:50] script gives us output in the terminal
[41:52] and it looks something like this it
[41:54] basically as I ran this code uh it was
[41:57] giving me the train loss and Val loss
[41:59] and we see that we converge to somewhere
[42:01] around
[42:01] 2.5 with the bigram model and then here's
[42:04] the sample that we produced at the
[42:07] end and so we have everything packaged
[42:09] up in the script and we're in a good
[42:11] position now to iterate on this okay so
[42:13] we are almost ready to start writing our
[42:15] very first self attention block for
[42:18] processing these uh tokens now before we
[42:22] actually get there I want to get you
[42:24] used to a mathematical trick that is
[42:26] used in the self attention inside a
[42:28] Transformer and is really just like at
[42:30] the heart of an efficient
[42:32] implementation of self attention and so
[42:34] I want to work with this toy example to
[42:36] just get you used to this operation and
[42:38] then it's going to make it much more
[42:39] clear once we actually get to um to it
[42:43] uh in the script
[42:44] again so let's create a B by T by C where
[42:47] B, T and C are just 4, 8 and 2 in the toy
[42:50] example and these are basically channels
[42:53] and we have batches and we have the
[42:55] time component and we have information
[42:58] at each point in the sequence so
[43:01] C now what we would like to do is we
[43:03] would like these um tokens so we have up
[43:06] to eight tokens here in a batch and
[43:08] these eight tokens are currently not
[43:10] talking to each other and we would like
[43:11] them to talk to each other we'd like to
[43:13] couple them and in particular
[43:17] we want to couple them in a very
[43:18] specific way so the token for example at
[43:21] the fifth location it should not
[43:23] communicate with tokens in the sixth
[43:25] seventh and eighth location
[43:27] because uh those are future tokens in
[43:29] the sequence the token on the fifth
[43:31] location should only talk to the one in
[43:33] the fourth third second and first so
[43:36] it's only so information only flows from
[43:38] previous context to the current time
[43:40] step and we cannot get any information
[43:42] from the future because we are about to
[43:44] try to predict the
[43:45] future so what is the easiest way for
[43:49] tokens to communicate okay the easiest
[43:52] way I would say is okay if we're up to
[43:54] if we're a fifth token and I'd like to
[43:56] communicate with my past the simplest
[43:58] way we can do that is
[44:00] to just do an average of all
[44:03] the preceding elements so
[44:06] for example if I'm the fifth token I would
[44:08] like to take the channels that make
[44:10] up the information at my step but
[44:13] then also the channels from the fourth
[44:15] step third step second step and the
[44:17] first step I'd like to average those up
[44:19] and then that would become sort of like
[44:21] a feature Vector that summarizes me in
[44:23] the context of my history now of course
[44:26] just doing a sum or like an average is
[44:28] an extremely weak form of interaction
[44:30] like this communication is uh extremely
[44:32] lossy we've lost a ton of information
[44:34] about the spatial Arrangements of all
[44:35] those tokens uh but that's okay for now
[44:38] we'll see how we can bring that
[44:39] information back later for now what we
[44:41] would like to do is for every single
[44:43] batch element independently for every
[44:46] t-th token in that sequence we'd like
[44:49] to now calculate the average of all the
[44:53] vectors in all the previous tokens and
[44:55] also at this token so let's write that
[44:58] out um I have a small snippet here and
[45:01] instead of just fumbling around let me
[45:03] just copy paste it and talk to
[45:05] it so in other words we're going to
[45:08] create xbow and bow is short for bag of words
[45:12] because bag of words is um is kind of
[45:15] like um a term that people use when you
[45:17] are just averaging up things so this is
[45:19] just a bag of words basically there's a
[45:21] word stored on every one of these eight
[45:23] locations and we're doing a bag of words
[45:25] we're just averaging
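The snippet being discussed can be sketched like this. The variable names `xbow` and `xprev` follow the walkthrough; the random `x` and the seed are stand-ins chosen here.

```python
import torch

torch.manual_seed(1337)  # arbitrary seed
B, T, C = 4, 8, 2        # batch, time, channels of the toy example
x = torch.randn(B, T, C)

# xbow[b, t] = mean of x[b, i] over all i <= t (a "bag of words" average)
xbow = torch.zeros((B, T, C))
for b in range(B):          # every batch element independently
    for t in range(T):      # every position in the sequence
        xprev = x[b, :t + 1]                 # (t+1, C): tokens up to and including t
        xbow[b, t] = torch.mean(xprev, 0)    # average over the time dimension

print(x[0])
print(xbow[0])
```

The first row of `xbow[0]` equals the first row of `x[0]` because it averages a single token; each later row averages one more token than the row before it.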
[45:27] so in the beginning we're going to say
[45:28] that it's just initialized at Zero and
[45:30] then I'm doing a for Loop here so we're
[45:32] not being efficient yet that's coming
[45:34] but for now we're just iterating over
[45:36] all the batch Dimensions independently
[45:38] iterating over time and then the
[45:40] previous uh tokens are at this uh batch
[45:45] Dimension and then everything up to and
[45:47] including the t-th token okay so when
[45:51] we slice out x in this way xprev
[45:54] becomes of shape how many t elements
[45:58] there were in the past and then of
[46:00] course C so all the two-dimensional
[46:02] information from these little tokens so
[46:05] that's the previous uh sort of chunk of
[46:08] um tokens from my current sequence and
[46:12] then I'm just doing the average or the
[46:13] mean over the zero Dimension so I'm
[46:15] averaging out the time here and I'm just
[46:19] going to get a little C one-dimensional
[46:21] vector which I'm going to store in xbow
[46:23] so I can run this and
[46:27] this is not going to be very informative
[46:30] because let's see so this is x of zero so
[46:32] this is the zeroth batch element and
[46:35] then xbow at zero now you see how at
[46:40] the first location here you see that the
[46:42] two are equal and that's because it's
[46:45] we're just doing an average of this one
[46:46] token but here this one is now an
[46:49] average of these two and now this one is
[46:53] an average of these
[46:54] three and so on
[46:57] so uh and this last one is the average
[47:01] of all of these elements so vertical
[47:03] average just averaging up all the tokens
[47:05] now gives this outcome
[47:07] here so this is all well and good uh but
[47:10] this is very inefficient now the trick
[47:12] is that we can be very very efficient
[47:14] about doing this using matrix
[47:16] multiplication so that's the
[47:18] mathematical trick and let me show you
[47:19] what I mean let's work with the toy
[47:21] example here let me run it and I'll
[47:24] explain I have a simple Matrix here that
[47:27] is a 3X3 of all ones a matrix B of just
[47:31] random numbers and it's a 3x2 and a
[47:33] matrix C which will be 3x3 multiply 3x2
[47:36] which will give out a 3x2 so here we're
[47:39] just using um matrix multiplication so a
[47:43] multiply B gives us
[47:46] C okay so how are these numbers in C um
[47:51] achieved right so this number in the top
[47:54] left is the first row of a dot product
[47:57] with the first column of B and since all
[48:00] the row of a right now is all just
[48:02] ones then the dot product here with
[48:05] this column of B is just going to do a
[48:07] sum of these of this column so 2 + 6 + 6
[48:11] is
[48:12] 14 the element here in the output of C
[48:15] is also the first column here the first
[48:17] row of a multiplied now with the second
[48:20] column of B so 7 + 4 + 5 is 16 now you
[48:25] see that there's repeating elements here
[48:26] so this 14 again is because this row is
[48:28] again all ones and it's multiplying the
[48:30] First Column of B so we get 14 and this
[48:33] one is and so on so this last number
[48:35] here is the last row dot product last
[48:39] column now the trick here is uh the
[48:42] following this is just a
[48:44] boring array of all
[48:48] ones but torch has this function called
[48:50] tril which is short for
[48:54] triangular something like that and
[48:56] you can wrap it in torch.ones and it
[48:58] will just return the lower triangular
[49:00] portion of this
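A small sketch of the two multiplying matrices side by side. The entries of `b` are inferred from the column sums read out in the walkthrough (2 + 6 + 6 and 7 + 4 + 5); in the tutorial they come from a seeded random draw, so hard-coding them here is an assumption.

```python
import torch

# Values inferred from the sums read out in the walkthrough (assumption)
b = torch.tensor([[2.0, 7.0],
                  [6.0, 4.0],
                  [6.0, 5.0]])

# All-ones matrix: every output row is the sum of all rows of b
a_ones = torch.ones(3, 3)
print(a_ones @ b)   # each row is [14., 16.]

# Lower-triangular ones: row i of the output sums rows 0..i of b
a = torch.tril(torch.ones(3, 3))
print(a)            # [[1,0,0],[1,1,0],[1,1,1]]
print(a @ b)        # rows: [2., 7.], [8., 11.], [14., 16.]
```

With the zeros in the upper triangle, each output row only accumulates the rows of `b` at or before its own position.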
[49:03] okay so now it will basically zero out
[49:06] uh these guys here so we just get the
[49:08] lower triangular part well what happens
[49:10] if we do
[49:14] that so now we'll have a like this and B
[49:17] like this and now what are we getting
[49:18] here in C well what is this number well
[49:22] this is the first row times the First
[49:24] Column and because this is zeros
[49:28] uh these elements here are now ignored
[49:30] so we just get a two and then this
[49:32] number here is the first row times the
[49:35] second column and because these are
[49:37] zeros they get ignored and it's just
[49:39] seven this seven multiplies this one but
[49:42] look what happened here because this is
[49:43] one and then zeros we what ended up
[49:46] happening is we're just plucking out the
[49:48] row of this row of B and that's what we
[49:51] got now here we have 1 1 0 so here 1 1 0
[49:57] dot product with these two columns will
[49:59] now give us 2 + 6 which is 8 and 7 + 4
[50:02] which is 11 and because this is 1 1 1 we
[50:05] ended up with the addition of all of
[50:07] them and so basically depending on how
[50:10] many ones and zeros we have here we are
[50:12] basically doing a sum currently of a
[50:16] variable number of these rows and that
[50:18] gets deposited into
[50:20] C So currently we're doing sums because
[50:23] these are ones but we can also do
[50:25] average right and you can start to see
[50:27] how we could do average uh of the rows
[50:29] of B uh sort of in an incremental
[50:32] fashion because we don't have to we can
[50:35] basically normalize these rows so that
[50:37] they sum to one and then we're going to
[50:39] get an average so if we took a and then
[50:41] we did a equals
[50:43] a divide torch.sum of a in the
[50:51] oneth dimension and then let's keepdim
[50:55] as true so therefore the broadcasting
[50:57] will work out so if I rerun this you see
[51:00] now that these rows now sum to one so
[51:04] this row is one this row is 0.5 0.5 0 and
[51:07] here we get 1/3 and now when we do a
[51:09] multiply B what are we getting here we
[51:12] are just getting the first row first row
[51:15] here now we are getting the average of
[51:18] the first two
[51:20] rows okay so 2 and six average is four
[51:23] and four and seven average is
[51:25] 5.5 and on the bottom here we are now
[51:27] getting the average of these three rows
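Normalizing the rows of the triangular matrix turns the running sums into running averages. As before, the entries of `b` are inferred from the numbers read out in the walkthrough, so treat them as an assumption.

```python
import torch

# Values inferred from the numbers read out in the walkthrough (assumption)
b = torch.tensor([[2.0, 7.0],
                  [6.0, 4.0],
                  [6.0, 5.0]])

a = torch.tril(torch.ones(3, 3))
# Divide each row by its sum; keepdim=True keeps shape (3, 1) so it broadcasts
a = a / torch.sum(a, 1, keepdim=True)
print(a)       # rows: [1, 0, 0], [0.5, 0.5, 0], [1/3, 1/3, 1/3]

c = a @ b
print(c)       # rows: [2, 7], [4, 5.5], [14/3, 16/3]
```

Row `i` of `c` is the mean of rows `0..i` of `b`, which is the same incremental average the nested loop computes, done in one matrix multiply.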
[51:31] so the average of all of elements of B
[51:33] are now deposited here and so you can
[51:36] see that by manipulating these uh
[51:40] elements of this multiplying Matrix and
[51:42] then multiplying it with any given
[51:44] Matrix we can do these averages in this
[51:47] incremental fashion because we just get
[51:50] um and we can manipulate that based on
[51:53] the elements of a okay so that's very
[51:55] convenient so let's let's swing back up
[51:57] here and see how we can vectorize this
[51:59] and make it much more efficient using
[52:00] what we've learned so in
[52:03] particular we are going to produce an
[52:05] array a but here I'm going to call it wei
[52:08] short for weights but this is our
[52:11] a and this is how much of every row we
[52:14] want to average up and it's going to be
[52:17] an average because you can see that
[52:18] these rows sum to
[52:20] one so this is our a and then our B in
[52:23] this example of course is X
[52:27] so what's going to happen here now is
[52:29] that we are going to have an
[52:31] xbow2 and this xbow2 is going to be wei
[52:36] multiplying
[52:38] our x so let's think this through wei is T by T
[52:42] and this is matrix multiplying in
[52:44] PyTorch a B by T by
[52:47] C and it's giving us what
[52:50] shape so PyTorch will come here and it
[52:52] will see that these shapes are not the
[52:54] same so it will create a batch Dimension
[52:57] here and this is a batched matrix
[53:00] multiply and so it will apply this
[53:02] matrix multiplication in all the batch
[53:04] elements um in parallel and individually
[53:08] and then for each batch element there
[53:09] will be a T by T multiplying T by C
[53:12] exactly as we had
[53:15] below so this will now create B by T by
[53:20] C and xbow2 will now become identical
[53:24] to xbow
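The equivalence described above can be checked with a small sketch. The tensor sizes (B=4, T=8, C=2), the seed, and the `atol` tolerance are assumptions chosen for illustration; `xbow`, `xbow2`, and `wei` follow the names used in the lecture.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2  # batch, time, channels (illustrative sizes)
x = torch.randn(B, T, C)

# Version 1: explicit loops. xbow[b, t] is the mean of x[b, 0..t].
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)

# Version 2: one batched matrix multiply with a row-normalized
# lower-triangular weight matrix.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)  # every row now sums to 1
xbow2 = wei @ x  # (T, T) @ (B, T, C) broadcasts to (B, T, C)

# A small absolute tolerance absorbs float32 rounding differences.
print(torch.allclose(xbow, xbow2, atol=1e-6))
```

The loop and the matrix multiply compute the same running averages; the second form simply lets PyTorch do all batch elements and all time steps in one call.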
[53:28] so we can see that torch.allclose of
[53:32] xbow and xbow2 should be true
[53:36] now so this kind of like convinces us
[53:38] that these are in fact the same so
[53:43] xbow and xbow2 if I just print
[53:47] them okay we're not going to be able to
[53:51] just stare it down but
[53:54] well let me try xbow basically just
[53:56] at the zeroth element and xbow2 at
[53:58] the zeroth element so just the first
[53:59] batch and we should see that this and
[54:02] that should be identical which they
[54:04] are right so what happened here the
[54:07] trick is we were able to use batched
[54:09] Matrix multiply to do this uh
[54:12] aggregation really and it's a weighted
[54:15] aggregation and the weights are
[54:17] specified in this T by T array and
[54:21] we're basically doing weighted sums and
[54:24] these weighted sums are
[54:26] according to the weights inside here
[54:28] they take on sort of this triangular
[54:31] form and so that means that a token at
[54:33] the t-th dimension will only get sort
[54:36] of information from the tokens
[54:39] preceding it so that's exactly what we
[54:41] want and finally I would like to rewrite
[54:43] it in one more way and we're going to
[54:46] see why that's useful so this is the
[54:48] third version and it's also identical to
[54:50] the first and second but let me talk
[54:53] through it it uses
[54:54] softmax so tril here is this matrix
[55:00] lower triangular
[55:01] ones wei begins as all
[55:05] zero okay so if I just print wei in the
[55:07] beginning it's all zero then I
[55:11] used masked_fill so what this is doing
[55:15] is wei.masked_fill it's all zeros and
[55:18] I'm saying for all the elements where
[55:20] tril is equal to zero make them be
[55:23] negative infinity so all the elements
[55:26] where tril is zero will become negative
[55:28] infinity now so this is what we get and
[55:32] then the final line here is
[55:36] softmax so if I take a softmax along
[55:38] every single so dim is negative one so
[55:40] along every single row if I do softmax
[55:44] what is that going to
[55:46] do well softmax is um is also like a
[55:51] normalization operation right and so
[55:54] spoiler alert you get the exact same
[55:58] matrix let me bring back the
[56:00] softmax and recall that in softmax we're
[56:02] going to exponentiate every single one
[56:04] of these and then we're going to divide
[56:06] by the sum and so if we exponentiate
[56:10] every single element here we're going to
[56:11] get a one and here we're going to get
[56:14] basically zero zero zero zero everywhere else
[56:17] and then when we normalize we just get
[56:19] one here we're going to get one one and
[56:21] then zeros and then softmax will again
[56:24] divide and this will give us 0.5 0.5 and so
[56:27] on and so this is also the same
[56:30] way to produce this mask now the
[56:33] reason that this is a bit more
[56:34] interesting and the reason we're going
[56:36] to end up using it in self
[56:37] attention is that these weights here
[56:41] begin uh with zero and you can think of
[56:44] this as like an interaction strength or
[56:46] like an affinity so basically it's
[56:49] telling us how much of each uh token
[56:52] from the past do we want to Aggregate
[56:54] and average up
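A minimal sketch of this third, softmax-based version, assuming T=8 for illustration. It builds `wei` from zeros, masks with `tril`, applies softmax, and compares the result against the row-normalized lower-triangular matrix from the earlier version.

```python
import torch
import torch.nn.functional as F

T = 8  # illustrative sequence length
tril = torch.tril(torch.ones(T, T))

wei = torch.zeros(T, T)                          # affinities start at zero
wei = wei.masked_fill(tril == 0, float("-inf"))  # block positions above the diagonal
wei = F.softmax(wei, dim=-1)                     # exponentiate and normalize each row

# exp(0) = 1 and exp(-inf) = 0, so each row becomes a uniform average
# over the allowed positions: the same matrix as before.
expected = tril / tril.sum(dim=1, keepdim=True)
print(torch.allclose(wei, expected, atol=1e-6))
```

Because the zeros are only a starting point, replacing them with learned, data-dependent scores changes the averaging weights without changing any of the surrounding code.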
[56:57] and then this line is saying tokens from
[56:59] the future cannot communicate by setting
[57:02] them to negative infinity we're saying
[57:04] that we will not aggregate anything from
[57:06] those
[57:07] tokens and so basically this then goes
[57:09] through softmax and through the weighted
[57:11] and this is the aggregation through
[57:12] matrix
[57:14] multiplication and so what this is now
[57:16] is you can think of these as um these
[57:19] zeros are currently just set by us to be
[57:21] zero but a quick preview is that these
[57:25] affinities between the tokens are not
[57:27] going to be just constant at zero
[57:29] they're going to be data dependent these
[57:31] tokens are going to start looking at
[57:32] each other and some tokens will find
[57:34] other tokens more or less interesting
[57:37] and depending on what their values are
[57:39] they're going to find each other
[57:41] interesting to different amounts and I'm
[57:42] going to call those affinities I think
[57:45] and then here we are saying the future
[57:47] cannot communicate with the past
[57:49] we're going to clamp them and then when
[57:51] we normalize and sum we're going to
[57:53] aggregate uh sort of their values
[57:56] depending on how interesting they find
[57:57] each other and so that's the preview for
[57:59] self attention and basically long story
[58:03] short from this entire section is that
[58:05] you can do weighted aggregations of your
[58:07] past
[58:08] elements by using matrix
[58:12] multiplication in a lower triangular
[58:14] fashion and then the elements here in
[58:17] the lower triangular part are telling
[58:18] you how much of each element uh fuses
[58:21] into this position so we're going to use
[58:24] this trick now to develop the self
[58:25] attention block so first let's get
[58:27] some quick preliminaries out of the way
[58:30] first the thing I'm kind of bothered by
[58:31] is that you see how we're passing in
[58:33] vocab_size into the constructor there's
[58:35] no need to do that because vocab_size is
[58:36] already defined up top as a global
[58:38] variable so there's no need to pass this
[58:40] stuff
[58:41] around next what I want to do is I don't
[58:44] want to actually create I want to create
[58:46] like a level of indirection here where
[58:47] we don't directly go to the embedding
[58:49] for the um logits but instead we go
[58:52] through this intermediate phase because
[58:54] we're going to start making that bigger
[58:57] so let me introduce a new variable
[58:59] n_embd short for number of embedding
[59:02] dimensions so
[59:04] n_embd here will be say 32 that was a
[59:09] suggestion from GitHub Copilot by the
[59:11] way it also suggested 32 which is a good
[59:14] number so this is an embedding table and
[59:16] only 32 dimensional
[59:18] embeddings so then here this is not
[59:21] going to give us logits directly instead
[59:23] this is going to give us token
[59:24] embeddings that's what I'm going to call it
[59:27] and then to go from the token embeddings to
[59:29] the logits we're going to need a linear
[59:30] layer so self.lm_head let's call it
[59:34] short for language modeling head is
[59:36] nn.Linear from n_embd up to vocab_size
[59:39] and then when we swing over here we're
[59:41] actually going to get the logits by
[59:43] exactly what the Copilot says now we
[59:46] have to be careful here because this C
[59:48] and this C are not equal this is n_embd
[59:52] C and this is vocab_size so let's just
[59:55] say that n_embd is equal to
[59:57] C and then this just creates one spurious
[1:00:01] layer of interaction through a linear
[1:00:02] layer but this should basically
[1:00:11] run so we see that this runs and this
[1:00:15] currently looks kind of spurious but
[1:00:17] we're going to build on top of this now
[1:00:19] next up so far we've taken these indices
[1:00:22] and we've encoded them based on the
[1:00:23] identity of the tokens inside idx
[1:00:28] the next thing that people very often do
[1:00:30] is that we're not just encoding the
[1:00:31] identity of these tokens but also their
[1:00:33] position so we're going to have a second
[1:00:35] position uh embedding table here so
[1:00:38] self.position_embedding_table is an
[1:00:41] nn.Embedding of block_size by n_embd and
[1:00:44] so each position from zero to block_size
[1:00:46] minus one will also get its own
[1:00:47] embedding vector and then here first let
[1:00:50] me decode B by T from idx.shape
[1:00:54] and then here we're also going to
[1:00:56] have a pos embedding which is the
[1:00:58] positional embedding and
[1:01:00] this is torch.arange so this will be basically
[1:01:03] just integers from Z to T minus one and
[1:01:06] all of those integers from 0 to T minus
[1:01:08] one get embedded through the table to
[1:01:09] create a t by
[1:01:11] C and then here this gets renamed to
[1:01:14] just say x and x will be the addition of
[1:01:18] the token embeddings with the positional
[1:01:20] embeddings and here the broadcasting
[1:01:22] note will work out so B by T by C plus T
[1:01:25] by C
[1:01:26] this gets right aligned a new dimension
[1:01:28] of one gets added and it gets
[1:01:30] broadcasted across
[1:01:31] batch so at this point x holds not just
[1:01:34] the token identities but the positions
[1:01:37] at which these tokens occur and this is
[1:01:39] currently not that useful because of
[1:01:39] course we just have a simple bigram model
[1:01:43] so it doesn't matter if you're in the
[1:01:44] fifth position the second position or
[1:01:46] wherever it's all translation invariant
[1:01:48] at this stage uh so this information
[1:01:50] currently wouldn't help uh but as we
[1:01:52] work on the self attention block we'll
[1:01:54] see that this starts to matter
[1:01:59] okay so now we get to the crux of self
[1:02:01] attention so this is probably the most
[1:02:03] important part of this video to
[1:02:05] understand we're going to implement a
[1:02:07] small self attention for a single
[1:02:08] individual head as they're called so we
[1:02:11] start off with where we were so all of
[1:02:13] this code is familiar so right now I'm
[1:02:16] working with an example where I changed
[1:02:17] the number of channels from 2 to 32 so
[1:02:20] we have a 4x8 arrangement of tokens and
[1:02:24] the information at each token
[1:02:26] is currently 32 dimensional but we just
[1:02:28] are working with random
[1:02:30] numbers now we saw here that the code as
[1:02:34] we had it before does a simple
[1:02:37] average of all the past tokens
[1:02:41] and the current token so it's just the
[1:02:43] previous information and current
[1:02:44] information is just being mixed together
[1:02:45] in an average and that's what this code
[1:02:48] currently achieves and it does so by
[1:02:50] creating this lower triangular structure
[1:02:52] which allows us to mask out this
[1:02:55] wei matrix that we create so we mask it
[1:02:59] out and then we normalize it and
[1:03:01] currently when we initialize the
[1:03:03] affinities between all the different
[1:03:05] sort of tokens or nodes I'm going to use
[1:03:08] those terms
[1:03:09] interchangeably so when we initialize
[1:03:11] the affinities between all the different
[1:03:13] tokens to be zero then we see that wei
[1:03:16] gives us this um structure where every
[1:03:18] single row has these um uniform numbers
[1:03:22] and so that's what that's what then uh
[1:03:25] in this Matrix multiply makes it so that
[1:03:27] we're doing a simple
[1:03:28] average now we don't actually want this
[1:03:32] to be all uniform because different uh
[1:03:36] tokens will find different other tokens
[1:03:38] more or less interesting and we want
[1:03:40] that to be data dependent so for example
[1:03:42] if I'm a vowel then maybe I'm looking
[1:03:44] for consonants in my past and maybe I
[1:03:46] want to know what those consonants are
[1:03:48] and I want that information to flow to
[1:03:50] me and so I want to now gather
[1:03:52] information from the past but I want to
[1:03:54] do it in the data dependent way and this
[1:03:56] is the problem that self attention
[1:03:58] solves now the way self attention solves
[1:04:00] this is the following every single node
[1:04:03] or every single token at each position
[1:04:06] will emit two vectors it will emit a
[1:04:09] query and it will emit a
[1:04:12] key now the query Vector roughly
[1:04:15] speaking is what am I looking for and
[1:04:18] the key Vector roughly speaking is what
[1:04:20] do I
[1:04:21] contain and then the way we get
[1:04:24] affinities between these uh tokens now
[1:04:27] in a sequence is we basically just do a
[1:04:29] dot product between the keys and the
[1:04:31] queries so my query dot products with
[1:04:35] all the keys of all the other tokens and
[1:04:37] that dot product now becomes
[1:04:41] wei and so if the key and the query
[1:04:45] are sort of aligned they will interact
[1:04:47] to a very high amount and then I will
[1:04:50] get to learn more about that specific
[1:04:52] token as opposed to any other token in
[1:04:55] the sequence
[1:04:56] so let's implement this
[1:05:00] now we're going to implement a
[1:05:03] single what's called head of self
[1:05:07] attention so this is just one head
[1:05:09] there's a hyper parameter involved with
[1:05:10] these heads which is the head size and
[1:05:13] then here I'm initializing linear
[1:05:15] modules and I'm using bias equals false
[1:05:18] so these are just going to apply a
[1:05:19] matrix multiply with some fixed
[1:05:21] weights and now let me produce
[1:05:26] k and q by forwarding these modules on
[1:05:29] X so the size of this will now
[1:05:32] become B by T by 16 because that is the
[1:05:36] head size and the same here B by T by
[1:05:44] 16 so this being the head size so you
[1:05:47] see here that when I forward this linear
[1:05:49] on top of my X all the tokens in all the
[1:05:52] positions in the B by T arrangement all
[1:05:55] of them in parallel and
[1:05:57] independently produce a key and a query
[1:05:59] so no communication has happened
[1:06:01] yet but the communication comes now all
[1:06:04] the queries will dot product with all the
[1:06:07] keys so basically what we want is we
[1:06:09] want wei now or the affinities between
[1:06:12] these to be query multiplying key but we
[1:06:16] have to be careful we can't
[1:06:18] matrix multiply this directly we actually need to
[1:06:20] transpose k but we also have to be
[1:06:23] careful because we have
[1:06:25] the batch dimension so in particular we
[1:06:27] want to transpose the last two
[1:06:30] dimensions dimension -1 and dimension -2
[1:06:33] so
[1:06:36] -2, -1 and so this matrix multiply now will
[1:06:40] basically do the following B by T by
[1:06:44] 16 Matrix multiplies B by 16 by T to
[1:06:49] give us B by T by
[1:06:53] T right
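The shape bookkeeping above can be traced in a short sketch. The sizes (B=4, T=8, C=32, head size 16) match the running example; the seed and the random input `x` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
B, T, C = 4, 8, 32  # batch, time, channels
head_size = 16
x = torch.randn(B, T, C)

# Linear projections without bias: a plain matrix multiply per token.
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)

k = key(x)    # (B, T, 16), computed independently for every token
q = query(x)  # (B, T, 16)

# Transpose only the last two dimensions of k, leaving the batch dimension alone.
wei = q @ k.transpose(-2, -1)  # (B, T, 16) @ (B, 16, T) -> (B, T, T)
print(wei.shape)  # torch.Size([4, 8, 8])
```

Entry `wei[b, i, j]` is the dot product of token i's query with token j's key inside batch element b, so no information crosses between batch elements.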
[1:06:56] so for every row of B we're now going to
[1:06:58] have a t Square Matrix giving us the
[1:07:01] affinities and these are now the wei so
[1:07:04] they're not zeros they are now coming
[1:07:06] from this dot product between the keys
[1:07:08] and the queries so this can now run I
[1:07:11] can I can run this and the weighted
[1:07:13] aggregation now is a function in a data-
[1:07:16] dependent manner between the keys and
[1:07:18] queries of these nodes so just
[1:07:20] inspecting what happened
[1:07:22] here the wei takes on this form
[1:07:26] and you see that before wei was just
[1:07:29] a constant so it was applied in the same
[1:07:31] way to all the batch elements but now
[1:07:33] every single batch element will have
[1:07:34] different wei because every
[1:07:37] single batch element contains different
[1:07:39] tokens at different positions and so
[1:07:41] this is now data dependent so when we
[1:07:44] look at just the zeroth uh Row for
[1:07:47] example in the input these are the
[1:07:49] weights that came out and so you can see
[1:07:51] now that they're not just exactly
[1:07:53] uniform um and in particular as an
[1:07:55] example here for the last row this was
[1:07:58] the eighth token and the eighth token
[1:08:00] knows what content it has and it knows
[1:08:02] at what position it's in and now the eighth
[1:08:04] token based on that creates a query
[1:08:08] hey I'm looking for this kind of stuff
[1:08:10] I'm a vowel I'm in the eighth position I'm
[1:08:12] looking for any consonant at positions
[1:08:14] up to four and then all the nodes get to
[1:08:18] emit keys and maybe one of the channels
[1:08:20] could be I am a I am a consonant and I
[1:08:23] am in a position up to four and that
[1:08:25] that key would have a high number in
[1:08:27] that specific Channel and that's how the
[1:08:29] query and the key when they dot product
[1:08:31] they can find each other and create a
[1:08:33] high affinity and when they have a high
[1:08:35] Affinity like say uh this token was
[1:08:38] pretty interesting to uh to this eighth
[1:08:41] token when they have a high Affinity
[1:08:43] then through the softmax I will end up
[1:08:45] aggregating a lot of its information
[1:08:47] into my position and so I'll get to
[1:08:49] learn a lot about
[1:08:51] it now we're looking at wei
[1:08:55] after this has already happened
[1:08:59] let me erase this operation as well so let
[1:09:01] me erase the masking and the softmax
[1:09:03] just to show you the under the hood
[1:09:04] internals and how that works so without
[1:09:07] the masking and the softmax wei comes
[1:09:09] out like this right this is the outputs
[1:09:11] of the dot products and these are the
[1:09:14] raw outputs and they take on values from
[1:09:15] negative you know two to positive two
[1:09:18] Etc so that's the raw interactions and
[1:09:21] raw affinities between all the nodes but
[1:09:24] now if I'm a fifth node I
[1:09:26] will not want to aggregate anything from
[1:09:28] the sixth node seventh node and the
[1:09:30] eighth node so actually we use the upper
[1:09:32] triangular masking so those are not
[1:09:35] allowed to
[1:09:37] communicate and now we actually want to
[1:09:40] have a nice uh distribution uh so we
[1:09:42] don't want to aggregate negative 0.11 of
[1:09:45] this node that's crazy so instead we
[1:09:47] exponentiate and normalize and now we
[1:09:49] get a nice distribution that sums to one
[1:09:51] and this is telling us now in the data
[1:09:52] dependent manner how much of information
[1:09:54] to aggregate from any of these tokens in
[1:09:56] the
[1:09:58] past so that's wei and it's not zeros
[1:10:01] anymore but it's calculated in this
[1:10:04] way now there's one more uh part to a
[1:10:08] single self attention head and that is
[1:10:10] that when we do the aggregation we don't
[1:10:12] actually aggregate the tokens exactly we
[1:10:15] aggregate we produce one more value here
[1:10:17] and we call that the
[1:10:20] value so in the same way that we
[1:10:22] produced key and query we're also going to
[1:10:23] create a value
[1:10:26] and
[1:10:26] then here we don't
[1:10:30] aggregate X we calculate a v which is
[1:10:34] just achieved by uh propagating this
[1:10:37] linear on top of X again and then we
[1:10:40] output wei multiplied by v so v is the
[1:10:44] elements that we aggregate or the the
[1:10:46] vectors that we aggregate instead of the
[1:10:47] raw
[1:10:48] X and now of course uh this will make it
[1:10:51] so that the output here of this single
[1:10:53] head will be 16 dimensional because that
[1:10:55] is the head
[1:10:57] size so you can think of X as kind of
[1:10:59] like private information to this token
[1:11:01] if you if you think about it that way so
[1:11:03] X is kind of private to this token so
[1:11:06] I'm a fifth token and I have
[1:11:08] some identity and uh my information is
[1:11:11] kept in Vector X and now for the
[1:11:14] purposes of the single head here's what
[1:11:16] I'm interested in here's what I have and
[1:11:20] if you find me interesting here's what I
[1:11:21] will communicate to you and that's
[1:11:23] stored in v and so V is the thing that
[1:11:26] gets aggregated for the purposes of this
[1:11:28] single head between the different
[1:11:30] nodes and that's basically the self
[1:11:34] attention mechanism this is what
[1:11:36] it does there are a few notes that I
[1:11:39] would like to make about attention
[1:11:41] number one attention is a communication
[1:11:44] mechanism you can really think about it
[1:11:46] as a communication mechanism where you
[1:11:48] have a number of nodes in a directed
[1:11:50] graph where basically you have edges
[1:11:52] pointed between nodes like
[1:11:53] this and what happens is every node has
[1:11:56] some Vector of information and it gets
[1:11:58] to aggregate information via a weighted
[1:12:01] sum from all of the nodes that point to
[1:12:03] it and this is done in a data dependent
[1:12:06] manner so depending on whatever data is
[1:12:08] actually stored at each node at
[1:12:09] any point in time now our graph doesn't
[1:12:13] look like this our graph has a different
[1:12:15] structure we have eight nodes because
[1:12:17] the block size is eight and there's
[1:12:18] always eight
[1:12:20] tokens and uh the first node is only
[1:12:23] pointed to by itself the second node is
[1:12:25] pointed to by the first node and itself
[1:12:27] all the way up to the eighth node which
[1:12:29] is pointed to by all the previous nodes
[1:12:32] and itself and so that's the structure
[1:12:34] that our directed graph has or
[1:12:37] happens to have in an autoregressive sort
[1:12:38] of scenario like language modeling but
[1:12:41] in principle attention can be applied to
[1:12:42] any arbitrary directed graph and it's
[1:12:44] just a communication mechanism between
[1:12:46] the nodes the second note is that notice
[1:12:48] that there is no notion of space so
[1:12:51] attention simply acts over like a set of
[1:12:53] vectors in this graph and so by default
[1:12:56] these nodes have no idea where they are
[1:12:58] positioned in the space and that's why
[1:12:59] we need to encode them positionally and
[1:13:02] sort of give them some information that
[1:13:03] is anchored to a specific position so
[1:13:05] that they sort of know where they are
[1:13:08] and this is different than for example
[1:13:09] from convolution because if you run
[1:13:11] for example a convolution operation over
[1:13:13] some input there's a very specific sort
[1:13:15] of layout of the information in space
[1:13:18] and the convolutional filters sort of
[1:13:20] act in space and so it's not like
[1:13:23] attention in attention there is just a set
[1:13:26] of vectors out there in space they
[1:13:27] communicate and if you want them to have
[1:13:29] a notion of space you need to
[1:13:31] specifically add it which is what we've
[1:13:33] done when we calculated the
[1:13:36] positional encodings and
[1:13:38] added that information to the vectors
[1:13:40] the next thing that I hope is very clear
[1:13:41] is that the elements across the batch
[1:13:43] Dimension which are independent examples
[1:13:45] never talk to each other they're always
[1:13:47] processed independently and this is a
[1:13:49] batched matrix multiply that applies
[1:13:51] basically a matrix multiplication uh
[1:13:53] kind of in parallel across the batch
[1:13:54] dimension so maybe it would be more
[1:13:56] accurate to say that in this analogy of
[1:13:58] a directed graph we really have because
[1:14:00] the batch size is four we really have
[1:14:03] four separate pools of eight nodes and
[1:14:05] those eight nodes only talk to each
[1:14:07] other but in total there's like 32 nodes
[1:14:08] that are being processed uh but there's
[1:14:11] um sort of four separate pools of eight
[1:14:13] you can look at it that way the next
[1:14:15] note is that here in the case of
[1:14:18] language modeling uh we have this
[1:14:20] specific uh structure of directed graph
[1:14:22] where the future tokens will not
[1:14:24] communicate to the Past tokens but this
[1:14:27] doesn't necessarily have to be the
[1:14:28] constraint in the general case and in
[1:14:30] fact in many cases you may want to have
[1:14:32] all of the nodes talk to each other
[1:14:35] fully so as an example if you're doing
[1:14:37] sentiment analysis or something like
[1:14:38] that with a Transformer you might have a
[1:14:40] number of tokens and you may want to
[1:14:42] have them all talk to each other fully
[1:14:45] because later you are predicting for
[1:14:46] example the sentiment of the sentence
[1:14:49] and so it's okay for these nodes to talk
[1:14:50] to each other and so in those cases you
[1:14:53] will use an encoder block of self
[1:14:55] attention and uh all it means that it's
[1:14:58] an encoder block is that you will delete
[1:15:00] this line of code allowing all the nodes
[1:15:02] to completely talk to each other what
[1:15:04] we're implementing here is sometimes
[1:15:06] called a decoder block and it's called a
[1:15:09] decoder because it is sort of like
[1:15:12] decoding language and it's got this
[1:15:15] autoregressive format where you have
[1:15:17] to mask with the Triangular Matrix so
[1:15:19] that uh nodes from the future never talk
[1:15:22] to the Past because they would give away
[1:15:24] the answer
[1:15:25] and so basically in encoder blocks you
[1:15:27] would delete this allow all the nodes to
[1:15:29] talk in decoder blocks this will always
[1:15:31] be present so that you have this
[1:15:33] triangular structure uh but both are
[1:15:35] allowed and attention doesn't care
[1:15:36] attention supports arbitrary
[1:15:38] connectivity between nodes the next
[1:15:40] thing I wanted to comment on is you keep
[1:15:41] hearing me say attention
[1:15:43] self attention Etc there's actually also
[1:15:45] something called cross attention what is
[1:15:47] the
[1:15:47] difference
[1:15:49] so basically the reason this attention
[1:15:52] is self attention is because because the
[1:15:55] keys queries and the values are all
[1:15:57] coming from the same Source from X so
[1:16:01] the same Source X produces Keys queries
[1:16:03] and values so these nodes are self
[1:16:05] attending but in principle attention is
[1:16:08] much more General than that so for
[1:16:10] example in encoder-decoder Transformers
[1:16:12] uh you can have a case where the queries
[1:16:15] are produced from X but the keys and the
[1:16:17] values come from a whole separate
[1:16:18] external source and sometimes from uh
[1:16:21] encoder blocks that encode some context
[1:16:23] that we'd like to condition on
[1:16:25] and so the keys and the values will
[1:16:26] actually come from a whole separate
[1:16:28] Source those are nodes on the side and
[1:16:31] here we're just producing queries and
[1:16:32] we're reading off information from the
[1:16:34] side so cross attention is used when
[1:16:37] there's a separate source of nodes we'd
[1:16:40] like to pull information from into our
[1:16:42] nodes and it's self attention if we just
[1:16:45] have nodes that would like to look at
[1:16:46] each other and talk to each other so
[1:16:48] this attention here happens to be self
[1:16:51] attention but in principle um attention
[1:16:55] is a lot more General okay and the last
[1:16:57] note at this stage is if we come to the
[1:16:59] Attention Is All You Need paper here we've
[1:17:01] already implemented attention so given
[1:17:03] query key and value we've multiplied
[1:17:06] the query and the key we've softmaxed it
[1:17:09] and then we are aggregating the values
[1:17:11] there's one more thing that we're
[1:17:12] missing here which is the scaling by
[1:17:13] one over the square root of the head size the
[1:17:16] d_k here is the head size why are they
[1:17:18] doing this why is this important so they
[1:17:21] call it the scaled attention and it's
[1:17:24] kind of like an important normalization
[1:17:25] to basically
[1:17:26] have the problem is if you have unit Gaussian
[1:17:29] inputs so zero mean unit variance k
[1:17:32] and q are unit Gaussian then if you just
[1:17:34] do wei naively then you see that your wei
[1:17:37] actually will be the variance will be
[1:17:38] on the order of head size which in our
[1:17:40] case is 16 but if you multiply by one
[1:17:43] over head size square root so this is
[1:17:45] square root and this is one
[1:17:47] over then the variance of wei will be one
[1:17:50] so it will be
[1:17:52] preserved now why is this important
[1:17:54] you'll notice that wei
[1:17:56] here will feed into
[1:17:58] softmax and so it's really important
[1:18:00] especially at initialization that wei be
[1:18:03] fairly diffuse so in our case here we
[1:18:06] sort of lucked out here and we had
[1:18:10] fairly diffuse numbers here so like
[1:18:13] this now the problem is that because of
[1:18:15] softmax if wei takes on very positive
[1:18:18] and very negative numbers inside it
[1:18:20] softmax will actually converge towards
[1:18:22] one hot vectors and so I can illustrate
[1:18:25] that here um say we are applying softmax
[1:18:29] to a tensor of values that are very
[1:18:31] close to zero then we're going to get a
[1:18:33] diffuse thing out of
[1:18:34] softmax but the moment I take the exact
[1:18:36] same thing and I start sharpening it
[1:18:38] making it bigger by multiplying these
[1:18:40] numbers by eight for example you'll see
[1:18:42] that the softmax will start to sharpen
[1:18:44] and in fact it will sharpen towards the
[1:18:46] max so it will sharpen towards whatever
[1:18:48] number here is the highest and so um
[1:18:51] basically we don't want these values to
[1:18:52] be too extreme especially at
[1:18:53] initialization otherwise softmax will be
[1:18:55] way too peaky and um you're basically
[1:18:58] aggregating um information from like a
[1:19:01] single node every node just aggregates
[1:19:03] information from a single other node
[1:19:04] that's not what we want especially at
[1:19:06] initialization and so the scaling is
[1:19:08] used just to control the variance at
[1:19:11] initialization okay so having said all
[1:19:13] that let's now take our self attention
[1:19:15] knowledge and let's uh take it for a
[1:19:17] spin so here in the code I created this
[1:19:19] head module and it implements a single
[1:19:22] head of self attention so you give it a
[1:19:24] head size and then here it creates the
[1:19:26] key query and the value linear layers
[1:19:29] typically people don't use biases in
[1:19:31] these uh so those are the linear
[1:19:33] projections that we're going to apply to
[1:19:34] all of our nodes now here I'm creating
[1:19:37] this tril variable tril is not a
[1:19:40] parameter of the module so in sort of
[1:19:41] PyTorch naming conventions this is
[1:19:43] called a buffer it's not a parameter and
[1:19:46] you have to assign
[1:19:47] it to the module using register_buffer
[1:19:49] so that creates the tril the
[1:19:52] lower triangular matrix and when we're given
[1:19:55] the input x this should look very
[1:19:56] familiar now we calculate the keys the
[1:19:58] queries we calculate the attention
[1:20:00] scores inside wei we normalize it so
[1:20:03] we're using scaled attention here then
[1:20:06] we make sure that uh future doesn't
[1:20:08] communicate with the past so this makes
[1:20:10] it a decoder block and then softmax and
[1:20:13] then aggregate the value and
[1:20:15] output then here in the language model
[1:20:17] I'm creating a head in the Constructor
[1:20:20] and I'm calling it self attention head
[1:20:22] and the head size I'm going to keep
[1:20:24] the same as n_embd just for
[1:20:27] now and then here once we've encoded the
[1:20:31] information with the token embeddings
[1:20:32] and the position embeddings we're simply
[1:20:34] going to feed it into the self attention
[1:20:36] head and then the output of that is
[1:20:38] going to go into uh the decoder language
[1:20:42] modeling head and create the logits so
[1:20:44] this is sort of the simplest way to
[1:20:46] plug in a self attention component uh
[1:20:49] into our Network right now I had to make
[1:20:51] one more change which is that here in
[1:20:55] the generate uh we have to make sure
[1:20:57] that our idx that we feed into the model
[1:21:01] because now we're using positional
[1:21:02] embeddings we can never have more than
[1:21:04] block size coming in because if idx is
[1:21:07] more than block size then our position
[1:21:09] embedding table is going to run out of
[1:21:11] scope because it only has embeddings for
[1:21:12] up to block size and so therefore I
[1:21:15] added some code here to crop the
[1:21:17] context that we're going to feed into
[1:21:20] self so that we never pass in more
[1:21:23] than block_size elements
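The single head and the context crop described above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the video: the helper name `crop_context` is hypothetical, the sizes (`n_embd = 32`, `block_size = 8`) are the small values in use at this point, and the scores are scaled by the square root of the head size as in the scaled attention formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8  # small values used at this point in the walkthrough


class Head(nn.Module):
    """One head of causal (masked) self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # tril is a buffer, not a parameter: it lives on the module but is never trained
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)  # each (B, T, head_size)
        # scaled scores: keeps the variance of wei under control at initialization
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # mask out the future so each position only sees itself and the past
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # (B, T, head_size)


def crop_context(idx):
    """Hypothetical helper: keep only the last block_size tokens of idx (B, T)."""
    return idx[:, -block_size:]
```

Inside a generate loop, cropping like this before each forward pass keeps the position indices within the range of the position embedding table.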
[1:21:25] so those are the changes and let's Now
[1:21:27] train the network okay so I also came up
[1:21:29] to the script here and I decreased the
[1:21:30] learning rate because uh the self
[1:21:32] attention can't tolerate very very high
[1:21:34] learning rates and then I also increased
[1:21:36] number of iterations because the
[1:21:37] learning rate is lower and then I
[1:21:39] trained it and previously we were only
[1:21:41] able to get down to 2.5 and now we are
[1:21:43] down to 2.4 so we definitely see a
[1:21:46] little bit of an improvement from 2.5 to
[1:21:48] 2.4 roughly uh but the text is still not
[1:21:51] amazing so clearly the self attention
[1:21:53] head is doing some useful communication
[1:21:56] but um we still have a long way to go
[1:21:59] okay so now we've implemented the scaled
[1:22:01] dot-product attention now next up in the
[1:22:02] Attention Is All You Need paper there's
[1:22:05] something called multi-head attention
[1:22:07] and what is multi-head attention it's
[1:22:09] just applying multiple attentions in
[1:22:11] parallel and concatenating their results
[1:22:13] so they have a little bit of a diagram
[1:22:15] here I don't know if this is super clear
[1:22:18] it's really just multiple attentions in
[1:22:20] parallel so let's Implement that fairly
[1:22:23] straightforward
[1:22:25] if we want a multi-head attention then
[1:22:27] we want multiple heads of self attention
[1:22:28] running in parallel so in pytorch we can
[1:22:32] do this by simply creating multiple
[1:22:35] heads so however many
[1:22:38] heads you want and then what is the head
[1:22:39] size of each and then we run all of them
[1:22:43] in parallel into a list and simply
[1:22:46] concatenate all of the outputs and we're
[1:22:48] concatenating over the channel
[1:22:50] Dimension so the way this looks now is
[1:22:53] we don't have just a single attention
[1:22:56] that has a head size of 32 because
[1:22:59] remember n_embd is
[1:23:00] 32 instead of having one communication
[1:23:03] channel we now have four communication
[1:23:06] channels in parallel and each one of
[1:23:08] these communication channels typically
[1:23:10] will be uh smaller uh correspondingly so
[1:23:14] because we have four communication
[1:23:15] channels we want eight dimensional self
[1:23:18] attention and so from each communication
[1:23:20] channel we're going to gather eight
[1:23:22] dimensional vectors and then we have
[1:23:23] four of them and that concatenates to
[1:23:25] give us 32 which is the original
[1:23:28] n_embd and so this is kind of similar to
[1:23:30] um if you're familiar with convolutions
[1:23:32] this is kind of like a group convolution
[1:23:34] uh because basically instead of having
[1:23:36] one large convolution we do convolution
[1:23:38] in groups and uh that's multi-headed
[1:23:40] self
[1:23:41] attention and so then here we just use
[1:23:44] sa_heads self attention heads instead
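As a rough illustration of the multi-head step just described, the sketch below runs four heads of size 8 and concatenates them over the channel dimension to get back to 32. It assumes the same small sizes as before (`n_embd = 32`, `block_size = 8`); the compact `Head` class is repeated only so the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8  # assumed small values from this stage of the walkthrough


class Head(nn.Module):
    """Compact single head of causal self-attention (repeated for self-containment)."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v


class MultiHeadAttention(nn.Module):
    """Several heads run side by side; outputs are concatenated over channels."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # each head returns (B, T, head_size); concatenation gives (B, T, num_heads * head_size)
        return torch.cat([h(x) for h in self.heads], dim=-1)


sa_heads = MultiHeadAttention(num_heads=4, head_size=n_embd // 4)  # 4 heads of size 8
```

Choosing `head_size = n_embd // num_heads` is what makes the concatenated output line up with the embedding dimension again.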
[1:23:47] now I actually ran it and uh scrolling
[1:23:51] down I ran the same thing and then we
[1:23:53] now get this down to 2.28 roughly and
[1:23:57] the output is still the generation is
[1:23:58] still not amazing but clearly the
[1:24:00] validation loss is improving because we
[1:24:02] were at 2.4 just now and so it helps to
[1:24:05] have multiple communication channels
[1:24:07] because obviously these tokens have a
[1:24:09] lot to talk about they want to find the
[1:24:11] consonants the vowels they want to find
[1:24:13] the vowels just from certain positions
[1:24:15] uh they want to find any kinds of
[1:24:17] different things and so it helps to
[1:24:19] create multiple independent channels of
[1:24:20] communication gather lots of different
[1:24:22] types of data and then uh decode the
[1:24:25] output now going back to the paper for a
[1:24:27] second of course I didn't explain this
[1:24:28] figure in full detail but we are
[1:24:30] starting to see some components of what
[1:24:32] we've already implemented we have the
[1:24:33] positional encodings the token encodings
[1:24:35] that add we have the masked multi-headed
[1:24:37] attention implemented now here's another
[1:24:41] multi-headed attention which is a cross
[1:24:42] attention to an encoder which
[1:24:45] we're not going to implement in this
[1:24:46] case I'm going to come back to that
[1:24:48] later but I want you to notice that
[1:24:50] there's a feed forward part here and
[1:24:52] then this is grouped into a block that
[1:24:53] gets repeated again and again now the
[1:24:56] feedforward part here is just a simple
[1:24:57] uh multi-layer perceptron
[1:25:00] so here position
[1:25:04] wise feed forward networks is just a
[1:25:06] simple little MLP so I want to start
[1:25:08] basically in a similar fashion also
[1:25:10] adding computation into the network and
[1:25:13] this computation is on a per node level
[1:25:16] so I've already implemented it and you
[1:25:18] can see the diff highlighted on the left
[1:25:20] here when I've added or changed things
[1:25:22] now before we had the multi-headed
[1:25:25] self attention that did the
[1:25:26] communication but we went way too fast
[1:25:28] to calculate the logits so the tokens
[1:25:31] looked at each other but didn't really
[1:25:32] have a lot of time to think on what they
[1:25:35] found from the other tokens and so what
[1:25:38] I've implemented here is a little feed
[1:25:40] forward single layer and this little
[1:25:42] layer is just a linear followed by a ReLU
[1:25:45] nonlinearity and that's it so
[1:25:48] it's just a little layer and then I call
[1:25:50] it feed
[1:25:52] forward with n_embd
[1:25:54] and then this feed forward is just
[1:25:56] called sequentially right after the self
[1:25:58] attention so we self attend then we feed
[1:26:01] forward and you'll notice that the feed
[1:26:02] forward here when it's applying linear
[1:26:04] this is on a per token level all the
[1:26:06] tokens do this independently so the self
[1:26:09] attention is the communication and then
[1:26:11] once they've gathered all the data now
[1:26:13] they need to think on that data
[1:26:15] individually and so that's what feed
[1:26:16] forward is doing and that's why I've
[1:26:18] added it here now when I train this the
[1:26:21] validation loss actually continues to go
[1:26:23] down now to 2.24 which is down from
[1:26:26] 2.28 the output still looks kind of
[1:26:28] terrible but at least we've improved the
[1:26:31] situation and so as a preview we're
[1:26:34] going to now start to intersperse the
[1:26:37] communication with the computation and
[1:26:39] that's also what the Transformer does
[1:26:42] when it has blocks that communicate and
[1:26:44] then compute and it groups them and
[1:26:46] replicates them okay so let me show you
[1:26:49] what we'd like to do we'd like to do
[1:26:51] something like this we have a block and
[1:26:53] this block is basically this part
[1:26:55] here except for the cross
[1:26:57] attention now the block basically
[1:26:59] intersperses communication and then
[1:27:01] computation the
[1:27:03] communication is done using multi-headed
[1:27:05] self attention and then the
[1:27:07] computation is done using a feed forward
[1:27:08] Network on all the tokens
[1:27:11] independently now what I've added here
[1:27:14] also is you'll
[1:27:16] notice this takes n_embd
[1:27:18] the embedding dimension
[1:27:19] and the number of heads that we would like
[1:27:21] which is kind of like the number of groups in
[1:27:22] group convolution and I'm saying
[1:27:24] that the number of heads we'd like is four
[1:27:26] and so because n_embd is 32 we calculate
[1:27:29] that with four
[1:27:31] heads the head size
[1:27:34] should be eight so that everything sort
[1:27:36] of works out Channel wise um so this is
[1:27:39] how the Transformer structures uh sort
[1:27:41] of the uh the sizes typically so the
[1:27:44] head size will become eight and then
[1:27:45] this is how we want to intersperse them
[1:27:47] and then here I'm trying to create
[1:27:49] blocks which is just a sequential
[1:27:51] application of block block block so that
[1:27:53] we're interspersing communication feed
[1:27:55] forward many many times and then finally
[1:27:57] we decode now I actually tried to run
[1:28:01] this and the problem is this doesn't
[1:28:02] actually give a very good uh answer and
[1:28:05] very good result and the reason for that
[1:28:07] is we're starting to actually get
[1:28:09] like a pretty deep neural net and deep
[1:28:11] neural Nets uh suffer from optimization
[1:28:13] issues and I think that's what we're
[1:28:14] kind of like slightly starting to run
[1:28:16] into so we need one more idea that we
[1:28:18] can borrow from the um Transformer paper
[1:28:21] to resolve those difficulties now there
[1:28:23] are two optimizations that dramatically
[1:28:25] help with the depth of these networks
[1:28:27] and make sure that the networks remain
[1:28:29] optimizable let's talk about the first
[1:28:31] one the first one in this diagram is you
[1:28:33] see this Arrow here and then this arrow
[1:28:36] and this Arrow those are skip
[1:28:38] connections or sometimes called residual
[1:28:40] connections they come from this paper
[1:28:43] Deep Residual Learning for Image
[1:28:44] Recognition from about
[1:28:46] 2015 that introduced the concept now
[1:28:51] basically what it means is you
[1:28:53] transform data but then you have a skip
[1:28:55] connection with addition from the
[1:28:57] previous features now the way I like to
[1:29:00] visualize it is the
[1:29:03] following here the computation happens
[1:29:05] from the top to bottom and basically you
[1:29:08] have this uh residual pathway and you
[1:29:11] are free to Fork off from the residual
[1:29:13] pathway perform some computation and
[1:29:15] then project back to the residual
[1:29:16] pathway via addition and so you go from
[1:29:19] the inputs to the targets only
[1:29:22] via plus and plus and plus and the reason
[1:29:25] this is useful is because during back
[1:29:27] propagation remember from our micrograd
[1:29:29] video earlier addition distributes
[1:29:32] gradients equally to both of its
[1:29:34] branches that fed in as the input and
[1:29:37] so the supervision or the gradients from
[1:29:40] the loss basically hop through every
[1:29:43] addition node all the way to the input
[1:29:46] and then also Fork off into the residual
[1:29:50] blocks but basically you have this
[1:29:52] gradient Super Highway that goes
[1:29:53] directly from the supervision all the
[1:29:55] way to the input unimpeded and then
[1:29:58] these residual blocks are usually
[1:29:59] initialized in the beginning so they
[1:30:01] contribute very very little if anything
[1:30:03] to the residual pathway they are
[1:30:05] initialized that way so in the beginning
[1:30:07] they are sort of almost kind of like not
[1:30:09] there but then during the optimization
[1:30:11] they come online over time and they uh
[1:30:14] start to contribute but at least at the
[1:30:17] initialization you can go directly from
[1:30:19] supervision to the input the gradient is
[1:30:21] unimpeded and just flows and then the
[1:30:23] blocks over time
[1:30:24] kick in and so that dramatically helps
[1:30:27] with the optimization so let's implement
[1:30:29] this so coming back to our block here
[1:30:31] basically what we want to do is we want
[1:30:33] to do x =
[1:30:35] x + self attention and x = x + self feed
[1:30:39] forward so this is x and then we fork
[1:30:43] off and do some communication and come
[1:30:45] back and we Fork off and we do some
[1:30:46] computation and come back so those are
[1:30:49] residual connections and then swinging
[1:30:51] back up here we also have to introduce
[1:30:54] this projection so nn.Linear
[1:30:57] and this is going to be
[1:31:00] from after we concatenate this this is
[1:31:03] of size n_embd so this is the output
[1:31:05] of the self attention itself but then we
[1:31:08] actually want to apply the
[1:31:11] projection and that's the
[1:31:13] result so the projection is just a
[1:31:15] linear transformation of the outcome of
[1:31:16] this
[1:31:17] layer so that's the projection back into
[1:31:20] the residual pathway and then here in
[1:31:22] feed forward it's going to be the
[1:31:23] same thing I could have a self.proj
[1:31:26] here as well but let me just
[1:31:28] simplify it and let me couple it
[1:31:32] inside the same sequential container and
[1:31:34] so this is the projection layer going
[1:31:36] back into the residual
[1:31:38] pathway and
[1:31:40] so that's uh well that's it so now we
[1:31:43] can train this so I implemented one more
[1:31:44] small change when you look into the
[1:31:47] paper again you see that the
[1:31:49] dimensionality of input and output is
[1:31:51] 512 for them and they're saying that the
[1:31:53] inner layer here in the feed forward has
[1:31:55] dimensionality of 2048 so there's a
[1:31:57] multiplier of four and so the inner
[1:32:00] layer of the feed forward network should
[1:32:02] be multiplied by four in terms of
[1:32:04] Channel sizes so I came here and I
[1:32:06] multiplied four times n_embd here for the
[1:32:08] feed forward and then from four times
[1:32:10] n_embd coming back down to n_embd when we go
[1:32:13] back to the projection so
[1:32:15] adding a bit of computation here and
[1:32:17] growing that layer that is in the
[1:32:19] residual block on the side of the
[1:32:21] residual
[1:32:22] pathway and then I train this and we
[1:32:24] actually get down all the way to uh 2.08
[1:32:27] validation loss and we also see that
[1:32:29] the network is starting to get big enough
[1:32:30] that our train loss is getting ahead of
[1:32:32] validation loss so we're starting to see
[1:32:33] like a little bit of
[1:32:35] overfitting and our
[1:32:38] generations here are still not
[1:32:41] amazing but at least you see that we can
[1:32:42] see like is here this now grief syn like
[1:32:46] this starts to almost look like English
[1:32:48] so um yeah we're starting to really get
[1:32:50] there okay and the second Innovation
[1:32:52] that is very helpful for optimizing very
[1:32:54] deep neural networks is right here so we
[1:32:57] have this addition now that's the
[1:32:58] residual part but this Norm is referring
[1:33:00] to something called layer Norm so layer
[1:33:03] Norm is implemented in pytorch it's a
[1:33:04] paper that came out a while back here
[1:33:09] um and layer Norm is very very similar
[1:33:11] to batch norm so remember back to our
[1:33:14] makemore series part three we
[1:33:16] implemented batch
[1:33:17] normalization and batch normalization
[1:33:19] basically just made sure that across
[1:33:22] the batch dimension any individual neuron
[1:33:25] had a unit Gaussian distribution so it
[1:33:30] was zero mean and unit standard
[1:33:32] deviation output
[1:33:35] so what I did here is I'm copy pasting
[1:33:37] the BatchNorm1d that we developed in our
[1:33:39] makemore series and see here we can
[1:33:42] initialize for example this module and
[1:33:44] we can have a batch of 32 vectors each 100
[1:33:47] dimensional feeding through the
[1:33:48] batch norm layer so what this does is it
[1:33:52] guarantees that when we look at just the
[1:33:54] zeroth column it's a zero mean one
[1:33:58] standard deviation so it's normalizing
[1:34:00] every single column of this uh input now
[1:34:04] the rows are not uh going to be
[1:34:06] normalized by default because we're just
[1:34:08] normalizing columns so let's now
[1:34:10] Implement layer Norm uh it's very
[1:34:12] complicated look we come here we change
[1:34:15] this from zero to one so we don't
[1:34:18] normalize The Columns we normalize the
[1:34:20] rows and now we've implemented layer
[1:34:23] Norm
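The "change zero to one" step can be illustrated with a short sketch. The function name `normalize` is hypothetical, the learnable gamma and beta are left out for brevity, and the biased variance is used here (an assumption made so the result can be compared with PyTorch's built-in layer norm).

```python
import torch
import torch.nn.functional as F


def normalize(x, dim, eps=1e-5):
    """Zero-mean, unit-variance normalization along one dimension of x."""
    mean = x.mean(dim, keepdim=True)
    var = x.var(dim, unbiased=False, keepdim=True)  # biased variance, as nn.LayerNorm uses
    return (x - mean) / torch.sqrt(var + eps)


torch.manual_seed(1337)
x = torch.randn(32, 100)          # batch of 32 vectors, each 100-dimensional

bn_like = normalize(x, dim=0)     # batch-norm style: every column is normalized
ln_like = normalize(x, dim=1)     # layer-norm style: every row is normalized

# the row-wise version matches PyTorch's functional layer norm (no gamma/beta)
reference = F.layer_norm(x, (100,))
```

Since the row-wise statistics depend only on the example itself, no running buffers and no train/test distinction are needed, which is the simplification described above.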
[1:34:25] so now the columns are not going to be
[1:34:28] normalized um but the rows are going to
[1:34:31] be normalized for every individual
[1:34:33] example its 100 dimensional vector is
[1:34:35] normalized uh in this way and because
[1:34:38] our computation Now does not span across
[1:34:40] examples we can delete all of this
[1:34:43] buffers stuff uh because uh we can
[1:34:45] always apply this operation and don't
[1:34:48] need to maintain any running buffers so
[1:34:50] we don't need the
[1:34:52] buffers
[1:34:54] there's no distinction between
[1:34:56] training and test
[1:34:58] time and we don't need these running
[1:35:00] buffers we do keep gamma and beta we
[1:35:03] don't need the momentum we don't care if
[1:35:05] it's training or not and this is now a
[1:35:08] layer
[1:35:09] norm and it normalizes the rows instead
[1:35:12] of the columns and this here is
[1:35:15] identical to basically this here so
[1:35:19] let's now Implement layer Norm in our
[1:35:21] Transformer before I incorporate the
[1:35:23] layer Norm I just wanted to note that as
[1:35:25] I said very few details about the
[1:35:27] Transformer have changed in the last 5
[1:35:28] years but this is actually something
[1:35:30] that slightly departs from the original
[1:35:31] paper you see that the Add & Norm is
[1:35:34] applied after the
[1:35:36] transformation but now it is a bit
[1:35:40] more common to apply the
[1:35:42] layer norm before the transformation so
[1:35:44] there's a reshuffling of the layer norms
[1:35:46] so this is called the pre-norm
[1:35:48] formulation and that's the one that
[1:35:49] we're going to implement as well so a
[1:35:50] slight deviation from the original paper
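Putting the residual connections, the 4x feed-forward layer and the pre-norm ordering together, a block might look roughly like the sketch below. To keep the example short it uses PyTorch's built-in `nn.MultiheadAttention` with a causal mask as a stand-in for the hand-written heads and projection; that substitution, and the attribute names, are assumptions for illustration rather than the code from the video.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Pre-norm Transformer block: communication, then per-token computation."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        # stand-in for the hand-written multi-head attention plus its projection
        self.sa = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # inner layer is 4x wider, as in the paper
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # projection back into the residual pathway
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        T = x.shape[1]
        # True marks pairs that may NOT attend: every position strictly in the future
        future = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)  # pre-norm: normalize before the sublayer
        attn_out, _ = self.sa(h, h, h, attn_mask=future, need_weights=False)
        x = x + attn_out                 # fork off, communicate, add back
        x = x + self.ffwd(self.ln2(x))   # fork off, compute, add back
        return x


torch.manual_seed(0)
blocks = nn.Sequential(Block(32, 4), Block(32, 4), Block(32, 4))
x = torch.randn(2, 8, 32)
y = blocks(x)  # same shape as x, so blocks can be stacked freely
```

Because each block maps `(B, T, n_embd)` to the same shape and only ever adds to `x`, the stack keeps a direct additive path from the input to the output.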
[1:35:53] basically we need two layer norms layer
[1:35:55] norm one is nn.LayerNorm and we
[1:35:59] tell it what is the
[1:36:01] embedding Dimension and we need the
[1:36:03] second layer norm and then here the
[1:36:06] layer Norms are applied immediately on X
[1:36:09] so self layer norm one applied on x and
[1:36:13] self layer norm two applied on x before
[1:36:15] it goes into self attention and feed
[1:36:18] forward and uh the size of the layer
[1:36:20] norm here is n_embd so 32 so when the
[1:36:23] layer norm is normalizing our features
[1:36:26] the way the normalization here
[1:36:30] happens is the mean and the variance are
[1:36:32] taken over 32 numbers so the batch and
[1:36:34] the time act as batch Dimensions both of
[1:36:37] them so this is kind of like a per token
[1:36:40] transformation that just normalizes
[1:36:42] the features and makes them
[1:36:46] unit Gaussian at
[1:36:48] initialization but of course because
[1:36:50] these layer Norms inside it have these
[1:36:52] gamma and beta trainable
[1:36:54] parameters the layer norm will
[1:36:57] eventually create outputs that might not
[1:36:59] be unit Gaussian but the optimization will
[1:37:01] determine that so for now
[1:37:05] this is incorporating the layer norms
[1:37:06] and let's train them okay so I let it
[1:37:09] run and we see that we get down to 2.06
[1:37:12] which is better than the previous 2.08
[1:37:14] so a slight Improvement by adding the
[1:37:15] layer norms and I'd expect that they
[1:37:17] help even more if we had a bigger and
[1:37:19] deeper network one more thing I forgot
[1:37:21] to add is that there should be a layer
[1:37:23] norm here also typically at the end
[1:37:26] of the Transformer and right before the
[1:37:28] final uh linear layer that decodes into
[1:37:31] vocabulary so I added that as well so at
[1:37:35] this stage we actually have a pretty
[1:37:36] complete uh Transformer according to the
[1:37:38] original paper and it's a decoder only
[1:37:40] Transformer I'll talk about that in
[1:37:42] a second uh but at this stage uh the
[1:37:44] major pieces are in place so we can try
[1:37:46] to scale this up and see how well we can
[1:37:47] push this number now in order to scale
[1:37:50] out the model I had to perform some
[1:37:51] cosmetic changes here to make it nicer
[1:37:54] so I introduced this variable called
[1:37:56] n_layer which just specifies how many
[1:37:57] layers of the blocks we're going to have
[1:38:01] I created a bunch of blocks and we have
[1:38:02] a new variable number of heads as well I
[1:38:05] pulled out the layer Norm here and uh so
[1:38:07] this is identical now one thing that I
[1:38:10] did briefly change is I added a Dropout
[1:38:13] so Dropout is something that you can add
[1:38:15] right before the residual connection
[1:38:17] right before the connection back
[1:38:19] into the residual pathway so we can drop
[1:38:22] out that as the last layer here we can drop out
[1:38:26] here at the end of the multi-headed
[1:38:27] attention as well and we can also drop
[1:38:30] out here when we calculate the
[1:38:34] basically affinities and after the
[1:38:36] softmax we can drop out some of those so
[1:38:38] we can randomly prevent some of the
[1:38:40] nodes from
[1:38:41] communicating and so Dropout uh comes
[1:38:43] from this paper from 2014 or so and
[1:38:49] basically it takes your neural
[1:38:50] net and it randomly every forward
[1:38:53] backward pass shuts off some subset of
[1:38:56] uh neurons so randomly drops them to
[1:38:59] zero and trains without them and what
[1:39:02] this does effectively is because the
[1:39:04] mask of what's being dropped out is
[1:39:06] changed every single forward backward
[1:39:07] pass it ends up kind of uh training an
[1:39:11] ensemble of sub networks and then at
[1:39:13] test time everything is fully enabled
[1:39:15] and kind of all of those sub networks
[1:39:16] are merged into a single ensemble if you
[1:39:18] want to think about it that
[1:39:20] way so I would read the paper to get the
[1:39:22] full detail for now we're just going to
[1:39:24] stay on the level of this is a
[1:39:25] regularization technique and I added it
[1:39:28] because I'm about to scale up the model
[1:39:30] quite a bit and I was concerned about
[1:39:32] overfitting so now when we scroll up to
[1:39:34] the top uh we'll see that I changed a
[1:39:36] number of hyper parameters here about
[1:39:38] our neural net so I made the batch size
[1:39:40] be much larger now it's 64 I changed the
[1:39:43] block size to be 256 so previously it
[1:39:46] was just eight eight characters of
[1:39:47] context now it is 256 characters of
[1:39:50] context to predict the 257th
[1:39:54] uh I brought down the learning rate a
[1:39:55] little bit because the neural net is now
[1:39:57] much bigger so I brought down the
[1:39:58] learning rate the embedding Dimension is