[0:00] hi everyone so by now you have probably
[0:02] heard of ChatGPT it has taken the world
[0:04] and the AI community by storm and it is a
[0:07] system that allows you to interact with
[0:09] an AI and give it text based tasks so
[0:12] for example we can ask ChatGPT to write
[0:15] us a small haiku about how important it is
[0:16] that people understand AI and then they
[0:18] can use it to improve the world and make
[0:20] it more prosperous so when we run this
[0:23] AI knowledge brings prosperity for all
[0:25] to see Embrace its
[0:27] power okay not bad and so you could see
[0:29] that ChatGPT went from left to right and
[0:32] generated all these words sort of
[0:35] sequentially now I asked it already the
[0:37] exact same prompt a little bit earlier
[0:39] and it generated a slightly different
[0:41] outcome AI's power to grow ignorance
[0:44] holds us back learn prosperity waits
[0:47] so pretty good in both cases and
[0:49] slightly different so you can see that
[0:50] ChatGPT is a probabilistic system and
[0:52] for any one prompt it can give us
[0:54] multiple answers in reply to
[0:57] it now this is just one example of a
[0:59] prompt people have come up with many
[1:01] many examples and there are entire
[1:03] websites that index interactions with
[1:06] ChatGPT and so many of them are quite
[1:08] humorous explain HTML to me like I'm a
[1:10] dog write release notes for chess 2
[1:14] write a note about Elon Musk buying
[1:16] Twitter and so on so as an example
[1:20] please write a breaking news article
[1:21] about a leaf falling from a
[1:23] tree and in a shocking turn of events a
[1:26] leaf has fallen from a tree in the local
[1:28] park witnesses report that the leaf
[1:30] which was previously attached to a
[1:31] branch of a tree detached itself and
[1:33] fell to the ground very dramatic so you
[1:36] can see that this is a pretty remarkable
[1:37] system and it is what we call a language
[1:40] model because it models the
[1:43] sequence of words or characters or
[1:46] tokens more generally and it knows how
[1:49] words follow each other in the
[1:50] English language and so from its
[1:52] perspective what it is doing is it is
[1:55] completing the sequence so I give it the
[1:57] start of a sequence and it completes the
[2:00] sequence with the outcome and so it's a
[2:02] language model in that sense now I would
[2:05] like to focus on the
[2:07] under the hood components of what
[2:09] makes ChatGPT work so what is the neural
[2:12] network under the hood that models the
[2:14] sequence of these words and that comes
[2:17] from this paper called Attention Is All
[2:19] You Need from 2017 a
[2:23] landmark paper in AI that
[2:25] proposed the Transformer
[2:27] architecture so GPT is short for
[2:31] generative pre-trained
[2:33] Transformer so the Transformer is the neural
[2:35] net that actually does all the heavy
[2:36] lifting under the hood it comes from
[2:39] this paper in 2017 now if you read this
[2:41] paper it reads like a pretty random
[2:44] machine translation paper and that's
[2:46] because I think the authors didn't fully
[2:47] anticipate the impact that the
[2:49] Transformer would have on the field and
[2:51] this architecture that they produced in
[2:52] the context of machine translation in
[2:54] their case actually ended up taking over
[2:57] the rest of AI in the next 5 years
[3:00] after and so this architecture with
[3:02] minor changes was copy pasted into a
[3:05] huge amount of applications in AI in
[3:07] more recent years and that includes at
[3:10] the core of ChatGPT now
[3:13] what I'd like to do now is
[3:15] build out something like
[3:17] ChatGPT but we're not going to be able to
[3:19] of course reproduce ChatGPT this is a
[3:21] very serious production grade system it
[3:23] is trained on a good chunk of the
[3:26] internet and then there's a lot of
[3:29] pre-training and fine-tuning stages to
[3:31] it and so it's very complicated what I'd
[3:33] like to focus on is just to train a
[3:36] Transformer based language model and in
[3:38] our case it's going to be a character
[3:40] level language model I still think that
[3:43] is very educational with respect to
[3:45] how these systems work so I don't want
[3:47] to train on a chunk of the internet we
[3:48] need a smaller data set in this case I
[3:51] propose that we work with my favorite
[3:53] toy data set it's called tiny
[3:55] Shakespeare and what it is is
[3:57] basically it's a concatenation of all of
[3:59] the works of Shakespeare in my
[4:00] understanding and so this is all of
[4:02] Shakespeare in a single file this
[4:05] file is about 1 megabyte and it's just all
[4:07] of
[4:08] Shakespeare and what we are going to do
[4:10] now is we're going to basically model
[4:12] how these characters follow each
[4:14] other so for example given a chunk of
[4:16] these characters like this given some
[4:19] context of characters in the past the
[4:22] Transformer neural network will look at
[4:24] the characters that I've highlighted and
[4:26] is going to predict that g is likely to
[4:28] come next in the sequence and it's going
[4:30] to do that because we're going to train
[4:31] that Transformer on Shakespeare and it's
[4:34] just going to try to produce
[4:36] character sequences that look like this
[4:39] and in that process it's going to model
[4:40] all the patterns inside this data so
[4:43] once we've trained the system I'd just
[4:45] like to give you a preview we can
[4:47] generate infinite Shakespeare and of
[4:49] course it's a fake thing that looks kind
[4:51] of like
[4:53] Shakespeare
[4:55] apologies there's some jank that
[4:59] I'm not able to resolve in here but
[5:02] you can see how this is going
[5:05] character by character and it's kind of
[5:07] like predicting Shakespeare like
[5:09] language so verily my Lord the sites
[5:12] have left the again the king coming with
[5:15] my curses with precious pale and then
[5:19] Tranio says something else etc and this
[5:21] is just coming out of the Transformer in
[5:23] a very similar manner as it would come
[5:25] out in ChatGPT in our case character by
[5:27] character in ChatGPT it's coming out
[5:31] on the token by token level and tokens
[5:33] are these sort of like little subword
[5:35] pieces so they're not Word level they're
[5:36] kind of like word chunk
[5:38] level and now I've already written
[5:43] all the code to train these
[5:45] Transformers and it is in a GitHub
[5:48] repository that you can find and it's
[5:50] called
[5:51] nanoGPT so nanoGPT is a repository that
[5:54] you can find on my GitHub and it's a
[5:56] repository for training Transformers
[5:59] on any given text and what I think is
[6:02] interesting about it because there's
[6:03] many ways to train Transformers but this
[6:05] is a very simple implementation so it's
[6:06] just two files of 300 lines of code each
[6:10] one file defines the GPT model the
[6:12] Transformer and one file trains it on
[6:14] some given text data set and here I'm
[6:17] showing that if you train it on the
[6:18] OpenWebText data set which is a fairly
[6:20] large data set of web pages then I
[6:22] reproduce the performance of
[6:25] GPT-2 so GPT-2 is an early version of OpenAI's
[6:29] GPT from 2019 if I recall
[6:32] correctly and I've only so far
[6:34] reproduced the smallest 124 million
[6:36] parameter model but basically this is
[6:38] just proving that the codebase is
[6:39] correctly arranged and I'm able to load
[6:42] the neural network weights that OpenAI
[6:45] has released so you can take a
[6:48] look at the finished code here in nanoGPT
[6:50] but what I would like to do in this
[6:51] lecture is I would like to basically
[6:55] write this repository from scratch so
[6:57] we're going to begin with an empty file
[6:59] and we're going to define a
[7:00] Transformer piece by piece we're going
[7:03] to train it on the tiny Shakespeare data
[7:05] set and we'll see how we can then
[7:08] generate infinite Shakespeare and of
[7:10] course this can be applied to any
[7:12] arbitrary text data set that you like
[7:14] but my goal really here is to just
[7:16] make you understand and appreciate
[7:18] how under the hood ChatGPT works and
[7:22] really all that's required is
[7:24] proficiency in Python and some basic
[7:27] understanding of calculus and
[7:29] statistics
[7:30] and it would help if you also see my
[7:32] previous videos on the same YouTube
[7:34] channel in particular my makemore
[7:35] series where I define smaller and
[7:40] simpler neural network language models
[7:42] so multilayer perceptrons and so on it
[7:45] really introduces the language modeling
[7:46] framework and then here in this video
[7:49] we're going to focus on the Transformer
[7:50] neural network itself okay so I created
[7:53] a new Google Colab Jupyter notebook here
[7:57] and this will allow me to later easily
[7:58] share this code that we're going to
[8:00] develop together with you so you can
[8:01] follow along so this will be in the video
[8:03] description later now here I've just
[8:07] done some preliminaries I downloaded the
[8:09] data set the tiny Shakespeare data set
[8:10] at this URL and you can see that it's
[8:12] about a 1 megabyte file then here I open
[8:15] the input.txt file and just read in all
[8:17] the text as a string and we see that
[8:20] we are working with 1 million characters
[8:22] roughly and the first 1,000 characters
[8:24] if we just print them out are basically
[8:26] what you would expect this is the first
[8:28] 1,000 characters of the tiny Shakespeare
[8:30] data set roughly up to here so so far so
[8:34] good next we're going to take this text
[8:37] and the text is a sequence of characters
[8:39] in Python so when I call the set
[8:41] Constructor on it I'm just going to get
[8:44] the set of all the characters that occur
[8:46] in this text and then I call list on
[8:49] that to create a list of those
[8:51] characters instead of just a set so that
[8:53] I have an ordering an arbitrary ordering
[8:56] and then I sort that so basically we get
[8:59] just all the characters that occur in
[9:00] the entire data set and they're sorted
[9:02] now the number of them is going to be
[9:04] our vocabulary size these are the
[9:06] possible elements of our sequences and
[9:09] we see that when I print here the
[9:11] characters there's 65 of them in total
[9:14] there's a space character and then all
[9:16] kinds of special characters and then
[9:19] capitals and lowercase letters so that's
[9:21] our vocabulary and that's the set of
[9:23] possible characters that the
[9:25] model can see or emit okay so next we
[9:29] would like to develop some strategy
[9:31] to tokenize the input text now when
[9:35] people say tokenize they mean convert
[9:36] the raw text as a string to some
[9:39] sequence of integers according to some
[9:41] vocabulary
[9:43] of possible elements so as an example
[9:46] here we are going to be building a
[9:48] character level language model so we're
[9:49] simply going to be translating
[9:50] individual characters into integers so
[9:53] let me show you a chunk of code that
[9:55] sort of does that for us so we're
[9:57] building both the encoder and the
[9:58] decoder
[10:00] and let me just talk through what's
[10:01] happening
[10:02] here when we encode an arbitrary text
[10:05] like hi there we're going to receive a
[10:08] list of integers that represents that
[10:10] string so for example 46 47 Etc and then
[10:14] we also have the reverse mapping so we
[10:17] can take this list and decode it to get
[10:20] back the exact same string so it's
[10:22] really just like a translation to
[10:24] integers and back for arbitrary string
[10:26] and for us it is done on a character
[10:28] level
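The encode and decode functions described here can be sketched in plain Python. This is a minimal sketch under stated assumptions: a short stand-in string replaces the full tiny Shakespeare file, and the names `stoi` and `itos` are illustrative. Because the integer ids depend on the text used, they will not match the 46, 47 quoted in the lecture.

```python
# Stand-in corpus (assumption); the lecture reads the full tiny Shakespeare text instead.
text = "hi there, this is a tiny stand-in corpus"

# Vocabulary: every distinct character that occurs in the text, sorted.
chars = sorted(set(text))
vocab_size = len(chars)

# Lookup tables: character -> integer and integer -> character.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Translate a string into a list of integers, one per character."""
    return [stoi[c] for c in s]


def decode(ids):
    """Translate a list of integers back into a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hi there")
print(ids)
print(decode(ids))  # round-trips back to the original string
```

One consequence of this design is that `encode` raises a `KeyError` for any character that never occurred in the text the vocabulary was built from.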
[10:30] now the way this was achieved is we just
[10:31] iterate over all the characters here and
[10:34] create a lookup table from the character
[10:35] to the integer and vice versa and then
[10:38] to encode some string we simply
[10:40] translate all the characters
[10:41] individually and to decode it back we
[10:44] use the reverse mapping and concatenate
[10:46] all of it now this is only one of many
[10:49] possible encodings or many possible sort
[10:51] of tokenizers and it's a very simple one
[10:54] but there's many other schemas that
[10:55] people have come up with in practice so
[10:57] for example Google uses
[10:59] SentencePiece so SentencePiece will also
[11:02] encode text into integers but in a
[11:05] different schema and using a different
[11:08] vocabulary and SentencePiece is a
[11:10] subword tokenizer and what
[11:13] that means is that you're not
[11:15] encoding entire words but you're also not
[11:17] encoding individual characters
[11:19] it's a subword unit level and that's
[11:22] usually what's adopted in practice for
[11:24] example also OpenAI has this library
[11:26] called tiktoken that uses a byte pair
[11:28] encoding
[11:29] tokenizer and that's what GPT uses
[11:33] and you can also just encode words
[11:35] like hello world into a list of integers
[11:38] so as an example I'm using the tiktoken
[11:40] library here I'm getting the encoding
[11:43] for GPT-2 or that was used for GPT-2
[11:46] instead of just having 65 possible
[11:48] characters or tokens they have about 50,000
[11:51] tokens and so when they encode the exact
[11:54] same string hi there we only get a
[11:57] list of three integers but those
[11:59] integers are not between 0 and 64 they
[12:01] are between 0 and
[12:05] 50,256 so basically you can trade off the
[12:09] code book size and the sequence lengths
[12:12] so you can have very long sequences of
[12:13] integers with very small vocabularies or
[12:16] you can have short sequences of
[12:20] integers with very large vocabularies
[12:23] and so typically people use in practice
[12:25] these subword encodings but I'd like to
[12:28] keep our tokenizer very simple so we're
[12:30] using a character level tokenizer and that
[12:33] means that we have very small code books
[12:35] we have very simple encode and decode
[12:37] functions but we do get very long
[12:40] sequences as a result but that's the
[12:42] level we're going to stick with in
[12:43] this lecture because it's the simplest
[12:45] thing okay so now that we have an
[12:46] encoder and a decoder effectively a
[12:49] tokenizer we can tokenize the entire
[12:51] training set of Shakespeare so here's a
[12:53] chunk of code that does that and I'm
[12:55] going to start to use the PyTorch
[12:56] library and specifically
[12:58] torch.tensor from the PyTorch library so we're
[13:01] going to take all of the text in tiny
[13:03] Shakespeare encode it and then wrap it
[13:05] into a torch.tensor to get the data
[13:08] tensor so here's what the data tensor
[13:10] looks like when I look at just the first
[13:12] 1,000 characters or the first 1,000 elements
[13:14] of it so we see that we have a massive
[13:16] sequence of integers and this sequence
[13:18] of integers here is basically an
[13:20] identical translation of the first
[13:22] 1,000 characters
[13:24] here so I believe for example that zero
[13:27] is a newline character and maybe
[13:29] one is a space not 100% sure but from
[13:32] now on the entire data set of text is
[13:34] re-represented as
[13:35] a single very large
[13:38] sequence of
[13:39] integers let me do one more thing before
[13:41] we move on here I'd like to separate out
[13:43] our data set into a train and a
[13:45] validation split so in particular we're
[13:48] going to take the first 90% of the data
[13:51] set and consider that to be the training
[13:52] data for the Transformer and we're going
[13:54] to withhold the last 10% at the end of
[13:56] it to be the validation data and this
[13:59] will help us understand to what extent
[14:01] our model is overfitting so we're going
[14:03] to basically hide and keep the
[14:04] validation data on the side because we
[14:06] don't want just a perfect memorization
[14:08] of this exact Shakespeare we want a
[14:11] neural network that sort of creates
[14:12] Shakespeare-like text and so it
[14:15] should be fairly likely for it to
[14:17] produce the actual stowed away
[14:21] true Shakespeare text and so we're
[14:24] going to use this to get a sense of
[14:26] the overfitting okay so now we would
[14:28] like to start plugging these text
[14:30] sequences or integer sequences into the
[14:32] Transformer so that it can train and
[14:34] learn those patterns now the important
[14:36] thing to realize is we're never going to
[14:38] actually feed the entire text into a
[14:40] Transformer all at once that would be
[14:42] computationally very expensive and
[14:44] prohibitive so when we actually train a
[14:46] Transformer on a lot of these data sets
[14:48] we only work with chunks of the data set
[14:50] and when we train the Transformer we
[14:52] basically sample random little chunks
[14:53] out of the training set and train on
[14:55] just chunks at a time and these chunks
[14:58] have basically some kind of a length and
[15:01] some maximum length now the maximum
[15:04] length typically at least in the code I
[15:06] usually write is called block size you
[15:08] can you can uh find it under different
[15:10] names like context length or something
[15:12] like that let's start with the block
[15:14] size of just eight and let me look at
[15:16] the first train data characters the
[15:18] first block size plus one characters
[15:20] I'll explain why plus one in a
[15:22] second so this is the first nine
[15:24] characters in the sequence in the
[15:27] training set now what I'd like to point
[15:30] out is that when you sample a chunk of
[15:31] data like this so say these nine
[15:34] characters out of the training set this
[15:36] actually has multiple examples packed
[15:38] into it and that's because all of
[15:41] these characters follow each other and
[15:43] so what this means is that when
[15:47] we plug it into a Transformer we're
[15:49] going to actually simultaneously train
[15:50] it to make a prediction at every one of
[15:52] these
[15:53] positions now in a chunk of nine
[15:56] characters there's actually eight
[15:58] individual examples packed in there so there's
[16:01] the example that in the
[16:04] context of 18 47 likely comes next in a
[16:08] context of 18 and 47 56 comes next in a
[16:12] context of 18 47 56 57 can come next and
[16:16] so on so that's the eight individual
[16:18] examples let me actually spell it out
[16:20] with
[16:21] code so here's a chunk of code to
[16:24] illustrate X are the inputs to the
[16:26] Transformer it will just be the first
[16:28] block size characters y will be the
[16:32] next block size characters so it's
[16:34] offset by one and that's because y are
[16:37] the targets for each position in the
[16:40] input and then here I'm iterating over
[16:42] all the block size of eight and the
[16:45] context is always all the characters in
[16:47] x up to t and including t and the
[16:51] target is always the t-th character but
[16:53] in the targets array y so let me just
[16:56] run
[16:57] this and basically it spells out what I
[16:59] said in words these are the eight
[17:02] examples hidden in a chunk of nine
[17:04] characters that we sampled from the
[17:08] training set I want to mention one more
[17:11] thing we train on all the eight examples
[17:14] here with context between one all the
[17:16] way up to context of block size and we
[17:19] train on that not just for computational
[17:20] reasons because we happen to have the
[17:22] sequence already or something like that
[17:23] it's not just done for efficiency it's
[17:26] also done to make the Transformer
[17:28] network be used to seeing contexts all
[17:32] the way from as little as one all the
[17:33] way to block size and we'd like the
[17:36] Transformer to be used to seeing
[17:38] everything in between and that's going
[17:39] to be useful later during inference
[17:41] because while we're sampling we can
[17:43] start the sampling generation with as
[17:45] little as one character of context and
[17:47] the Transformer knows how to predict the
[17:49] next character with as little as
[17:51] just a context of one and so then it can
[17:53] predict everything up to block size and
[17:55] after block size we have to start
[17:56] truncating because the Transformer
[17:58] will never receive more than block
[18:01] size inputs when it's predicting the
[18:03] next
[18:03] character Okay so we've looked at the
[18:06] time dimension of the tensors that are
[18:07] going to be feeding into the Transformer
[18:09] there's one more Dimension to care about
[18:11] and that is the batch Dimension and so
[18:13] as we're sampling these chunks of text
[18:15] every time
[18:17] we're going to feed them into a
[18:18] Transformer we're going to have a
[18:20] batch of multiple chunks of text that
[18:22] are all stacked up in a single
[18:23] tensor and that's just done for
[18:25] efficiency just so that we can keep the
[18:27] GPUs busy because they are very good
[18:29] at parallel processing of data and
[18:33] so we just want to process multiple
[18:35] chunks all at the same time but those
[18:37] chunks are processed completely
[18:38] independently they don't talk to each
[18:39] other and so on so let me basically just
[18:42] generalize this and introduce a batch
[18:44] Dimension here's a chunk of
[18:46] code let me just run it and then I'm
[18:48] going to explain what it
[18:50] does so here because we're going to
[18:52] start sampling random locations in the
[18:54] data set to pull chunks from I am
[18:57] setting the seed in the
[19:00] random number generator so that the
[19:01] numbers I see here are going to be the
[19:02] same numbers you see later if you try to
[19:04] reproduce this now the batch size here
[19:07] is how many independent sequences we are
[19:09] processing every forward backward pass
[19:11] of the
[19:12] Transformer the block size as I
[19:14] explained is the maximum context length
[19:16] to make those predictions so let's say batch
[19:19] size four block size eight and then
[19:21] here's how we get a batch for any
[19:23] arbitrary split if the split is a
[19:25] training split then we're going to look
[19:26] at train data otherwise at val data
[19:30] that gives us the data array and then
[19:33] when I generate random positions to grab
[19:35] a chunk out of I
[19:38] actually generate batch size number of
[19:41] random offsets so because this is four
[19:44] ix is going to be four
[19:47] numbers that are randomly generated
[19:49] between zero and len of data minus block
[19:51] size so it's just random offsets into
[19:53] the training
[19:54] set and then the x's as I explained are the
[19:58] first block size characters
[20:00] starting at i the y's are the offset by
[20:05] one of that so just add plus one and
[20:08] then we're going to get those chunks for
[20:10] every one of the integers i in ix and use
[20:14] torch.stack to take all those
[20:17] one-dimensional tensors as we saw here
[20:20] and we're going to stack them up as
[20:24] rows and so they all become a row in a
[20:27] 4x8 tensor
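The batching scheme just described can be sketched without PyTorch. This is a minimal sketch under stated assumptions: plain Python lists stand in for tensors, the `random` module stands in for PyTorch's random number generator, and `data` is a made-up token sequence rather than the encoded Shakespeare text. It also includes the 90/10 train/validation split mentioned earlier and spells out the context/target examples packed into each row.

```python
import random

random.seed(1337)  # fixed seed so the sampled offsets are reproducible

# Stand-in for the encoded data set: a list of integer tokens (assumption).
data = [i % 65 for i in range(1000)]

# First 90% is training data, the last 10% is held out for validation.
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

batch_size = 4  # independent sequences per forward/backward pass
block_size = 8  # maximum context length


def get_batch(split):
    """Sample batch_size random chunks; y is x shifted by one position."""
    source = train_data if split == "train" else val_data
    # Random start offsets that leave room for block_size + 1 tokens.
    ix = [random.randrange(len(source) - block_size) for _ in range(batch_size)]
    x = [source[i : i + block_size] for i in ix]
    y = [source[i + 1 : i + block_size + 1] for i in ix]
    return x, y


xb, yb = get_batch("train")

# Every row packs block_size (context, target) examples.
examples = []
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b][: t + 1]  # everything up to and including position t
        target = yb[b][t]         # the token that follows that context
        examples.append((context, target))

print(len(examples))  # 32
```

In the lecture's version the rows are stacked into a single 4x8 tensor with `torch.stack`; the nested lists here play the same role.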
[20:29] so here's where I'm printing them when I
[20:32] sample a batch xb and yb the input to
[20:35] the Transformer now the input x is
[20:39] the 4x8 tensor four rows of eight
[20:44] columns and each one of these is a chunk
[20:47] of the training
[20:48] set and then the targets here are in the
[20:52] associated array y and they will come in
[20:54] to the Transformer all the way at the
[20:55] end to create the loss function
[20:59] so they will give us the correct
[21:01] answer for every single position inside
[21:03] X and then these are the four
[21:06] independent
[21:07] rows so spelled out as we did
[21:11] before this 4x8 array contains a
[21:14] total of 32 examples and they're
[21:17] completely independent as far as the
[21:19] Transformer is
[21:20] concerned so when the input is 24 the
[21:25] target is 43 or rather 43 here in the y
[21:28] array
[21:29] when the input is 24 43 the target is
[21:31] 58 when the input is 24 43 58 the
[21:34] target is 5 etc or like when it is 52
[21:38] 58 1 the target is
[21:40] 58 right so you can sort of see this
[21:43] spelled out these are the 32 independent
[21:45] examples packed in to a single batch of
[21:48] the input X and then the desired targets
[21:51] are in y and so now this integer tensor
[21:57] x is going to feed into the
[22:00] Transformer and that Transformer is
[22:02] going to simultaneously process all
[22:04] these examples and then look up the
[22:06] correct integers to predict in every
[22:08] one of these positions in the tensor y
[22:11] okay so now that we have our batch of
[22:13] input that we'd like to feed into a
[22:15] Transformer let's start basically
[22:16] feeding this into neural networks now
[22:19] we're going to start off with the
[22:20] simplest possible neural network which
[22:22] in the case of language modeling in my
[22:23] opinion is the bigram language model and
[22:25] we've covered the bigram language model
[22:26] in my makemore series in a lot of depth
[22:29] and so here I'm going to sort of go
[22:31] faster and let's just implement a PyTorch
[22:33] module directly that implements the bigram
[22:36] language
[22:36] model so I'm importing the PyTorch nn
[22:41] module for
[22:43] reproducibility and then here I'm
[22:44] constructing a bigram language model
[22:46] which is a subclass of
[22:48] nn.Module and then I'm calling it and I'm
[22:51] passing it the inputs and the targets
[22:53] and I'm just printing now when the
[22:55] inputs and targets come here you see that
[22:57] I'm just taking the index the inputs
[23:00] x here which I rename to idx and I'm
[23:03] just passing them into this token
[23:04] embedding table so what's going on here is
[23:07] that here in the Constructor we are
[23:09] creating a token embedding table and it
[23:12] is of size vocab size by vocab
[23:15] size and we're using nn.Embedding which
[23:18] is a very thin wrapper around basically
[23:20] a tensor of shape vocab size by vocab
[23:23] size and what's happening here is that
[23:25] when we pass idx here every single
[23:28] integer in our input is going to refer
[23:30] to this embedding table and it's going
[23:32] to pluck out a row of that embedding
[23:34] table corresponding to its index so 24
[23:37] here will go into the embedding table
[23:39] and will pluck out the 24th row and
[23:42] then 43 will go here and pluck out the
[23:44] 43rd row etc and then PyTorch is going to
[23:47] arrange all of this into a batch by time
[23:50] by channel tensor in this case batch
[23:53] is four time is eight and C which is the
[23:57] channels is vocab size or 65 and so
[24:01] we're just going to pluck out all those
[24:02] rows arrange them in a b by T by C and
[24:05] now we're going to interpret this as the
[24:07] logits which are basically the scores
[24:10] for the next character in the sequence
[24:12] and so what's happening here is we are
[24:14] predicting what comes next based on just
[24:17] the individual identity of a single
[24:19] token and you can do that because
[24:22] currently the tokens are not
[24:23] talking to each other and they're not
[24:25] seeing any context except for they're
[24:26] just seeing themselves so I'm
[24:29] token number five and then I can
[24:32] actually make pretty decent predictions
[24:33] about what comes next just by knowing
[24:35] that I'm token five because some
[24:37] characters follow other
[24:39] characters in typical scenarios so we
[24:42] saw a lot of this in a lot more depth in
[24:44] the makemore series and here if I just
[24:46] run this then we currently get the
[24:49] predictions the scores the logits for
[24:53] every one of the 4x8 positions now that
[24:55] we've made predictions about what comes
[24:57] next we'd like to evaluate the loss
[24:58] function and so in the makemore series we
[25:00] saw that a good way to measure a loss or
[25:03] like a quality of the predictions is to
[25:05] use the negative log likelihood loss
[25:07] which is also implemented in PyTorch
[25:09] under the name cross entropy so what we'd
[25:12] like to do here is loss is the cross
[25:15] entropy on the predictions and the
[25:17] targets and so this measures the quality
[25:20] of the logits with respect to the
[25:21] targets in other words we have the
[25:24] identity of the next character so how
[25:26] well are we predicting the next
[25:28] character based on the logits and
[25:30] intuitively the correct
[25:33] dimension of logits depending on
[25:36] whatever the target is should have a
[25:38] very high number and all the other
[25:39] dimensions should be a very low number
[25:41] right now
[25:44] this is what we want we want to
[25:46] basically output the logits and the
[25:50] loss but
[25:52] unfortunately this won't actually run
[25:55] we get an error message but intuitively
[25:57] we want to measure this now when we
[26:01] go to the PyTorch cross entropy
[26:04] documentation here we're trying to
[26:08] call the cross entropy in its functional
[26:10] form so that means we don't have to
[26:11] create like a module for it but here
[26:14] when we go to the documentation you have
[26:16] to look into the details of how PyTorch
[26:18] expects these inputs and basically the
[26:20] issue here is PyTorch expects if you have
[26:24] multi-dimensional input which we do
[26:25] because we have a B by T by C tensor then
[26:28] it actually really wants the channels to
[26:31] be the second dimension here
[26:35] so basically it wants a B by C
[26:38] by T instead of a B by T by C and so it's
[26:42] just the details of how PyTorch treats
[26:45] these kinds of inputs and so we don't
[26:49] actually want to deal with that so what
[26:51] we're going to do instead is we need to
[26:52] basically reshape our logits so here's
[26:54] what I like to do I like to
[26:56] basically give names to the dimensions
[26:58] so logits.shape is B by T by C and unpack
[27:01] those numbers and then let's say that
[27:04] logits equals logits.view and we want it
[27:07] to be B * T by C so just a
[27:11] two-dimensional
[27:12] array right so
[27:15] we're going to take all of these
[27:18] positions here and we're going to
[27:20] stretch them out in a one-dimensional
[27:22] sequence and preserve the channel
[27:25] dimension as the second
[27:26] dimension so we're just kind of like
[27:28] stretching out the array so it's
[27:29] two-dimensional and in that case it's going
[27:31] to better conform to what PyTorch
[27:33] expects in its dimensions now we
[27:36] have to do the same to targets because
[27:38] currently targets are of shape B by T
[27:44] and we want it to be just B * T so
[27:47] one-dimensional now alternatively you
[27:49] could always still just do minus one
[27:51] because PyTorch will guess what this
[27:53] should be if you want to lay it out
[27:55] but let me just be explicit and say B *
[27:57] T once we've reshaped this it will match
[28:00] the cross entropy case and then we
[28:03] should be able to evaluate our
[28:06] loss okay so that R now and we can do
[28:10] loss and So currently we see that the
[28:12] loss is
[28:13] 4.87 now because we have 65
[28:17] possible vocabulary elements we can
[28:19] actually guess at what the loss should
[28:20] be and in
[28:22] particular we covered negative log
[28:24] likelihood in a lot of detail we are
[28:26] expecting log, or ln, of 1 over 65
[28:32] and negative of that so we're expecting
[28:34] the loss to be about 4.17 but we're
[28:37] getting 4.87 and so that's telling us
[28:40] that the initial predictions are not uh
[28:42] super diffuse they've got a little bit
[28:43] of entropy and so we're guessing wrong
[28:47] uh so yes but actually we
[28:50] are able to evaluate the loss okay so
[28:53] now that we can evaluate the quality of
[28:54] the model on some data we'd like to also
[28:57] be able to generate from the model so
[28:59] let's do the generation now I'm going to
[29:01] go again a little bit faster here
[29:03] because I covered all this already in
[29:04] previous
[29:05] videos
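A minimal sketch of the reshape and the sanity check on the loss described above. The random `logits` and `targets` are stand-ins (an assumption, not the lecture's data); only the shapes and the 65-character vocabulary come from the talk.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, T, C = 4, 8, 65                      # batch, time, channels (vocabulary size)
logits = torch.randn(B, T, C)           # stand-in for the model output
targets = torch.randint(0, C, (B, T))   # stand-in for the next-character targets

# F.cross_entropy accepts (N, C) logits with (N,) targets,
# so flatten the batch and time dimensions into one.
loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Alternative: keep three dimensions but put channels second, (B, C, T).
loss_alt = F.cross_entropy(logits.transpose(1, 2), targets)

# Loss of a perfectly uniform guess over 65 characters: -ln(1/65).
uniform_loss = -math.log(1 / 65)
print(loss.item(), loss_alt.item(), round(uniform_loss, 2))
```

Both calls compute the same mean loss; flattening just avoids having to remember that the channel dimension must come second for multi-dimensional input.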
[29:07] so here's a generate function for the
[29:11] model so we take some uh we take the the
[29:15] same kind of input idx here and
[29:18] basically this is the current uh context
[29:22] of some characters in some
[29:24] batch so it's also B by T and the job of
[29:28] generate is to basically take this B by T
[29:30] and extend it to be B by T+1, plus 2,
[29:32] plus 3 and so it basically
[29:34] continues the generation in all the
[29:36] batch dimensions along the time dimension
[29:39] so that's its job and it will do that
[29:41] for max_new_tokens so you can see here
[29:43] on the bottom there's going to be some
[29:45] stuff here but on the bottom whatever is
[29:47] predicted is concatenated on top of the
[29:50] previous idx along the first dimension
[29:53] which is the time dimension to create a
[29:54] B by T+1 so that becomes a new idx so
[29:58] the job of generate is to take a B by T
[30:00] and make it a B by T plus 1, plus 2, plus
[30:02] three, as many as we want, max_new_tokens
[30:05] so this is the generation from the model
[30:08] now inside the generation what what are
[30:10] we doing we're taking the current
[30:11] indices we're getting the predictions so
[30:15] we get uh those are in the logits and
[30:18] then the loss here is going to be
[30:19] ignored because um we're not we're not
[30:21] using that and we have no targets that
[30:23] are sort of ground truth targets that
[30:25] we're going to be comparing with
[30:28] then once we get the logits we are only
[30:30] focusing on the last step so instead of
[30:33] a b by T by C we're going to pluck out
[30:36] the negative-1 the last element in the
[30:38] time Dimension because those are the
[30:40] predictions for what comes next so that
[30:42] gives us the logits which we then
[30:44] convert to probabilities via softmax and
[30:47] then we use torch.multinomial to sample
[30:49] from those probabilities and we ask
[30:51] PyTorch to give us one sample and so
[30:54] idx_next will become a B by one because in
[30:57] each uh one of the batch dimensions
[31:00] we're going to have a single prediction
[31:01] for what comes next so this num_samples
[31:03] equals one will make this be a
[31:06] one and then we're going to take those
[31:08] integers that come from the sampling
[31:10] process according to the probability
[31:11] distribution given here and those
[31:13] integers just get concatenated on top of
[31:15] the current sort of like running stream
[31:17] of integers and this gives us a B by T+1
[31:20] and then we can return that now one
[31:24] thing here is you see how I'm calling
[31:26] self(idx) which will end up going to
[31:29] the forward function I'm not providing
[31:31] any targets so currently this would give
[31:33] an error because targets is uh
[31:36] sort of like not given so targets has to
[31:39] be optional so targets is None by
[31:41] default and then if targets is None then
[31:44] there's no loss to create so it's just
[31:47] loss is None but else all of this
[31:50] happens and we can create a loss so this
[31:53] will make it so um if we have the
[31:56] targets we provide them and get a loss
[31:57] if we have no targets we'll just
[31:59] get the
[32:00] logits so this here will generate from
[32:02] the model um and let's take that for a
[32:06] ride
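A sketch of the model as described so far: a single embedding table, a forward pass with optional targets, and a generate loop. The class and attribute names are guesses based on the spoken identifiers, so treat them as assumptions rather than the exact code on screen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token id looks up a row holding the logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)          # (B, T, C)
        if targets is None:
            loss = None                                   # nothing to compare against
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            loss = F.cross_entropy(logits, targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of token ids for the current context
        for _ in range(max_new_tokens):
            logits, _ = self(idx)                         # loss is ignored here
            logits = logits[:, -1, :]                     # last time step only, (B, C)
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)       # (B, T+1)
        return idx

torch.manual_seed(1337)
model = BigramLanguageModel(vocab_size=65)
context = torch.zeros((1, 1), dtype=torch.long)           # a single token with id 0
out = model.generate(context, max_new_tokens=100)
print(out.shape)  # torch.Size([1, 101])
```

The output has one more column per generated token, so a 1 by 1 context grows to 1 by 101 after 100 sampling steps.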
[32:08] now oops so I have another code chunk
[32:11] here which will generate
[32:13] from the model and okay this is kind of
[32:15] crazy so maybe let me let me break this
[32:18] down so these are the idx
[32:23] right batch will be just
[32:26] one, time will be just one so I'm
[32:30] creating a little one by one tensor and
[32:32] it's holding a zero and the dtype the
[32:35] data type is uh integer so zero is going
[32:38] to be how we kick off the generation and
[32:40] remember that zero is uh is the element
[32:44] standing for a new line character so
[32:45] it's kind of like a reasonable thing to
[32:47] to feed in as the very first character
[32:49] in a sequence to be the new
[32:51] line um so it's going to be idx which
[32:54] we're going to feed in here then we're
[32:56] going to ask for 100 tokens
[32:58] and then .generate will continue that
[33:01] now because uh generate works on the
[33:05] level of batches we then have to
[33:07] index into the zeroth row to basically
[33:09] unplug the um the single batch dimension
[33:13] that exists and then that gives us a um
[33:18] time steps just a one-dimensional array
[33:20] of all the indices which we will convert
[33:23] to a simple Python list from a PyTorch
[33:26] tensor so that that can feed into our
[33:28] decode function and uh convert those
[33:32] integers into text so let me bring this
[33:34] back and we're generating 100 tokens
[33:37] let's
[33:37] run and uh here's the generation that we
[33:40] achieved so obviously it's garbage and
[33:43] the reason it's garbage is because this
[33:44] is a totally random model so next up
[33:47] we're going to want to train this model
[33:49] now one more thing I wanted to point out
[33:50] here is this function is written to be
[33:53] General but it's kind of like ridiculous
[33:55] right now because
[33:58] we're feeding in all this we're building
[33:59] out this context and we're concatenating
[34:02] it all and we're always feeding it all
[34:05] into the model but that's kind of
[34:07] ridiculous because this is just a simple
[34:09] bigram model so to make for example this
[34:11] prediction about K we only needed this W
[34:14] but actually what we fed into the model
[34:15] is we fed the entire sequence and then
[34:18] we only looked at the very last piece
[34:20] and predicted K so the only reason I'm
[34:23] writing it in this way is because right
[34:25] now this is a bigram model but I'd like to
[34:27] keep this function fixed and I'd
[34:29] like it to work um later when our
[34:32] characters actually um basically look
[34:35] further in the history and so right now
[34:37] the history is not used so this looks
[34:39] silly uh but eventually the history will
[34:42] be used and so that's why we want to uh
[34:44] do it this way so just a quick comment
[34:46] on that so now we see that this is um
[34:49] random so let's train the model so it
[34:51] becomes a bit less random okay let's Now
[34:53] train the model so first what I'm going
[34:55] to do is I'm going to create a PyTorch
[34:57] optimization object so here we are using
[35:00] the optimizer AdamW um now in the
[35:05] makemore series we've only ever used stochastic
[35:06] gradient descent the simplest possible
[35:08] optimizer which you can get using
[35:10] SGD instead but I want to use Adam which
[35:12] is a much more advanced and popular
[35:14] optimizer and it works extremely well
[35:16] a typical good setting for the
[35:19] learning rate is roughly 3e-4 uh but for
[35:22] very very small networks like is the
[35:23] case here you can get away with much
[35:25] much higher learning rates 1e-3 or even
[35:28] higher probably but let me create the
[35:30] optimizer object which will basically
[35:33] take the gradients and uh update the
[35:35] parameters using the
[35:36] gradients and then here our batch size
[35:40] up above was only four so let me
[35:41] actually use something bigger let's say
[35:43] 32 and then for some number of steps um
[35:46] we are sampling a new batch of data
[35:48] we're evaluating the loss uh we're
[35:51] zeroing out all the gradients from the
[35:52] previous step getting the gradients for
[35:54] all the parameters and then using those
[35:56] gradients to up update our parameters so
[35:58] typical training loop as we saw in the
[36:00] makemore series so let me now uh run
[36:04] this for say 100 iterations and let's
[36:07] see what kind of losses we're going to
[36:09] get so we started around
[36:12] 4.7 and now we're getting down to
[36:14] like 4.6 4.5 etc so the optimization is
[36:18] definitely happening but um let's uh
[36:22] sort of try to increase the number of
[36:23] iterations and only print at the
[36:25] end because we probably want to train for
[36:29] longer okay so we're down to 3.6
[36:34] roughly roughly down to
[36:40] three this is the most janky
[36:46] optimization okay it's working let's
[36:48] just do
[36:50] 10,000 and then from here we want to
[36:53] copy this and hopefully we're going
[36:56] to get something reasonable and of course
[36:58] it's not going to be Shakespeare from a
[37:00] bigram model but at least we see that the
[37:01] loss is improving and uh hopefully we're
[37:05] expecting something a bit more
[37:06] reasonable okay so we're down at about
[37:08] 2.5-ish let's see what we get okay
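The training loop walked through above can be sketched roughly like this. The corpus is a hypothetical repeated sentence rather than the Shakespeare text, and the bigram model is reduced to its single lookup table, so the loss values will not match the ones quoted in the video.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)

# Hypothetical stand-in corpus; the lecture trains on a Shakespeare text file.
text = "to be or not to be that is the question\n" * 50
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

block_size, batch_size = 8, 32

def get_batch():
    # random offsets; targets y are the inputs x shifted by one character
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

# Bigram model reduced to its lookup table of next-character logits.
table = nn.Embedding(len(chars), len(chars))
optimizer = torch.optim.AdamW(table.parameters(), lr=1e-3)

def loss_on(xb, yb):
    logits = table(xb)                                   # (B, T, C)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))

xb0, yb0 = get_batch()                                   # fixed batch for before/after
initial_loss = loss_on(xb0, yb0).item()

for step in range(1000):
    xb, yb = get_batch()                                 # sample a new batch
    loss = loss_on(xb, yb)                               # evaluate the loss
    optimizer.zero_grad(set_to_none=True)                # clear old gradients
    loss.backward()                                      # compute new gradients
    optimizer.step()                                     # update the parameters

final_loss = loss_on(xb0, yb0).item()
print(initial_loss, final_loss)
```

Comparing the loss on one fixed batch before and after training is just a quick way to confirm that the optimization is doing something.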
[37:12] dramatic improvements certainly on what
[37:14] we had here so let me just increase the
[37:17] number of tokens okay so we see that
[37:19] we're starting to get something at least
[37:21] like reasonable-ish
[37:25] um certainly not Shakespeare but uh the
[37:29] model is making progress so that is the
[37:31] simplest possible
[37:33] model so now what I'd like to do
[37:36] is obviously this is a very simple model
[37:39] because the tokens are not talking to
[37:41] each other so given the previous context
[37:43] of whatever was generated we're only
[37:45] looking at the very last character to
[37:46] make the predictions about what comes
[37:48] next so now these uh now these tokens
[37:50] have to start talking to each other and
[37:53] figuring out what is in the context so
[37:55] that they can make better predictions
[37:56] for what comes next and this is how
[37:57] we're going to kick off the uh
[37:59] Transformer okay so next I took the code
[38:02] that we developed in this Jupyter notebook
[38:03] and I converted it to be a script and
[38:05] I'm doing this because I just want to
[38:08] simplify our intermediate work into just
[38:10] the final product that we have at this
[38:12] point so in the top here I put all the
[38:15] hyperparameters that we've defined I
[38:16] introduced a few and I'm going to speak
[38:18] to that in a little bit otherwise a lot
[38:20] of this should be recognizable uh
[38:23] reproducibility read data get the
[38:25] encoder and the decoder create the train
[38:27] and test splits uh use the uh kind of like
[38:30] data loader um that gets a batch of the
[38:34] inputs and Targets this is new and I'll
[38:36] talk about it in a second now this is
[38:39] the bigram language model that we
[38:40] developed and it can forward and give us
[38:43] logits and loss and it can
[38:45] generate and then here we are creating
[38:48] the optimizer and this is the training
[38:51] Loop so everything here should look
[38:53] pretty familiar now some of the small
[38:55] things that I added number one I added
[38:57] the ability to run on a GPU if you have
[39:00] it so if you have a GPU then
[39:02] this will use CUDA instead of just CPU
[39:04] and everything will be a lot faster
[39:07] now when device becomes CUDA then we
[39:09] need to make sure that when we load the
[39:11] data we move it to
[39:13] device when we create the model we want
[39:15] to move uh the model parameters to
[39:18] device so as an example here we have the
[39:21] nn.Embedding table and it's got a
[39:23] weight inside it which stores the uh
[39:26] sort of lookup table so so that would be
[39:27] moved to the GPU so that all the
[39:29] calculations here happen on the GPU and
[39:32] they can be a lot faster and then
[39:34] finally here when I'm creating the
[39:35] context that feeds in to generate I have
[39:37] to make sure that I create it on the
[39:39] device number two what I introduced is
[39:43] uh the fact that here in the training
[39:46] loop here I was just printing the um
[39:50] loss.item() inside the training loop but this
[39:53] is a very noisy measurement of the
[39:54] current loss because every batch will be
[39:56] more or less lucky and so what I want to
[39:59] do usually um is uh I have an estimate_loss
[40:02] function and the estimate_loss
[40:05] basically then um goes up here and it
[40:10] averages up the loss over multiple
[40:12] batches so in particular we're going to
[40:15] iterate eval_iters times and we're going
[40:17] to basically get our loss and then we're
[40:19] going to get the average loss for both
[40:21] splits and so this will be a lot less
[40:24] noisy so here when we call the estimate
[40:26] loss we're we're going to report the uh
[40:28] pretty accurate train and validation
[40:31] loss now when we come back up you'll
[40:33] notice a few things here I'm setting the
[40:35] model to evaluation phase and down here
[40:38] I'm resetting it back to training phase
[40:40] now right now for our model as is this
[40:42] doesn't actually do anything because the
[40:44] only thing inside this model is this uh
[40:46] nn.Embedding and um this
[40:51] network would behave
[40:53] the same in both evaluation mode and
[40:55] training mode we have no dropout layers
[40:57] we have no batch norm layers etc but it is a
[41:00] good practice to Think Through what mode
[41:02] your neural network is in because some
[41:04] layers will have different Behavior Uh
[41:07] at inference time or training time and
[41:11] there's also this context manager
[41:12] torch.no_grad and this is just telling
[41:14] PyTorch that everything that happens
[41:16] inside this function we will not call
[41:18] .backward on and so PyTorch can be a lot
[41:21] more efficient with its memory use
[41:23] because it doesn't have to store all the
[41:25] intermediate variables uh because we're
[41:27] never going to call backward and so it
[41:29] can it can be a lot more memory
[41:30] efficient in that way so also a good
[41:32] practice to tell PyTorch when we don't
[41:35] intend to do back
[41:36] propagation so right now this script is
[41:39] about 120 lines of code and that's
[41:43] kind of our starter code I'm calling it
[41:45] bigram.py and I'm going to release it later
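A sketch of the loss-averaging helper and the device handling described above. The random token data, the value of `eval_iters`, and the bare `nn.Embedding` standing in for the model are all assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 20                      # batches to average over (assumed value)
vocab_size, block_size, batch_size = 65, 8, 32

# Hypothetical stand-in data: random token ids instead of the encoded text.
splits = {
    'train': torch.randint(0, vocab_size, (1000,)),
    'val': torch.randint(0, vocab_size, (200,)),
}

def get_batch(split):
    data = splits[split]
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)    # batches live on the chosen device

model = nn.Embedding(vocab_size, vocab_size).to(device)  # bigram lookup table

@torch.no_grad()   # no backward pass in here, so no graph needs to be stored
def estimate_loss():
    out = {}
    model.eval()                         # no effect for this model, still a good habit
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits = model(X)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), Y.view(-1)).item()
        out[split] = losses.mean().item()
    model.train()                        # switch back to training mode
    return out

estimates = estimate_loss()
print(estimates)
```

Averaging over several batches gives a much steadier number than printing the loss of whichever single batch happened to be sampled.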
[41:48] now running this
[41:50] script gives us output in the terminal
[41:52] and it looks something like this it
[41:54] basically as I ran this code uh it was
[41:57] giving me the train loss and Val loss
[41:59] and we see that we converge to somewhere
[42:01] around
[42:01] 2.5 with the bigram model and then here's
[42:04] the sample that we produced at the
[42:07] end and so we have everything packaged
[42:09] up in the script and we're in a good
[42:11] position now to iterate on this okay so
[42:13] we are almost ready to start writing our
[42:15] very first self attention block for
[42:18] processing these uh tokens now before we
[42:22] actually get there I want to get you
[42:24] used to a mathematical trick that is
[42:26] used in the self attention inside a
[42:28] Transformer and is really just like at
[42:30] the heart of an efficient
[42:32] implementation of self attention and so
[42:34] I want to work with this toy example to
[42:36] just get you used to this operation and
[42:38] then it's going to make it much more
[42:39] clear once we actually get to um to it
[42:43] uh in the script
[42:44] again so let's create a B by T by C where
[42:47] B, T and C are just 4, 8 and 2 in the toy
[42:50] example and these are basically channels
[42:53] and we have uh batches and we have the
[42:55] time component and we have information
[42:58] at each point in the sequence so
[43:01] C now what we would like to do is we
[43:03] would like these um tokens so we have up
[43:06] to eight tokens here in a batch and
[43:08] these eight tokens are currently not
[43:10] talking to each other and we would like
[43:11] them to talk to each other we'd like to
[43:13] couple them and in particular
[43:17] we want to couple them in a very
[43:18] specific way so the token for example at
[43:21] the fifth location it should not
[43:23] communicate with tokens in the sixth
[43:25] seventh and eighth location
[43:27] because uh those are future tokens in
[43:29] the sequence the token on the fifth
[43:31] location should only talk to the one in
[43:33] the fourth third second and first so
[43:36] so information only flows from
[43:38] previous context to the current time
[43:40] step and we cannot get any information
[43:42] from the future because we are about to
[43:44] try to predict the
[43:45] future so what is the easiest way for
[43:49] tokens to communicate okay the easiest
[43:52] way I would say is okay
[43:54] if I'm the fifth token and I'd like to
[43:56] communicate with my past the simplest
[43:58] way we can do that is
[44:00] to just do an average of all
[44:03] the preceding elements so
[44:06] for example if I'm the fifth token I would
[44:08] like to take the channels uh that make
[44:10] up the information at my step but
[44:13] then also the channels from the fourth
[44:15] step third step second step and the
[44:17] first step I'd like to average those up
[44:19] and then that would become sort of like
[44:21] a feature Vector that summarizes me in
[44:23] the context of my history now of course
[44:26] just doing a sum or like an average is
[44:28] an extremely weak form of interaction
[44:30] like this communication is uh extremely
[44:32] lossy we've lost a ton of information
[44:34] about the spatial Arrangements of all
[44:35] those tokens uh but that's okay for now
[44:38] we'll see how we can bring that
[44:39] information back later for now what we
[44:41] would like to do is for every single
[44:43] batch element independently for every
[44:46] t-th token in that sequence we'd like
[44:49] to now calculate the average of all the
[44:53] vectors in all the previous tokens and
[44:55] also at this token so let's write that
[44:58] out um I have a small snippet here and
[45:01] instead of just fumbling around let me
[45:03] just copy paste it and talk to
[45:05] it so in other words we're going to
[45:08] create xbow and bow is short for bag of words
[45:12] because bag of words is um is kind of
[45:15] like um a term that people use when you
[45:17] are just averaging up things so this is
[45:19] just a bag of words basically there's a
[45:21] word stored on every one of these eight
[45:23] locations and we're doing a bag of words
[45:25] we're just averaging
[45:27] so in the beginning we're going to say
[45:28] that it's just initialized at Zero and
[45:30] then I'm doing a for Loop here so we're
[45:32] not being efficient yet that's coming
[45:34] but for now we're just iterating over
[45:36] all the batch Dimensions independently
[45:38] iterating over time and then the
[45:40] previous uh tokens are at this uh batch
[45:45] Dimension and then everything up to and
[45:47] including the t-th token okay so when
[45:51] we slice out x in this way xprev
[45:54] becomes of shape um how many t elements
[45:58] there were in the past and then of
[46:00] course C so all the two-dimensional
[46:02] information from these little tokens so
[46:05] that's the previous uh sort of chunk of
[46:08] um tokens from my current sequence and
[46:12] then I'm just doing the average or the
[46:13] mean over the zero Dimension so I'm
[46:15] averaging out the time here and I'm just
[46:19] going to get a little one-dimensional vector of size C
[46:21] which I'm going to store in xbow, x bag
[46:23] of words so I can run this and uh
[46:27] this is not going to be very informative
[46:30] because let's see so this is x of zero so
[46:32] this is the zeroth batch element and
[46:35] then xbow at zero now you see how at
[46:40] the first location here you see that the
[46:42] two are equal and that's because it's
[46:45] we're just doing an average of this one
[46:46] token but here this one is now an
[46:49] average of these two and now this one is
[46:53] an average of these
[46:54] three and so on
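A minimal sketch of the loop version just described, assuming a random `x` of shape 4 by 8 by 2 as in the toy example; the seed is arbitrary.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2                     # batch, time, channels
x = torch.randn(B, T, C)

# We want xbow[b, t] = mean of x[b, i] over all i <= t.
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t + 1]          # (t + 1, C): this token and everything before it
        xbow[b, t] = torch.mean(xprev, 0)

print(x[0])
print(xbow[0])
```

The first row of `xbow[0]` equals the first row of `x[0]`, since it averages a single token, and the last row is the average over all eight positions.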
[46:57] so uh and this last one is the average
[47:01] of all of these elements so vertical
[47:03] average just averaging up all the tokens
[47:05] now gives this outcome
[47:07] here so this is all well and good uh but
[47:10] this is very inefficient now the trick
[47:12] is that we can be very very efficient
[47:14] about doing this using matrix
[47:16] multiplication so that's the
[47:18] mathematical trick and let me show you
[47:19] what I mean let's work with the toy
[47:21] example here let me run it and I'll
[47:24] explain I have a simple Matrix here that
[47:27] is a 3X3 of all ones a matrix B of just
[47:31] random numbers and it's a 3x2 and a
[47:33] matrix c which will be 3x3 multiplying 3x2
[47:36] which will give out a 3x2 so here we're
[47:39] just using um matrix multiplication so a
[47:43] multiply B gives us
[47:46] C okay so how are these numbers in C um
[47:51] achieved right so this number in the top
[47:54] left is the first row of a dot product
[47:57] with the First Column of B and since all
[48:00] the first row of a right now is all just
[48:02] ones then the dot product here with
[48:05] this column of b is just going to do a
[48:07] sum of this column so 2 + 6 + 6
[48:11] is
[48:12] 14 the element here in the output of c
[48:15] is also the first
[48:17] row of a multiplied now with the second
[48:20] column of B so 7 + 4 + 5 is 16 now you
[48:25] see that there's repeating elements here
[48:26] so this 14 again is because this row is
[48:28] again all ones and it's multiplying the
[48:30] First Column of B so we get 14 and this
[48:33] one is and so on so this last number
[48:35] here is the last row dot product with the last
[48:39] column now the trick here is uh the
[48:42] following this is just a boring number
[48:44] of um it's just a boring array of all
[48:48] ones but torch has this function called
[48:50] tril which is short for
[48:54] triangular uh something like that and
[48:56] you can wrap torch.ones in it and it
[48:58] will just return the lower triangular
[49:00] portion of this
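Here is a small sketch of the toy example up to this point. The entries of `b` are the numbers read out in the video (2, 6, 6 in the first column and 7, 4, 5 in the second), hard-coded here instead of drawn at random.

```python
import torch

a = torch.ones(3, 3)
b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])          # values read out in the lecture
c = a @ b                             # every row of c holds the column sums of b
print(c)                              # each row is [14., 16.]

lower = torch.tril(torch.ones(3, 3))  # keeps the lower triangle, zeros above the diagonal
print(lower)
```

With all ones in `a`, each output row sums every row of `b`; the lower-triangular version is what lets each output row sum only the rows up to its own position.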
[49:03] okay so now it will basically zero out
[49:06] uh these guys here so we just get the
[49:08] lower triangular part well what happens
[49:10] if we do
[49:14] that so now we'll have a like this and B
[49:17] like this and now what are we getting
[49:18] here in C well what is this number well
[49:22] this is the first row times the First
[49:24] Column and because this is zeros
[49:28] uh these elements here are now ignored
[49:30] so we just get a two and then this
[49:32] number here is the first row times the
[49:35] second column and because these are
[49:37] zeros they get ignored and it's just
[49:39] seven this seven multiplies this one but
[49:42] look what happened here because this is
[49:43] one and then zeros what ended up
[49:46] happening is we're just plucking out the
[49:48] first row of b and that's what we
[49:51] got now here we have 1, 1, 0 so here 1, 1, 0
[49:57] dot product with these two columns will
[49:59] now give us 2 + 6 which is 8 and 7 + 4
[50:02] which is 11 and because this is 111 we
[50:05] ended up with the addition of all of
[50:07] them and so basically depending on how
[50:10] many ones and zeros we have here we are
[50:12] basically doing a sum currently of a
[50:16] variable number of these rows and that
[50:18] gets deposited into
[50:20] C So currently we're doing sums because
[50:23] these are ones but we can also do
[50:25] average right and you can start to see
[50:27] how we could do average uh of the rows
[50:29] of B uh sort of in an incremental
[50:32] fashion because we don't have to we can
[50:35] basically normalize these rows so that
[50:37] they sum to one and then we're going to
[50:39] get an average so if we took a and then
[50:41] we did a equals
[50:43] a divided by torch.sum of a along
[50:51] dimension one and then let's set keepdim
[50:55] to True so therefore the broadcasting
[50:57] will work out so if I rerun this you see
[51:00] now that these rows now sum to one so
[51:04] this row is one, this row is 0.5, 0.5, 0 and
[51:07] here we get 1/3 and now when we do a
[51:09] multiply B what are we getting here we
[51:12] are just getting the first row first row
[51:15] here now we are getting the average of
[51:18] the first two
[51:20] rows okay so 2 and six average is four
[51:23] and four and seven average is
[51:25] 5.5 and on the bottom here we are now
[51:27] getting the average of these three rows
[51:31] so the average of all the rows of b
[51:33] is now deposited here and so you can
[51:36] see that by manipulating these uh
[51:40] elements of this multiplying Matrix and
[51:42] then multiplying it with any given
[51:44] Matrix we can do these averages in this
[51:47] incremental fashion because we just get
[51:50] um and we can manipulate that based on
[51:53] the elements of a okay so that's very
[51:55] convenient so let's let's swing back up
[51:57] here and see how we can vectorize this
[51:59] and make it much more efficient using
[52:00] what we've learned so in
[52:03] particular we are going to produce an
[52:05] array a but here I'm going to call it wei
[52:08] short for weights but this is our
[52:11] a and this is how much of every row we
[52:14] want to average up and it's going to be
[52:17] an average because you can see that
[52:18] these rows sum to
[52:20] one so this is our a and then our B in
[52:23] this example of course is X
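The toy example with the lower-triangular matrix, before and after normalizing its rows, can be sketched as follows. As before, `b` uses the numbers read out in the video, and the variable names `running_sums` and `running_means` are just illustrative.

```python
import torch

b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])                    # values read out in the lecture

a = torch.tril(torch.ones(3, 3))
running_sums = a @ b                            # row i sums rows 0..i of b
print(running_sums)                             # [[2, 7], [8, 11], [14, 16]]

# Normalize each row of a so it sums to one; keepdim=True keeps the
# sum as a (3, 1) column so it broadcasts across each row.
a_norm = a / torch.sum(a, 1, keepdim=True)
running_means = a_norm @ b                      # row i averages rows 0..i of b
print(running_means)                            # [[2, 7], [4, 5.5], [4.667, 5.333]]
```

The only change between the two products is the weights in the left-hand matrix, which is what makes it natural to call that matrix the weights.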
[52:27] so what's going to happen here now is
[52:29] that we are going to have an xbow2
[52:31] and this xbow2 is going to be wei
[52:36] multiplying
[52:38] our x so let's think this through wei is T by T
[52:42] and this is matrix multiplying in
[52:44] PyTorch a B by T by
[52:47] C and it's giving us uh what
[52:50] shape so PyTorch will come here and it
[52:52] will see that these shapes are not the
[52:54] same so it will create a batch Dimension
[52:57] here and this is a batched matrix
[53:00] multiply and so it will apply this
[53:02] matrix multiplication in all the batch
[53:04] elements um in parallel and individually
[53:08] and then for each batch element there
[53:09] will be a T by T multiplying T by C
[53:12] exactly as we had
[53:15] below so this will now create B by T by
[53:20] C and xbow2 will now become identical
[53:24] to xbow
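A sketch comparing the loop version with the matrix-multiply version. The small absolute tolerance passed to `torch.allclose` is my addition, since the two versions add the same floats in a different order.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# version 1: explicit loops over batch and time
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(0)

# version 2: one batched matrix multiply with row-normalized lower-triangular weights
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x        # (T, T) @ (B, T, C): wei is broadcast over the batch -> (B, T, C)

print(torch.allclose(xbow, xbow2, atol=1e-5))
```

The same `wei` is applied to every batch element independently, so no token ever sees another sequence in the batch.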
[53:28] so we can see that torch.allclose of
[53:32] xbow and xbow2 should be true
[53:36] now so this kind of like convinces us
[53:38] that uh these are in fact um the same so
[53:43] xbow and xbow2 if I just print
[53:47] them uh okay we're not going to be able
[53:49] to
[53:51] just stare it down but
[53:54] um well let me try xbow basically just
[53:56] at the zeroth element and xbow2 at
[53:58] the zeroth element so just the first
[53:59] batch and we should see that this and
[54:02] that should be identical which they
[54:04] are right so what happened here the
[54:07] trick is we were able to use batched
[54:09] Matrix multiply to do this uh
[54:12] aggregation really and it's a weighted
[54:15] aggregation and the weights are
[54:17] specified in this um T by T array and
[54:21] we're basically doing weighted sums and
[54:24] uh these weighted sums are
[54:26] according to uh the weights inside here
[54:28] they take on sort of this triangular
[54:31] form and so that means that a token at
[54:33] the t-th position will only get uh sort
[54:36] of um information from the um tokens
[54:39] preceding it so that's exactly what we
[54:41] want and finally I would like to rewrite
[54:43] it in one more way and we're going to
[54:46] see why that's useful so this is the
[54:48] third version and it's also identical to
[54:50] the first and second but let me talk
[54:53] through it it uses
[54:54] softmax so tril here is this matrix
[55:00] lower triangular
[55:01] ones wei begins as all
[55:05] zero okay so if I just print wei in the
[55:07] beginning it's all zero then I
[55:11] used masked_fill so what this is doing
[55:15] is wei.masked_fill it's all zeros and
[55:18] I'm saying for all the elements where
[55:20] tril is equal equal zero make them be
[55:23] negative infinity so all the elements
[55:26] where tril is zero will become negative
[55:28] Infinity now so this is what we get and
[55:32] then the final line here is
[55:36] softmax so if I take a softmax along
[55:38] every single so dim is negative one so
[55:40] along every single row if I do softmax
[55:44] what is that going to
[55:46] do well softmax is um is also like a
[55:51] normalization operation right and so
[55:54] spoiler alert you get the exact same
[55:58] matrix let me bring back the
[56:00] softmax and recall that in softmax we're
[56:02] going to exponentiate every single one
[56:04] of these and then we're going to divide
[56:06] by the sum and so if we exponentiate
[56:10] every single element here we're going to
[56:11] get a one and here we're going to get uh
[56:14] basically zero zero zero zero everywhere else
[56:17] and then when we normalize we just get
[56:19] one here we're going to get one one and
[56:21] then zeros and then softmax will again
[56:24] divide and this will give us 0.5 0.5 and so
[56:27] on and so this is also the same
[56:30] way to produce uh this mask now the
[56:33] reason that this is a bit more
[56:34] interesting and the reason we're going
[56:36] to end up using it in self
[56:37] attention is that these weights here
[56:41] begin uh with zero and you can think of
[56:44] this as like an interaction strength or
[56:46] like an affinity so basically it's
[56:49] telling us how much of each uh token
[56:52] from the past do we want to Aggregate
[56:54] and average up
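The softmax version can be sketched like this, together with a check that it reproduces the row-normalized weights. The random `x` and the name `wei_avg` are assumptions added for the comparison.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))                          # affinities, all zero for now
wei = wei.masked_fill(tril == 0, float('-inf'))    # positions above the diagonal are blocked
wei = F.softmax(wei, dim=-1)                       # exp(0) = 1, exp(-inf) = 0, then normalize rows
xbow3 = wei @ x                                    # (B, T, C)

# Same weights built by direct normalization, for comparison.
wei_avg = tril / tril.sum(1, keepdim=True)
print(torch.allclose(wei, wei_avg))
print(wei[:3])
```

Because every unmasked entry starts at zero, softmax spreads the weight evenly across them; once those entries stop being constant, the same two lines produce uneven weights.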
[56:57] and then this line is saying tokens from
[56:59] the future cannot communicate by setting
[57:02] them to negative Infinity we're saying
[57:04] that we will not aggregate anything from
[57:06] those
[57:07] tokens and so basically this then goes
[57:09] through softmax and through the weighted
[57:11] and this is the aggregation through
[57:12] matrix
[57:14] multiplication and so what this is now
[57:16] is you can think of these as um these
[57:19] zeros are currently just set by us to be
[57:21] zero but a quick preview is that these
[57:25] affinities between the tokens are not
[57:27] going to be just constant at zero
[57:29] they're going to be data dependent these
[57:31] tokens are going to start looking at
[57:32] each other and some tokens will find
[57:34] other tokens more or less interesting
[57:37] and depending on what their values are
[57:39] they're going to find each other
[57:41] interesting to different amounts and I'm
[57:42] going to call those affinities I think
[57:45] and then here we are saying the future
[57:47] cannot communicate with the past we're
[57:49] going to clamp them and then when
[57:51] we normalize and sum we're going to
[57:53] aggregate uh sort of their values
[57:56] depending on how interesting they find
[57:57] each other and so that's the preview for
[57:59] self attention and basically long story
[58:03] short from this entire section is that
[58:05] you can do weighted aggregations of your
[58:07] past
[58:08] elements by using matrix
[58:12] multiplication with a lower triangular
[58:14] matrix and then the elements here in
[58:17] the lower triangular part are telling
[58:18] you how much of each element fuses
[58:21] into this position so we're going to use
[58:24] this trick now to develop the self
[58:25] attention block so first let's get
[58:27] some quick preliminaries out of the way
[58:30] first the thing I'm kind of bothered by
[58:31] is that you see how we're passing in
[58:33] vocab_size into the constructor there's
[58:35] no need to do that because vocab_size is
[58:36] already defined up top as a global
[58:38] variable so there's no need to pass this
[58:40] stuff
[58:41] around next what I want to do is
[58:44] I want to create
[58:46] a level of indirection here where
[58:47] we don't directly go to the embedding
[58:49] for the logits but instead we go
[58:52] through this intermediate phase because
[58:54] we're going to start making that bigger
[58:57] so let me introduce a new variable
[58:59] n_embd it's short for number of embedding
[59:02] dimensions so
[59:04] n_embd here will be say 32 that was a
[59:09] suggestion from GitHub Copilot by the
[59:11] way it also suggested 32 which is a good
[59:14] number so this is an embedding table with
[59:16] only 32 dimensional
[59:18] embeddings so then here this is not
[59:21] going to give us logits directly instead
[59:23] this is going to give us token
[59:24] embeddings that's what I'm going to call it
[59:27] and then to go from the token embeddings to
[59:29] the logits we're going to need a linear
[59:30] layer so self.lm_head let's call it
[59:34] short for language modeling head is
[59:36] nn.Linear from n_embd up to vocab_size
[59:39] and then when we swing over here we're
[59:41] actually going to get the logits by
[59:43] exactly what the Copilot says now we
[59:46] have to be careful here because this C
[59:48] and this C are not equal this is n_embd
[59:52] C and this is vocab_size so let's just
[59:55] say that n_embd is equal to
[59:57] C and then this just creates one spurious
[60:01] layer of interaction through a linear
[60:02] layer but this should basically
[60:11] run so we see that this runs and this
[60:15] currently looks kind of spurious but
[60:17] we're going to build on top of this now
[60:19] next up so far we've taken these indices
[60:22] and we've encoded them based on the
[60:23] identity of the tokens inside idx
[60:28] the next thing that people very often do
[60:30] is that we're not just encoding the
[60:31] identity of these tokens but also their
[60:33] position so we're going to have a second
[60:35] position embedding table here so
[60:38] self.position_embedding_table is an
[60:41] nn.Embedding of block_size by n_embd and
[60:44] so each position from zero to block_size
[60:46] minus one will also get its own
[60:47] embedding vector and then here first let
[60:50] me decode B and T from idx.shape
[60:54] and then here we're also going to
[60:56] have a pos_emb which is the
[60:58] positional embedding and
[61:00] this is torch.arange so this will be basically
[61:03] just integers from 0 to T minus one and
[61:06] all of those integers from 0 to T minus
[61:08] one get embedded through the table to
[61:09] create a T by
[61:11] C and then here this gets renamed to
[61:14] just say x and x will be the addition of
[61:18] the token embeddings with the positional
[61:20] embeddings and here the broadcasting
[61:22] note will work out so B by T by C plus T
[61:25] by C
[61:26] this gets right aligned a new dimension
[61:28] of one gets added and it gets
[61:30] broadcasted across
[61:31] batch so at this point x holds not just
[61:34] the token identities but the positions
[61:37] at which these tokens occur and this is
[61:39] currently not that useful because of
[61:41] course we just have a simple bigram model
[61:43] so it doesn't matter if you're in the
[61:44] fifth position the second position or
[61:46] wherever it's all translation invariant
[61:48] at this stage uh so this information
[61:50] currently wouldn't help uh but as we
[61:52] work on the self attention block we'll
[61:54] see that this starts to matter
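A minimal sketch of the forward pass just described, assuming PyTorch. The value `vocab_size = 65` and the batch shape are illustrative assumptions; `n_embd = 32` and `block_size = 8` follow the talk, and the variable names mirror the ones mentioned above.

```python
import torch
import torch.nn as nn

vocab_size = 65   # illustrative assumption, not stated in this section
block_size = 8    # maximum context length
n_embd = 32       # number of embedding dimensions

token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size)

idx = torch.randint(0, vocab_size, (4, 8))  # (B, T) batch of token indices
B, T = idx.shape

tok_emb = token_embedding_table(idx)                 # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(T))  # (T, n_embd)
x = tok_emb + pos_emb   # (T, n_embd) is broadcast across the batch dimension
logits = lm_head(x)     # (B, T, vocab_size)
```

Every batch element receives the same positional vectors, which is exactly what the broadcast of a (T, C) tensor against a (B, T, C) tensor does.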
[61:59] okay so now we get to the crux of self
[62:01] attention so this is probably the most
[62:03] important part of this video to
[62:05] understand we're going to implement a
[62:07] small self attention for a single
[62:08] individual head as they're called so we
[62:11] start off with where we were so all of
[62:13] this code is familiar so right now I'm
[62:16] working with an example where I changed
[62:17] the number of channels from 2 to 32 so
[62:20] we have a 4x8 arrangement of tokens and
[62:24] the information at each token
[62:26] is currently 32 dimensional but we just
[62:28] are working with random
[62:30] numbers now we saw here that the code as
[62:34] we had it before does a
[62:37] simple average of all the past tokens
[62:41] and the current token so it's just the
[62:43] previous information and current
[62:44] information is just being mixed together
[62:45] in an average and that's what this code
[62:48] currently achieves and it does so by
[62:50] creating this lower triangular structure
[62:52] which allows us to mask out this
[62:55] wei matrix that we create so we mask it
[62:59] out and then we normalize it and
[63:01] currently when we initialize the
[63:03] affinities between all the different
[63:05] sort of tokens or nodes I'm going to use
[63:08] those terms
[63:09] interchangeably so when we initialize
[63:11] the affinities between all the different
[63:13] tokens to be zero then we see that wei
[63:16] gives us this structure where every
[63:18] single row has these uniform numbers
[63:22] and so that's what then
[63:25] in this matrix multiply makes it so that
[63:27] we're doing a simple
[63:28] average now we don't actually want this
[63:32] to be all uniform because different uh
[63:36] tokens will find different other tokens
[63:38] more or less interesting and we want
[63:40] that to be data dependent so for example
[63:42] if I'm a vowel then maybe I'm looking
[63:44] for consonants in my past and maybe I
[63:46] want to know what those consonants are
[63:48] and I want that information to flow to
[63:50] me and so I want to now gather
[63:52] information from the past but I want to
[63:54] do it in the data dependent way and this
[63:56] is the problem that self attention
[63:58] solves now the way self attention solves
[64:00] this is the following every single node
[64:03] or every single token at each position
[64:06] will emit two vectors it will emit a
[64:09] query and it will emit a
[64:12] key now the query Vector roughly
[64:15] speaking is what am I looking for and
[64:18] the key Vector roughly speaking is what
[64:20] do I
[64:21] contain and then the way we get
[64:24] affinities between these tokens now
[64:27] in a sequence is we basically just do a
[64:29] dot product between the keys and the
[64:31] queries so my query dot products with
[64:35] all the keys of all the other tokens and
[64:37] that dot product now becomes
[64:41] wei and so if the key and the query
[64:45] are sort of aligned they will interact
[64:47] to a very high amount and then I will
[64:50] get to learn more about that specific
[64:52] token as opposed to any other token in
[64:55] the sequence
[64:56] so let's implement this
[65:00] now we're going to implement a
[65:03] single what's called head of self
[65:07] attention so this is just one head
[65:09] there's a hyper parameter involved with
[65:10] these heads which is the head size and
[65:13] then here I'm initializing linear
[65:15] modules and I'm using bias equals false
[65:18] so these are just going to apply a
[65:19] matrix multiply with some fixed
[65:21] weights and now let me produce a key and
[65:26] query k and q by forwarding these modules on
[65:29] X so the size of this will now
[65:32] become B by T by 16 because that is the
[65:36] head size and the same here B by T by
[65:44] 16 so this being the head size so you
[65:47] see here that when I forward this linear
[65:49] on top of my X all the tokens in all the
[65:52] positions in the B by T arrangement all
[65:55] of them in parallel and
[65:57] independently produce a key and a query
[65:59] so no communication has happened
[66:01] yet but the communication comes now all
[66:04] the queries will dot product with all the
[66:07] keys so basically what we want is we
[66:09] want wei now or the affinities between
[66:12] these to be query multiplying key but we
[66:16] have to be careful we can't just
[66:18] matrix multiply this we actually need to
[66:20] transpose k but we also have to be
[66:23] careful because we have
[66:25] the batch dimension so in particular we
[66:27] want to transpose the last two
[66:30] dimensions dimension -1 and dimension -2
[66:33] so
[66:36] -2 -1 and so this matrix multiply now will
[66:40] basically do the following B by T by
[66:44] 16 Matrix multiplies B by 16 by T to
[66:49] give us B by T by
[66:53] T right
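The shape bookkeeping above can be checked with a short sketch, assuming PyTorch. The random input and the fixed seed are illustrative; B = 4, T = 8, C = 32 and a head size of 16 follow the talk.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability
B, T, C = 4, 8, 32
head_size = 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)

k = key(x)    # (B, T, 16): what each token contains
q = query(x)  # (B, T, 16): what each token is looking for

# Transpose only the last two dimensions of k so the batch dimension is kept:
# (B, T, 16) @ (B, 16, T) -> (B, T, T)
wei = q @ k.transpose(-2, -1)
```

Entry `wei[b, i, j]` is the dot product between the query of token i and the key of token j within batch element b.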
[66:56] so for every row of B we're now going to
[66:58] have a t Square Matrix giving us the
[67:01] affinities and these are now the wei so
[67:04] they're not zeros they are now coming
[67:06] from this dot product between the keys
[67:08] and the queries so this can now run
[67:11] I can run this and the weighted
[67:13] aggregation now is a function in a data
[67:16] dependent manner between the keys and
[67:18] queries of these nodes so just
[67:20] inspecting what happened
[67:22] here the wei takes on this form
[67:26] and you see that before wei was just
[67:29] a constant so it was applied in the same
[67:31] way to all the batch elements but now
[67:33] every single batch element will have
[67:34] different wei because every
[67:37] single batch element contains different
[67:39] tokens at different positions and so
[67:41] this is now data dependent so when we
[67:44] look at just the zeroth uh Row for
[67:47] example in the input these are the
[67:49] weights that came out and so you can see
[67:51] now that they're not just exactly
[67:53] uniform um and in particular as an
[67:55] example here for the last row this was
[67:58] the eighth token and the eighth token
[68:00] knows what content it has and it knows
[68:02] at what position it's in and now the eighth
[68:04] token based on that creates a query
[68:08] hey I'm looking for this kind of stuff
[68:10] I'm a vowel I'm in the eighth position I'm
[68:12] looking for any consonants at positions
[68:14] up to four and then all the nodes get to
[68:18] emit keys and maybe one of the channels
[68:20] could be I am a consonant and I
[68:23] am in a position up to four and
[68:25] that key would have a high number in
[68:27] that specific channel and that's how the
[68:29] query and the key when they dot product
[68:31] they can find each other and create a
[68:33] high affinity and when they have a high
[68:35] affinity like say this token was
[68:38] pretty interesting to this eighth
[68:41] token when they have a high Affinity
[68:43] then through the softmax I will end up
[68:45] aggregating a lot of its information
[68:47] into my position and so I'll get to
[68:49] learn a lot about
[68:51] it now we're looking at wei
[68:55] after this has already happened let
[68:59] me erase this operation as well so let
[69:01] me erase the masking and the softmax
[69:03] just to show you the under the hood
[69:04] internals and how that works so without
[69:07] the masking and the softmax wei comes
[69:09] out like this right this is the output
[69:11] of the dot products and these are the
[69:14] raw outputs and they take on values from
[69:15] negative you know two to positive two
[69:18] etc so that's the raw interactions and
[69:21] raw affinities between all the nodes but
[69:24] now if I'm a fifth node I
[69:26] will not want to aggregate anything from
[69:28] the sixth node seventh node and the
[69:30] eighth node so actually we mask out the upper
[69:32] triangular part so those are not
[69:35] allowed to
[69:37] communicate and now we actually want to
[69:40] have a nice uh distribution uh so we
[69:42] don't want to aggregate negative 0.11 of
[69:45] this node that's crazy so instead we
[69:47] exponentiate and normalize and now we
[69:49] get a nice distribution that sums to one
[69:51] and this is telling us now in the data
[69:52] dependent manner how much of information
[69:54] to aggregate from any of these tokens in
[69:56] the
[69:58] past so that's way and it's not zeros
[70:01] anymore but but it's calculated in this
[70:04] way now there's one more uh part to a
[70:08] single self attention head and that is
[70:10] that when we do the aggregation we don't
[70:12] actually aggregate the tokens exactly
[70:15] we produce one more vector here
[70:17] and we call that the
[70:20] value so in the same way that we
[70:22] produced key and query we're also going to
[70:23] create a value
[70:26] and
[70:26] then here we don't
[70:30] aggregate x we calculate a v which is
[70:34] just achieved by propagating this
[70:37] linear on top of x again and then we
[70:40] output wei multiplied by v so v is the
[70:44] elements that we aggregate or the the
[70:46] vectors that we aggregate instead of the
[70:47] raw
[70:48] X and now of course uh this will make it
[70:51] so that the output here of this single
[70:53] head will be 16 dimensional because that
[70:55] is the head
[70:57] size so you can think of X as kind of
[70:59] like private information to this token
[71:01] if you if you think about it that way so
[71:03] X is kind of private to this token so
[71:06] I'm a fifth token and I have
[71:08] some identity and uh my information is
[71:11] kept in Vector X and now for the
[71:14] purposes of the single head here's what
[71:16] I'm interested in here's what I have and
[71:20] if you find me interesting here's what I
[71:21] will communicate to you and that's
[71:23] stored in v and so V is the thing that
[71:26] gets aggregated for the purposes of this
[71:28] single head between the different
[71:30] nodes and that's basically the self
[71:34] attention mechanism this is what
[71:36] it does there are a few notes that I
[71:39] would like to make about attention
[71:41] number one attention is a communication
[71:44] mechanism you can really think about it
[71:46] as a communication mechanism where you
[71:48] have a number of nodes in a directed
[71:50] graph where basically you have edges
[71:52] pointed between nodes like
[71:53] this and what happens is every node has
[71:56] some Vector of information and it gets
[71:58] to aggregate information via a weighted
[72:01] sum from all of the nodes that point to
[72:03] it and this is done in a data dependent
[72:06] manner so depending on whatever data is
[72:08] actually stored at each node at
[72:09] any point in time now our graph doesn't
[72:13] look like this our graph has a different
[72:15] structure we have eight nodes because
[72:17] the block size is eight and there's
[72:18] always eight
[72:20] tokens and the first node is only
[72:23] pointed to by itself the second node is
[72:25] pointed to by the first node and itself
[72:27] all the way up to the eighth node which
[72:29] is pointed to by all the previous nodes
[72:32] and itself and so that's the structure
[72:34] that our directed graph has or happens
[72:37] to have in an autoregressive sort
[72:38] of scenario like language modeling but
[72:41] in principle attention can be applied to
[72:42] any arbitrary directed graph and it's
[72:44] just a communication mechanism between
[72:46] the nodes the second note is that notice
[72:48] that there is no notion of space so
[72:51] attention simply acts over like a set of
[72:53] vectors in this graph and so by default
[72:56] these nodes have no idea where they are
[72:58] positioned in the space and that's why
[72:59] we need to encode them positionally and
[73:02] sort of give them some information that
[73:03] is anchored to a specific position so
[73:05] that they sort of know where they are
[73:08] and this is different than for example
[73:09] convolution because if you run
[73:11] for example a convolution operation over
[73:13] some input there's a very specific sort
[73:15] of layout of the information in space
[73:18] and the convolutional filters sort of
[73:20] act in space and so it's not like
[73:23] attention in attention there is just a set
[73:26] of vectors out there in space they
[73:27] communicate and if you want them to have
[73:29] a notion of space you need to
[73:31] specifically add it which is what we've
[73:33] done when we calculated the
[73:36] positional encodings and
[73:38] added that information to the vectors
[73:40] the next thing that I hope is very clear
[73:41] is that the elements across the batch
[73:43] Dimension which are independent examples
[73:45] never talk to each other they're always
[73:47] processed independently and this is a
[73:49] batched matrix multiply that applies
[73:51] basically a matrix multiplication uh
[73:53] kind of in parallel across the batch
[73:54] dimension so maybe it would be more
[73:56] accurate to say that in this analogy of
[73:58] a directed graph we really have because
[74:00] the batch size is four we really have
[74:03] four separate pools of eight nodes and
[74:05] those eight nodes only talk to each
[74:07] other but in total there's like 32 nodes
[74:08] that are being processed uh but there's
[74:11] um sort of four separate pools of eight
[74:13] you can look at it that way the next
[74:15] note is that here in the case of
[74:18] language modeling uh we have this
[74:20] specific uh structure of directed graph
[74:22] where the future tokens will not
[74:24] communicate to the Past tokens but this
[74:27] doesn't necessarily have to be the
[74:28] constraint in the general case and in
[74:30] fact in many cases you may want to have
[74:32] all of the nodes talk to each other
[74:35] fully so as an example if you're doing
[74:37] sentiment analysis or something like
[74:38] that with a Transformer you might have a
[74:40] number of tokens and you may want to
[74:42] have them all talk to each other fully
[74:45] because later you are predicting for
[74:46] example the sentiment of the sentence
[74:49] and so it's okay for these nodes to talk
[74:50] to each other and so in those cases you
[74:53] will use an encoder block of self
[74:55] attention and all it means for it to be
[74:58] an encoder block is that you will delete
[75:00] this line of code allowing all the nodes
[75:02] to completely talk to each other what
[75:04] we're implementing here is sometimes
[75:06] called a decoder block and it's called a
[75:09] decoder because it is sort of like
[75:12] decoding language and it's got this
[75:15] autoregressive format where you have
[75:17] to mask with the triangular matrix so
[75:19] that uh nodes from the future never talk
[75:22] to the Past because they would give away
[75:24] the answer
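The difference between the two kinds of block comes down to whether the masking line is applied. A minimal sketch, assuming PyTorch; the helper `attention_weights` and its `causal` flag are hypothetical names introduced only for this illustration.

```python
import torch
import torch.nn.functional as F

def attention_weights(scores: torch.Tensor, causal: bool) -> torch.Tensor:
    """Turn raw (T, T) affinities into attention weights whose rows sum to one.

    `causal` is a hypothetical flag: True keeps the triangular mask
    (decoder style), False skips it (encoder style).
    """
    if causal:
        T = scores.shape[-1]
        tril = torch.tril(torch.ones(T, T))
        scores = scores.masked_fill(tril == 0, float("-inf"))
    return F.softmax(scores, dim=-1)

scores = torch.randn(5, 5)  # illustrative raw affinities for 5 nodes
decoder_wei = attention_weights(scores, causal=True)   # no weight on future nodes
encoder_wei = attention_weights(scores, causal=False)  # every node sees every node
```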
[75:25] and so basically in encoder blocks you
[75:27] would delete this allow all the nodes to
[75:29] talk in decoder blocks this will always
[75:31] be present so that you have this
[75:33] triangular structure uh but both are
[75:35] allowed and attention doesn't care
[75:36] attention supports arbitrary
[75:38] connectivity between nodes the next
[75:40] thing I wanted to comment on is you keep
[75:41] hearing me say attention
[75:43] self attention Etc there's actually also
[75:45] something called cross attention what is
[75:47] the
[75:47] difference
[75:49] so basically the reason this attention
[75:52] is self attention is because because the
[75:55] keys queries and the values are all
[75:57] coming from the same Source from X so
[76:01] the same Source X produces Keys queries
[76:03] and values so these nodes are self
[76:05] attending but in principle attention is
[76:08] much more General than that so for
[76:10] example in encoder-decoder transformers
[76:12] uh you can have a case where the queries
[76:15] are produced from X but the keys and the
[76:17] values come from a whole separate
[76:18] external source and sometimes from uh
[76:21] encoder blocks that encode some context
[76:23] that we'd like to condition on
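A rough sketch of cross attention under these assumptions: PyTorch, a hypothetical `context` tensor standing in for the external source (for example encoder output), and illustrative sizes. The queries come from `x` while the keys and values come from `context`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, T_ctx, C, head_size = 2, 8, 5, 32, 16  # illustrative sizes
x = torch.randn(B, T, C)            # our own nodes, which produce the queries
context = torch.randn(B, T_ctx, C)  # hypothetical separate source of nodes

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(x)        # (B, T, head_size)
k = key(context)    # (B, T_ctx, head_size)
v = value(context)  # (B, T_ctx, head_size)

wei = F.softmax(q @ k.transpose(-2, -1), dim=-1)  # (B, T, T_ctx)
out = wei @ v                                     # (B, T, head_size)
```

Replacing `context` with `x` in the key and value lines turns this back into self attention.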
[76:25] and so the keys and the values will
[76:26] actually come from a whole separate
[76:28] Source those are nodes on the side and
[76:31] here we're just producing queries and
[76:32] we're reading off information from the
[76:34] side so cross attention is used when
[76:37] there's a separate source of nodes we'd
[76:40] like to pull information from into our
[76:42] nodes and it's self attention if we just
[76:45] have nodes that would like to look at
[76:46] each other and talk to each other so
[76:48] this attention here happens to be self
[76:51] attention but in principle um attention
[76:55] is a lot more General okay and the last
[76:57] note at this stage is if we come to the
[76:59] Attention Is All You Need paper here we've
[77:01] already implemented attention so given
[77:03] query key and value we've multiplied
[77:06] the query and the key we've softmaxed it
[77:09] and then we are aggregating the values
[77:11] there's one more thing that we're
[77:12] missing here which is the dividing by
[77:13] the square root of the head size the
[77:16] d_k here is the head size why are they
[77:18] doing this why is this important so they
[77:21] call it scaled attention and it's
[77:24] kind of like an important normalization
[77:25] to basically
[77:26] have the problem is if you have unit Gaussian
[77:29] inputs so zero mean unit variance k
[77:32] and q are unit Gaussian then if you just
[77:34] do wei naively then you see that
[77:37] the variance of wei will be
[77:38] on the order of head size which in our
[77:40] case is 16 but if you multiply by one
[77:43] over head size square root so this is
[77:45] square root and this is one
[77:47] over then the variance of wei will be one
[77:50] so it will be
[77:52] preserved now why is this important
[77:54] you'll notice that wei
[77:56] here will feed into
[77:58] softmax and so it's really important
[78:00] especially at initialization that wei be
[78:03] fairly diffuse so in our case here we
[78:06] sort of lucked out here and we had
[78:10] fairly diffuse numbers here so like
[78:13] this now the problem is that because of
[78:15] softmax if wei takes on very positive
[78:18] and very negative numbers inside it
[78:20] softmax will actually converge towards
[78:22] one hot vectors and so I can illustrate
[78:25] that here um say we are applying softmax
[78:29] to a tensor of values that are very
[78:31] close to zero then we're going to get a
[78:33] diffuse thing out of
[78:34] softmax but the moment I take the exact
[78:36] same thing and I start sharpening it
[78:38] making it bigger by multiplying these
[78:40] numbers by eight for example you'll see
[78:42] that the softmax will start to sharpen
[78:44] and in fact it will sharpen towards the
[78:46] max so it will sharpen towards whatever
[78:48] number here is the highest and so um
[78:51] basically we don't want these values to
[78:52] be too extreme especially at
[78:53] initialization otherwise softmax will be
[78:55] way too peaky and you're basically
[78:58] aggregating information from like a
[79:01] single node every node just aggregates
[79:03] information from a single other node
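Both effects, the variance growing with head size and softmax sharpening as its inputs get larger, can be reproduced with a small experiment. This is a sketch assuming PyTorch; the batch and sequence sizes, the seed, and the example logits are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
B, T, head_size = 64, 32, 16
k = torch.randn(B, T, head_size)  # unit-Gaussian keys
q = torch.randn(B, T, head_size)  # unit-Gaussian queries

wei_raw = q @ k.transpose(-2, -1)       # variance on the order of head_size
wei_scaled = wei_raw * head_size**-0.5  # variance close to one

print(wei_raw.var().item(), wei_scaled.var().item())

logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
diffuse = F.softmax(logits, dim=-1)     # fairly even distribution
peaky = F.softmax(logits * 8, dim=-1)   # sharpens toward the largest entry
print(diffuse)
print(peaky)
```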
[79:04] that's not what we want especially at
[79:06] initialization and so the scaling is
[79:08] used just to control the variance at
[79:11] initialization okay so having said all
[79:13] that let's now take our self attention
[79:15] knowledge and let's uh take it for a
[79:17] spin so here in the code I created this
[79:19] head module and it implements a single
[79:22] head of self attention so you give it a
[79:24] head size and then here it creates the
[79:26] key query and the value linear layers
[79:29] typically people don't use biases in
[79:31] these uh so those are the linear
[79:33] projections that we're going to apply to
[79:34] all of our nodes now here I'm creating
[79:37] this tril variable tril is not a
[79:40] parameter of the module so in sort of
[79:41] PyTorch naming conventions this is
[79:43] called a buffer it's not a parameter and
[79:46] you have to assign
[79:47] it to the module using register_buffer
[79:49] so that creates the tril the
[79:52] lower triangular matrix and when we're given
[79:55] the input x this should look very
[79:56] familiar now we calculate the keys the
[79:58] queries we calculate the attention
[80:00] scores inside wei we normalize it so
[80:03] we're using scaled attention here then
[80:06] we make sure that uh future doesn't
[80:08] communicate with the past so this makes
[80:10] it a decoder block and then softmax and
[80:13] then aggregate the value and
[80:15] output then here in the language model
[80:17] I'm creating a head in the Constructor
[80:20] and I'm calling it self attention head
[80:22] and the head size I'm going to keep as
[80:24] the same as n_embd just for
[80:27] now and then here once we've encoded the
[80:31] information with the token embeddings
[80:32] and the position embeddings we're simply
[80:34] going to feed it into the self attention
[80:36] head and then the output of that is
[80:38] going to go into uh the decoder language
[80:42] modeling head and create the logits so
[80:44] this is sort of the simplest way to
[80:46] plug in a self attention component uh
[80:49] into our Network right now I had to make
[80:51] one more change which is that here in
[80:55] the generate we have to make sure
[80:57] that our idx that we feed into the model
[81:01] because now we're using positional
[81:02] embeddings we can never have more than
[81:04] block_size coming in because if idx is
[81:07] more than block_size then our position
[81:09] embedding table is going to run out of
[81:11] scope because it only has embeddings for
[81:12] up to block_size and so therefore I
[81:15] added some code here to crop the
[81:17] context that we're going to feed into
[81:20] self so that we never pass in more
[81:23] than block_size elements
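The cropping step might look like the sketch below, assuming PyTorch. The standalone `generate` function, its assumption that `model(idx)` returns only logits of shape (B, T, vocab_size), and the `dummy_model` stand-in are all hypothetical simplifications for illustration, not the code from the talk.

```python
import torch
import torch.nn.functional as F

block_size = 8  # the position embedding table only covers this many positions

def generate(model, idx, max_new_tokens):
    """Sample tokens one at a time, never feeding more than block_size tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]   # crop the context to the last block_size tokens
        logits = model(idx_cond)          # assumed shape (B, T, vocab_size)
        logits = logits[:, -1, :]         # keep only the last time step
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)             # (B, T + 1)
    return idx

def dummy_model(idx, vocab_size=10):
    """Hypothetical stand-in model that fails if the context is too long."""
    assert idx.shape[1] <= block_size
    return torch.zeros(idx.shape[0], idx.shape[1], vocab_size)

start = torch.zeros((1, 1), dtype=torch.long)
out = generate(dummy_model, start, max_new_tokens=20)
```

Without the crop, the 9th generated token would already ask the position table for an index it does not have.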
[81:25] so those are the changes and let's Now
[81:27] train the network okay so I also came up
[81:29] to the script here and I decreased the
[81:30] learning rate because uh the self
[81:32] attention can't tolerate very very high
[81:34] learning rates and then I also increased
[81:36] number of iterations because the
[81:37] learning rate is lower and then I
[81:39] trained it and previously we were only
[81:41] able to get to up to 2.5 and now we are
[81:43] down to 2.4 so we definitely see a
[81:46] little bit of an improvement from 2.5 to
[81:48] 2.4 roughly uh but the text is still not
[81:51] amazing so clearly the self attention
[81:53] head is doing some useful communication
[81:56] but um we still have a long way to go
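Pulling the pieces together, the single head module described above could be sketched as follows, assuming PyTorch. `n_embd = 32` and `block_size = 8` follow the talk; the scores are scaled by the square root of the head size, as stated in the scaled attention discussion, and the rest is a best-effort reconstruction rather than the exact code on screen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8

class Head(nn.Module):
    """One head of causal (decoder-style) self attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # tril is a buffer, not a parameter: it is stored on the module but not trained
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # scaled attention scores, (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        # the future does not communicate with the past
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # (B, T, head_size)

sa_head = Head(head_size=n_embd)
out = sa_head(torch.randn(4, 8, n_embd))
```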
[81:59] okay so now we've implemented the scaled
[82:01] dot product attention now next up in the
[82:02] Attention Is All You Need paper there's
[82:05] something called multi-head attention
[82:07] and what is multi-head attention it's
[82:09] just applying multiple attentions in
[82:11] parallel and concatenating their results
[82:13] so they have a little bit of diagram
[82:15] here I don't know if this is super clear
[82:18] it's really just multiple attentions in
[82:20] parallel so let's Implement that fairly
[82:23] straightforward
[82:25] if we want a multi-head attention then
[82:27] we want multiple heads of self attention
[82:28] running in parallel so in pytorch we can
[82:32] do this by simply creating multiple
[82:35] heads so however many
[82:38] heads you want and then what is the head
[82:39] size of each and then we run all of them
[82:43] in parallel into a list and simply
[82:46] concatenate all of the outputs and we're
[82:48] concatenating over the channel
[82:50] Dimension so the way this looks now is
[82:53] we don't have just a single attention
[82:56] that has a head size of 32 because
[82:59] remember n_embd is
[83:00] 32 instead of having one Communication
[83:03] channel we now have four communication
[83:06] channels in parallel and each one of
[83:08] these communication channels typically
[83:10] will be uh smaller uh correspondingly so
[83:14] because we have four communication
[83:15] channels we want eight dimensional self
[83:18] attention and so from each Communication
[83:20] channel we're going to gather eight
[83:22] dimensional vectors and then we have
[83:23] four of them and that concatenates to
[83:25] give us 32 which is the original
[83:28] n_embd and so this is kind of similar to
[83:30] um if you're familiar with convolutions
[83:32] this is kind of like a group convolution
[83:34] uh because basically instead of having
[83:36] one large convolution we do convolution
[83:38] in groups and uh that's multi-headed
[83:40] self
[83:41] attention and so then here we just use
[83:44] sa_heads self attention heads instead
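A compact sketch of the multi-head wrapper, assuming PyTorch. The `Head` class here is a condensed reconstruction of the single head so the block runs on its own; four heads of size 8 concatenating back to `n_embd = 32` follow the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8

class Head(nn.Module):
    """One head of causal self attention (condensed reconstruction)."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Several heads of self attention run side by side."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # concatenate the per-head outputs over the channel dimension
        return torch.cat([h(x) for h in self.heads], dim=-1)

sa_heads = MultiHeadAttention(num_heads=4, head_size=n_embd // 4)  # 4 heads of size 8
out = sa_heads(torch.randn(4, 8, n_embd))                          # (B, T, 32)
```

The heads are evaluated one after another in this list comprehension; they are parallel only in the sense that none depends on another's output.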
[83:47] now I actually ran it and uh scrolling
[83:51] down I ran the same thing and then we
[83:53] now get this down to 2.28 roughly and
[83:57] the output is still the generation is
[83:58] still not amazing but clearly the
[84:00] validation loss is improving because we
[84:02] were at 2.4 just now and so it helps to
[84:05] have multiple communication channels
[84:07] because obviously these tokens have a
[84:09] lot to talk about they want to find the
[84:11] consonants the vowels they want to find
[84:13] the vowels just from certain positions
[84:15] uh they want to find any kinds of
[84:17] different things and so it helps to
[84:19] create multiple independent channels of
[84:20] communication gather lots of different
[84:22] types of data and then uh decode the
[84:25] output now going back to the paper for a
[84:27] second of course I didn't explain this
[84:28] figure in full detail but we are
[84:30] starting to see some components of what
[84:32] we've already implemented we have the
[84:33] positional encodings the token encodings
[84:35] that add we have the masked multi-headed
[84:37] attention implemented now here's another
[84:41] multi-headed attention which is a cross
[84:42] attention to an encoder which we haven't
[84:45] we're not going to implement in this
[84:46] case I'm going to come back to that
[84:48] later but I want you to notice that
[84:50] there's a feed forward part here and
[84:52] then this is grouped into a block that
[84:53] gets repeated again and again now the
[84:56] feedforward part here is just a simple
[84:57] uh multi-layer perceptron
[85:00] um so the multi-headed so here position
[85:04] wise feed forward networks is just a
[85:06] simple little MLP so I want to start
[85:08] basically in a similar fashion also
[85:10] adding computation into the network and
[85:13] this computation is on a per node level
[85:16] so I've already implemented it and you
[85:18] can see the diff highlighted on the left
[85:20] here when I've added or changed things
[85:22] now before we had the self multi-headed
[85:25] self attention that did the
[85:26] communication but we went way too fast
[85:28] to calculate the logits so the tokens
[85:31] looked at each other but didn't really
[85:32] have a lot of time to think on what they
[85:35] found from the other tokens and so what
[85:38] I've implemented here is a little feed
[85:40] forward single layer and this little
[85:42] layer is just a linear followed by a ReLU
[85:45] nonlinearity and that's that's it so
[85:48] it's just a little layer and then I call
[85:50] it feed
[85:52] forward um and n_embd
[85:54] and then this feed forward is just
[85:56] called sequentially right after the self
[85:58] attention so we self attend then we feed
[86:01] forward and you'll notice that the feed
[86:02] forward here when it's applying linear
[86:04] this is on a per token level all the
[86:06] tokens do this independently so the self
[86:09] attention is the communication and then
[86:11] once they've gathered all the data now
[86:13] they need to think on that data
[86:15] individually and so that's what feed
[86:16] forward is doing and that's why I've
[86:18] added it here now when I train this the
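The feed-forward layer described above can be sketched roughly as follows. This is a minimal sketch, assuming PyTorch and the embedding size of 32 used at this point in the lecture; the class name `FeedForward`, the tensor shapes, and the independence check are illustrative choices, not the speaker's exact code.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """A linear layer followed by a ReLU, applied to every token on its own."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU())

    def forward(self, x):
        # x is (batch, time, channels); nn.Linear only mixes the last dimension,
        # so no information moves between time steps here.
        return self.net(x)

torch.manual_seed(0)
ffwd = FeedForward(n_embd=32)
x = torch.randn(4, 8, 32)
out = ffwd(x)

# Perturb only token 0 and confirm the outputs for tokens 1..7 are unchanged.
x2 = x.clone()
x2[:, 0, :] += 1.0
out2 = ffwd(x2)
same_elsewhere = torch.allclose(out[:, 1:], out2[:, 1:])
print(out.shape, same_elsewhere)
```

The check at the end is the "per token" property in action: attention is where tokens exchange information, and this layer only processes what each token already holds.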
[86:21] validation loss actually continues to go
[86:23] down now to 2.24 which is down from
[86:26] 2.28 uh the outputs still look kind of
[86:28] terrible but at least we've improved the
[86:31] situation and so as a preview we're
[86:34] going to now start to intersperse the
[86:37] communication with the computation and
[86:39] that's also what the Transformer does
[86:42] when it has blocks that communicate and
[86:44] then compute and it groups them and
[86:46] replicates them okay so let me show you
[86:49] what we'd like to do we'd like to do
[86:51] something like this we have a block and
[86:53] this block is basically this part
[86:55] here except for the cross
[86:57] attention now the block basically
[86:59] intersperses communication and then
[87:01] computation the
[87:03] communication is done using multi-headed
[87:05] self attention and then the
[87:07] computation is done using a feed forward
[87:08] Network on all the tokens
[87:11] independently now what I've added here
[87:14] also is you'll
[87:16] notice this takes the number of
[87:18] embeddings in the embedding Dimension
[87:19] and number of heads that we would like
[87:21] which is kind of like group size in
[87:22] group convolution and I'm saying
[87:24] that number of heads we'd like is four
[87:26] and so because this is 32 we calculate
[87:29] that because this is 32 and the number of
[87:31] heads is four um the head size
[87:34] should be eight so that everything sort
[87:36] of works out Channel wise um so this is
[87:39] how the Transformer structures uh sort
[87:41] of the uh the sizes typically so the
[87:44] head size will become eight and then
[87:45] this is how we want to intersperse them
[87:47] and then here I'm trying to create
[87:49] blocks which is just a sequential
[87:51] application of block block block so that
[87:53] we're interspersing communication feed
[87:55] forward many many times and then finally
[87:57] we decode now I actually tried to run
[88:01] this and the problem is this doesn't
[88:02] actually give a very good uh answer and
[88:05] very good result and the reason for that
[88:07] is we're starting to actually get
[88:09] like a pretty deep neural net and deep
[88:11] neural Nets uh suffer from optimization
[88:13] issues and I think that's what we're
[88:14] kind of like slightly starting to run
[88:16] into so we need one more idea that we
[88:18] can borrow from the um Transformer paper
[88:21] to resolve those difficulties now there
[88:23] are two optimizations that dramatically
[88:25] help with the depth of these networks
[88:27] and make sure that the networks remain
[88:29] optimizable let's talk about the first
[88:31] one the first one in this diagram is you
[88:33] see this Arrow here and then this arrow
[88:36] and this Arrow those are skip
[88:38] connections or sometimes called residual
[88:40] connections they come from this paper uh
[88:43] Deep Residual Learning for Image
[88:44] Recognition from about
[88:46] 2015 uh that introduced the concept now
[88:51] these are basically what it means is you
[88:53] transform data but then you have a skip
[88:55] connection with addition from the
[88:57] previous features now the way I like to
[89:00] visualize it uh that I prefer is the
[89:03] following here the computation happens
[89:05] from the top to bottom and basically you
[89:08] have this uh residual pathway and you
[89:11] are free to Fork off from the residual
[89:13] pathway perform some computation and
[89:15] then project back to the residual
[89:16] pathway via addition and so you go from
[89:19] the the uh inputs to the targets only
[89:22] via plus and plus plus and the reason
[89:25] this is useful is because during back
[89:27] propagation remember from our micrograd
[89:29] video earlier addition distributes
[89:32] gradients equally to both of its
[89:34] branches that fed as the input and
[89:37] so the supervision or the gradients from
[89:40] the loss basically hop through every
[89:43] addition node all the way to the input
[89:46] and then also Fork off into the residual
[89:50] blocks but basically you have this
[89:52] gradient Super Highway that goes
[89:53] directly from the supervision all the
[89:55] way to the input unimpeded and then
[89:58] these residual blocks are usually
[89:59] initialized in the beginning so they
[90:01] contribute very very little if anything
[90:03] to the residual pathway they they are
[90:05] initialized that way so in the beginning
[90:07] they are sort of almost kind of like not
[90:09] there but then during the optimization
[90:11] they come online over time and they uh
[90:14] start to contribute but at least at the
[90:17] initialization you can go directly from
[90:19] supervision to the input gradient is
[90:21] unimpeded and just flows and then the
[90:23] blocks over time
[90:24] kick in and so that dramatically helps
[90:27] with the optimization so let's implement
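The "gradient super highway" argument can be checked numerically with a toy scalar example. This is only a sketch in plain Python: the function names, the tanh branch, the depth of 6, and the finite-difference gradient are all assumptions made for illustration, not anything from the lecture's code.

```python
import math

def residual_stack(x, branch_scale, depth=6):
    """x -> x + branch(x), repeated: the output is reached only through additions."""
    for _ in range(depth):
        x = x + branch_scale * math.tanh(x)
    return x

def plain_stack(x, scale, depth=6):
    """x -> branch(x), repeated: no skip connection around each layer."""
    for _ in range(depth):
        x = scale * math.tanh(x)
    return x

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Branches that contribute nothing (scale 0), an idealized version of
# "initialized so they contribute very little to the residual pathway".
g_residual = numeric_grad(lambda x: residual_stack(x, branch_scale=0.0), 0.5)

# The same depth without skip connections and with small weights.
g_plain = numeric_grad(lambda x: plain_stack(x, scale=0.1), 0.5)

print(g_residual, g_plain)
```

With the branches switched off, the gradient from output to input is essentially 1 no matter how deep the stack is, while the plain stack multiplies a small factor at every layer and the gradient shrinks toward zero.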
[90:29] this so coming back to our block here
[90:31] basically what we want to do is we want
[90:33] to do x =
[90:35] x + self attention and x = x + self feed
[90:39] forward so this is X and then we Fork
[90:43] off and do some communication and come
[90:45] back and we Fork off and we do some
[90:46] computation and come back so those are
[90:49] residual connections and then swinging
[90:51] back up here we also have to introd use
[90:54] this projection so nn.Linear
[90:57] and uh this is going to be
[91:00] from after we concatenate this this is
[91:03] the size n_embd so this is the output
[91:05] of the self attention itself but then we
[91:08] actually want the uh to apply the
[91:11] projection and that's the
[91:13] result so the projection is just a
[91:15] linear transformation of the outcome of
[91:16] this
[91:17] layer so that's the projection back into
[91:20] the residual pathway and then here in a
[91:22] feed forward it's going to be the same
[91:23] thing I could have a self dot
[91:26] projection here as well but let me just
[91:28] simplify it and let me uh couple it
[91:32] inside the same sequential container and
[91:34] so this is the projection layer going
[91:36] back into the residual
[91:38] pathway and
[91:40] so that's uh well that's it so now we
[91:43] can train this so I implemented one more
[91:44] small change when you look into the
[91:47] paper again you see that the
[91:49] dimensionality of input and output is
[91:51] 512 for them and they're saying that the
[91:53] inner layer here in the feed forward has
[91:55] dimensionality of 2048 so there's a
[91:57] multiplier of four and so the inner
[92:00] layer of the feed forward Network should
[92:02] be multiplied by four in terms of
[92:04] Channel sizes so I came here and I
[92:06] multiplied four times n_embd here for the
[92:08] feed forward and then from four times
[92:10] n_embd coming back down to n_embd when we go
[92:13] back to the uh to the projection so
[92:15] adding a bit of computation here and
[92:17] growing that layer that is in the
[92:19] residual block on the side of the
[92:21] residual
[92:22] pathway and then I train this and we
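Putting the pieces from this part together (residual additions, a projection after the concatenated heads, and a feed-forward inner layer four times wider), a block might look roughly like this. It is a sketch assuming PyTorch; the module and attribute names, the scaling by head size, and the choice of three stacked blocks are assumptions for illustration rather than the speaker's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, n_head, block_size = 32, 4, 8  # sizes quoted in the lecture at this point

class Head(nn.Module):
    """One head of masked self-attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v                      # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)   # back into the residual pathway

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # inner layer is 4x wider
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # projection back down
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) then computation (feed forward), each with a skip connection."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x

torch.manual_seed(0)
blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(3)])
x = torch.randn(2, block_size, n_embd)
y = blocks(x)
print(y.shape)
```

Because every sublayer maps n_embd channels back to n_embd channels, the additions line up and blocks can be stacked as many times as desired.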
[92:24] actually get down all the way to uh 2.08
[92:27] validation loss and we also see that
[92:29] network is starting to get big enough
[92:30] that our train loss is getting ahead of
[92:32] validation loss so we're starting to see
[92:33] like a little bit of
[92:35] overfitting and um our
[92:38] um uh generations here are still not
[92:41] amazing but at least you see that we can
[92:42] see like is here this now grief syn like
[92:46] this starts to almost look like English
[92:48] so um yeah we're starting to really get
[92:50] there okay and the second Innovation
[92:52] that is very helpful for optimizing very
[92:54] deep neural networks is right here so we
[92:57] have this addition now that's the
[92:58] residual part but this Norm is referring
[93:00] to something called layer Norm so layer
[93:03] Norm is implemented in pytorch it's a
[93:04] paper that came out a while back here
[93:09] um and layer Norm is very very similar
[93:11] to batch norm so remember back to our
[93:14] makemore series part three we
[93:16] implemented batch
[93:17] normalization and uh batch normalization
[93:19] basically just made sure that um across
[93:22] the batch dimension any individual neuron
[93:25] had unit uh Gaussian um distribution so it
[93:30] was zero mean and unit standard
[93:32] deviation one standard deviation output
[93:35] so what I did here is I'm copy pasting
[93:37] the BatchNorm1d that we developed in our
[93:39] makemore series and see here we can
[93:42] initialize for example this module and
[93:44] we can have a batch of 32 100
[93:47] dimensional vectors feeding through the
[93:48] batch norm layer so what this does is it
[93:52] guarantees that when we look at just the
[93:54] zeroth column it's a zero mean one
[93:58] standard deviation so it's normalizing
[94:00] every single column of this uh input now
[94:04] the rows are not uh going to be
[94:06] normalized by default because we're just
[94:08] normalizing columns so let's now
[94:10] Implement layer Norm uh it's very
[94:12] complicated look we come here we change
[94:15] this from zero to one so we don't
[94:18] normalize The Columns we normalize the
[94:20] rows and now we've implemented layer
[94:23] Norm
[94:25] so now the columns are not going to be
[94:28] normalized um but the rows are going to
[94:31] be normalized for every individual
[94:33] example its 100 dimensional vector is
[94:35] normalized uh in this way and because
[94:38] our computation Now does not span across
[94:40] examples we can delete all of this
[94:43] buffers stuff uh because uh we can
[94:45] always apply this operation and don't
[94:48] need to maintain any running buffers so
[94:50] we don't need the
[94:52] buffers uh we
[94:54] don't There's no distinction between
[94:56] training and test
[94:58] time uh and we don't need these running
[95:00] buffers we do keep gamma and beta we
[95:03] don't need the momentum we don't care if
[95:05] it's training or not and this is now a
[95:08] layer
[95:09] norm and it normalizes the rows instead
[95:12] of the columns and this here is
[95:15] identical to basically this here so
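The stripped-down layer norm described here, with no running buffers, no momentum, and no train/test distinction, could look roughly like the following. This is a sketch assuming PyTorch; the class name `LayerNorm1d` and the use of the biased variance (chosen so the result can be compared against PyTorch's built-in layer norm) are assumptions, not necessarily what the speaker's class does.

```python
import torch
import torch.nn.functional as F

class LayerNorm1d:
    """Normalize each row (one example) instead of each column (one feature)."""

    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)   # learnable scale, kept from batch norm
        self.beta = torch.zeros(dim)   # learnable shift, kept from batch norm

    def __call__(self, x):
        xmean = x.mean(1, keepdim=True)                 # dim 1: statistics per row
        xvar = x.var(1, keepdim=True, unbiased=False)   # biased estimate (assumption)
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        return self.gamma * xhat + self.beta

    def parameters(self):
        return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100)   # batch of 32 vectors, 100 dimensions each
out = module(x)

row_mean = out[0, :].mean()                  # close to 0
row_std = out[0, :].std(unbiased=False)      # close to 1
col_mean = out[:, 0].mean()                  # columns are no longer normalized
matches_builtin = torch.allclose(out, F.layer_norm(x, (100,)), atol=1e-5)
print(row_mean.item(), row_std.item(), col_mean.item(), matches_builtin)
```

Since nothing here depends on the other examples in the batch, the same computation applies at training and at test time.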
[95:19] let's now Implement layer Norm in our
[95:21] Transformer before I incorporate the
[95:23] layer Norm I just wanted to note that as
[95:25] I said very few details about the
[95:27] Transformer have changed in the last 5
[95:28] years but this is actually something
[95:30] that slightly departs from the original
[95:31] paper you see that the Add & Norm is
[95:34] applied after the
[95:36] transformation but um now it is a bit
[95:40] more uh basically common to apply the
[95:42] layer Norm before the transformation so
[95:44] there's a reshuffling of the layer Norms
[95:46] uh so this is called the pre-norm
[95:48] formulation and that's the one that
[95:49] we're going to implement as well so
[95:50] slight deviation from the original paper
[95:53] basically we need two layer Norms layer
[95:55] Norm one is uh nn.LayerNorm and we
[95:59] tell it how many um what is the
[96:01] embedding Dimension and we need the
[96:03] second layer norm and then here the
[96:06] layer Norms are applied immediately on X
[96:09] so self. layer Norm one applied on X and
[96:13] self. layer Norm two applied on X before
[96:15] it goes into self attention and feed
[96:18] forward and uh the size of the layer
[96:20] Norm here is n_embd so 32 so when the
[96:23] layer Norm is normalizing our features
[96:26] it is uh the normalization here uh
[96:30] happens the mean and the variance are
[96:32] taken over 32 numbers so the batch and
[96:34] the time act as batch Dimensions both of
[96:37] them so this is kind of like a per token
[96:40] um transformation that just normalizes
[96:42] the features and makes them a unit mean
[96:46] uh unit Gaussian at
[96:48] initialization but of course because
[96:50] these layer Norms inside it have these
[96:52] gamma and beta trainable
[96:54] parameters uh the layer Norm will uh
[96:57] eventually create outputs that might not
[96:59] be unit Gaussian but the optimization will
[97:01] determine that so for now this is the uh
[97:05] this is incorporating the layer norms
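The pre-norm wiring can be sketched as below, assuming PyTorch. To keep the example short, the attention and feed-forward sublayers are replaced by plain `nn.Linear` stand-ins, which is purely a placeholder assumption; the class name `PreNormBlock` and the attribute names are also illustrative.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Layer norm is applied to x before each sublayer, then the result is added back."""

    def __init__(self, n_embd, sa, ffwd):
        super().__init__()
        self.sa = sa        # communication sublayer (stand-in here)
        self.ffwd = ffwd    # computation sublayer (stand-in here)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

torch.manual_seed(0)
n_embd = 32
x = torch.randn(4, 8, n_embd) * 5 + 3      # (batch, time, channels), deliberately off-scale

# nn.LayerNorm(n_embd) normalizes over the last dimension only, so batch and
# time both act as batch dimensions: every token is normalized on its own.
ln = nn.LayerNorm(n_embd)
h = ln(x)
token_means = h.mean(dim=-1)                  # shape (4, 8), all near 0
token_stds = h.std(dim=-1, unbiased=False)    # shape (4, 8), all near 1

# Hypothetical stand-ins for the real attention and feed-forward modules.
block = PreNormBlock(n_embd, nn.Linear(n_embd, n_embd), nn.Linear(n_embd, n_embd))
y = block(x)
print(token_means.abs().max().item(), token_stds.mean().item(), y.shape)
```

In the original paper's arrangement the norm would instead wrap the sum, roughly `ln(x + sublayer(x))`; the version above leaves the residual pathway itself untouched by normalization.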
[97:06] and let's train them okay so I let it
[97:09] run and we see that we get down to 2.06
[97:12] which is better than the previous 2.08
[97:14] so a slight Improvement by adding the
[97:15] layer norms and I'd expect that they
[97:17] help uh even more if we had bigger and
[97:19] deeper Network one more thing I forgot
[97:21] to add is that there should be a layer
[97:23] Norm here also typically at the end
[97:26] of the Transformer and right before the
[97:28] final uh linear layer that decodes into
[97:31] vocabulary so I added that as well so at
[97:35] this stage we actually have a pretty
[97:36] complete uh Transformer according to the
[97:38] original paper and it's a decoder only
[97:40] Transformer I'll I'll talk about that in
[97:42] a second uh but at this stage uh the
[97:44] major pieces are in place so we can try
[97:46] to scale this up and see how well we can
[97:47] push this number now in order to scale
[97:50] out the model I had to perform some
[97:51] cosmetic changes here to make it nicer
[97:54] so I introduced this variable called
[97:56] n_layer which just specifies how many
[97:57] layers of the blocks we're going to have
[98:01] I created a bunch of blocks and we have
[98:02] a new variable number of heads as well I
[98:05] pulled out the layer Norm here and uh so
[98:07] this is identical now one thing that I
[98:10] did briefly change is I added a Dropout
[98:13] so Dropout is something that you can add
[98:15] right before the residual connection
[98:17] back right before the connection back
[98:19] into the residual pathway so we can drop
[98:22] out that as the last layer here we can drop out
[98:26] uh here at the end of the multi-headed
[98:27] attention as well and we can also drop
[98:30] out here uh when we calculate the um
[98:34] basically affinities and after the
[98:36] softmax we can drop out some of those so
[98:38] we can randomly prevent some of the
[98:40] nodes from
[98:41] communicating and so Dropout uh comes
[98:43] from this paper from 2014 or so and
[98:49] basically it takes your neural
[98:50] net and it randomly every forward
[98:53] backward pass shuts off some subset of
[98:56] uh neurons so randomly drops them to
[98:59] zero and trains without them and what
[99:02] this does effectively is because the
[99:04] mask of what's being dropped out is
[99:06] changed every single forward backward
[99:07] pass it ends up kind of uh training an
[99:11] ensemble of sub networks and then at
[99:13] test time everything is fully enabled
[99:15] and kind of all of those sub networks
[99:16] are merged into a single Ensemble if you
[99:18] want to think about it that
[99:20] way so I would read the paper to get the
[99:22] full detail for now we're just going to
[99:24] stay on the level of this is a
[99:25] regularization technique and I added it
[99:28] because I'm about to scale up the model
[99:30] quite a bit and I was concerned about
[99:32] overfitting so now when we scroll up to
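The mechanics of dropout can be illustrated in a few lines of plain Python. This is a sketch under assumptions: the function name, the explicit `training` flag, and the rescaling of surviving values by 1/(1-p) (the "inverted dropout" convention, which is also what PyTorch's `nn.Dropout` applies during training) are details added here and not spelled out in the lecture.

```python
import random

def dropout(values, p, training, rng):
    """Zero each value with probability p during training and rescale the survivors.

    At test time (training=False) everything is passed through unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
acts = [1.0] * 10_000   # pretend activations

train_out = dropout(acts, p=0.2, training=True, rng=rng)
eval_out = dropout(acts, p=0.2, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(dropped_fraction, train_mean)
```

A fresh random mask is drawn on every call, which is what makes each forward pass train a different sub-network; the rescaling keeps the expected activation the same between training and test time.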
[99:34] the top uh we'll see that I changed a
[99:36] number of hyper parameters here about
[99:38] our neural net so I made the batch size
[99:40] be much larger now it's 64 I changed the
[99:43] block size to be 256 so previously it
[99:46] was just eight eight characters of
[99:47] context now it is 256 characters of
[99:50] context to predict the 257th
[99:54] uh I brought down the learning rate a
[99:55] little bit because the neural net is now
[99:57] much bigger so I brought down the
[99:58] learning rate the embedding Dimension is
[100:01] now 384 and there are six heads so 384
[100:05] divide 6 means that every head is 64
[100:08] dimensional as is standard and then
[100:11] there's going to be six layers of that
[100:13] and the Dropout will be at 0.2 so every
[100:15] forward backward pass 20% of all of
[100:18] these um intermediate calculations are
[100:21] disabled and dropped to zero
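Written out as a configuration, the scaled-up hyperparameters and the head-size arithmetic look roughly like this. The variable names are assumptions chosen for illustration, and the learning rate is left out because its value is not stated in this passage.

```python
# Scaled-up settings quoted in the lecture (names are illustrative).
batch_size = 64    # sequences processed in parallel
block_size = 256   # characters of context used to predict the next one
n_embd = 384       # embedding dimension
n_head = 6         # attention heads per block
n_layer = 6        # number of stacked blocks
dropout = 0.2      # fraction of intermediate values zeroed per training pass

def head_size(n_embd, n_head):
    """Each head gets an equal slice of the embedding dimension."""
    assert n_embd % n_head == 0, "embedding size must divide evenly across heads"
    return n_embd // n_head

big_head = head_size(n_embd, n_head)   # 384 / 6
small_head = head_size(32, 4)          # the earlier small model: 32 / 4
tokens_per_step = batch_size * block_size

print(big_head, small_head, tokens_per_step)
```

The divisibility check reflects why the sizes have to "work out channel wise": the concatenated heads must add up to exactly n_embd channels.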
[100:24] and then I already trained this and I
[100:25] ran it so uh drum roll how well does it
[100:28] perform so let me just scroll up
[100:31] here we get a validation loss of
[100:34] 1.48 which is actually quite a bit of an
[100:37] improvement on what we had before which
[100:38] I think was 2.07 so it went from 2.07
[100:41] all the way down to 1.48 just by scaling
[100:43] up this neural net with the code that we
[100:45] have and this of course ran for a lot
[100:47] longer this maybe trained for I want to
[100:49] say about 15 minutes on my A100 GPU so
[100:52] that's a pretty good GPU and if you don't
[100:54] have a GPU you're not going to be able
[100:56] to reproduce this uh on a CPU this would
[100:59] be um I would not run this on a CPU or
[101:01] MacBook or something like that you'll
[101:03] have to bring down the number of uh
[101:04] layers and the embedding Dimension and
[101:06] so on uh but in about 15 minutes we can
[101:09] get this kind of a result and um I'm
[101:12] printing some of the Shakespeare here
[101:15] but what I did also is I printed 10,000
[101:17] characters so a lot more and I wrote
[101:18] them to a file and so here we see some
[101:21] of the outputs
[101:24] so it's a lot more recognizable as the
[101:26] input text file so the input text file
[101:29] just for reference looked like this so
[101:31] there's always like someone speaking in
[101:33] this manner and uh our predictions now
[101:37] take on that form except of course
[101:40] they're they're nonsensical when you
[101:41] actually read them
[101:43] so it is every crimp tap be a house oh
[101:47] those
[101:48] prepation we give
[101:51] heed um you know
[101:56] Oho sent me you mighty
[101:59] Lord anyway so you can read through this
[102:02] um it's nonsensical of course but this
[102:04] is just a Transformer trained on a
[102:06] character level for 1 million characters
[102:09] that come from Shakespeare so it
[102:10] sort of like blabbers on in Shakespeare
[102:12] like manner but it doesn't of course
[102:14] make sense at this scale uh but I think
[102:18] I think still a pretty good
[102:19] demonstration of what's
[102:20] possible so now
[102:24] I think uh that kind of like concludes
[102:26] the programming section of this video we
[102:28] basically kind of uh did a pretty good
[102:30] job and um of implementing this
[102:32] Transformer uh but the picture doesn't
[102:35] exactly match up to what we've done so
[102:37] what's going on with all these additional
[102:38] Parts here so let me finish explaining
[102:41] this architecture and why it looks so
[102:43] funky basically what's happening here is
[102:45] what we implemented here is a decoder
[102:47] only Transformer so there's no component
[102:50] here this part is called the encoder and
[102:52] there's no cross attention block here
[102:55] our block only has a self attention and
[102:58] the feed forward so it is missing this
[103:00] third in between piece here this piece
[103:03] does cross attention so we don't have it
[103:05] and we don't have the encoder we just
[103:07] have the decoder and the reason we have
[103:08] a decoder only uh is because we are just
[103:12] uh generating text and it's
[103:13] unconditioned on anything we're just
[103:15] we're just blabbering on according to a
[103:16] given data set what makes it a decoder
[103:19] is that we are using the Triangular mask
[103:21] in our uh Transformer so it has this
[103:24] Auto regressive property where we can
[103:26] just uh go and sample from it so the
[103:28] fact that it's using the
[103:30] triangular mask to mask out the
[103:32] attention makes it a decoder and it can
[103:34] be used for language modeling now the
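What the triangular mask does to the attention weights can be shown with a small plain-Python example. This is a sketch: the function name, the `causal` flag, and the all-zero scores are assumptions made so that the resulting weights are easy to read.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a T x T score matrix.

    With causal=True, position t may only look at positions 0..t: everything
    above the diagonal is set to -inf before the softmax, so it gets weight 0.
    """
    T = len(scores)
    weights = []
    for t in range(T):
        row = [scores[t][s] if (not causal or s <= t) else float("-inf")
               for s in range(T)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]   # equal affinities, for readability

decoder_w = attention_weights(scores, causal=True)    # lower-triangular weights
encoder_w = attention_weights(scores, causal=False)   # every token sees every token

for row in decoder_w:
    print([round(w, 2) for w in row])
```

Because no position ever receives information from its future, the model can be sampled one token at a time, which is the autoregressive property the mask provides.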
[103:37] reason that the original paper had an
[103:39] encoder decoder architecture is because
[103:41] it is a machine translation paper so it
[103:43] is concerned with a different setting in
[103:45] particular it expects some uh tokens
[103:49] that encode say for example French and
[103:52] then it is expecting to decode the
[103:54] translation in English so you
[103:56] typically these here are special tokens
[103:59] so you are expected to read in this and
[104:02] condition on it and then you start off
[104:04] the generation with a special token
[104:05] called start so this is a special new
[104:08] token um that you introduce and always
[104:10] place in the beginning and then the
[104:12] network is expected to Output neural
[104:15] networks are awesome and then a special
[104:17] end token to finish the
[104:20] generation so this part here will be
[104:23] decoded exactly as we've done it
[104:25] neural networks are awesome will be
[104:27] identical to what we did but unlike what
[104:29] we did they want to condition the
[104:32] generation on some additional
[104:34] information and in that case this
[104:36] additional information is the French
[104:38] sentence that they should be
[104:39] translating so what they do now is they
[104:42] bring in the encoder now the encoder
[104:45] reads this part here so we're only going
[104:48] to take the part of French and we're
[104:50] going to uh create tokens from it
[104:52] exactly as we've seen in our video and
[104:54] we're going to put a Transformer on it
[104:57] but there's going to be no triangular
[104:58] mask and so all the tokens are allowed
[105:00] to talk to each other as much as they
[105:02] want and they're just encoding
[105:04] whatever's the content of this French uh
[105:07] sentence once they've encoded it they
[105:10] they basically come out in the top here
[105:13] and then what happens here is in our
[105:14] decoder which does the uh language
[105:17] modeling there's an additional
[105:20] connection here to the outputs of the
[105:22] encoder
[105:23] and that is brought in through a cross
[105:26] attention so the queries are still
[105:28] generated from X but now the keys and
[105:30] the values are coming from the side the
[105:32] keys and the values are coming from the
[105:34] top generated by the nodes that came
[105:36] out of the encoder and those
[105:40] tops the keys and the values there the
[105:42] top of it feed in on a side into every
[105:45] single block of the decoder and so
[105:47] that's why there's an additional cross
[105:49] attention and really what it's doing is
[105:51] it's conditioning the decoding
[105:53] not just on the past of this current
[105:55] decoding but also on having seen the
[105:59] full fully encoded French um prompt sort
[106:04] of and so it's an encoder decoder model
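A single cross-attention head, with queries coming from the decoder and keys and values coming from the encoder output, can be sketched as follows. It assumes PyTorch; the class name `CrossAttentionHead`, the argument name `enc_out`, and the sequence lengths are hypothetical, since the lecture does not implement this part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionHead(nn.Module):
    """Queries from the decoder stream x; keys and values from the encoder output."""

    def __init__(self, n_embd, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x, enc_out):
        q = self.query(x)         # (B, T_dec, head_size)
        k = self.key(enc_out)     # (B, T_enc, head_size)
        v = self.value(enc_out)   # (B, T_enc, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T_dec, T_enc)
        # No triangular mask: every decoder position may read the whole source sentence.
        wei = F.softmax(wei, dim=-1)
        return wei @ v            # (B, T_dec, head_size)

torch.manual_seed(0)
head = CrossAttentionHead(n_embd=32, head_size=16)
x = torch.randn(2, 5, 32)         # decoder side: 5 target tokens so far
enc_out = torch.randn(2, 7, 32)   # encoder side: 7 encoded source tokens
out = head(x, enc_out)
print(out.shape)
```

The decoder and encoder sequences may have different lengths; the attention matrix is simply rectangular, with one row per decoder position and one column per encoder position.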
[106:06] which is why we have those two
[106:07] Transformers an additional block and so
[106:09] on so we did not do this because we have
[106:12] no we have nothing to encode there's no
[106:13] conditioning we just have a text file
[106:15] and we just want to imitate it and
[106:16] that's why we are using a decoder only
[106:19] Transformer exactly as done in
[106:21] GPT okay okay so now I wanted to do a
[106:24] very brief walkthrough of nanoGPT
[106:26] which you can find in my GitHub and uh
[106:28] nanoGPT is basically two files of
[106:30] Interest there's train.py and model.py
[106:33] train.py is all the boilerplate code for
[106:35] training the network it is basically all
[106:38] the stuff that we had here it's the
[106:40] training loop it's just that it's a lot
[106:42] more complicated because we're saving
[106:44] and loading checkpoints and pre-trained
[106:46] weights and we are uh decaying the
[106:48] learning rate and compiling the model
[106:50] and using distributed training across
[106:51] multiple nodes or GPUs so the train.py
[106:54] gets a little bit more hairy
[106:56] complicated uh there's more options Etc
[106:59] but the model.py should look very very
[107:01] um similar to what we've done here in
[107:04] fact the model is is almost identical so
[107:08] first here we have the causal self
[107:09] attention block and all of this should
[107:11] look very very recognizable to you we're
[107:13] producing queries Keys values we're
[107:16] doing Dot products we're masking
[107:18] applying softmax optionally dropping
[107:20] out and here we are pooling the
[107:23] values what is different here is that in
[107:25] our code I have separated out the
[107:30] multi-headed attention into just a
[107:31] single individual head and then here I
[107:34] have multiple heads and I explicitly
[107:36] concatenate them whereas here uh all of
[107:39] it is implemented in a batched manner
[107:41] inside a single causal self attention
[107:43] and so we don't just have a b and a T
[107:45] and A C Dimension we also end up with a
[107:47] fourth dimension which is the heads and
[107:50] so it just gets a lot more sort of hairy
[107:52] because we have four dimensional array
[107:54] um tensors now but it is um equivalent
[107:57] mathematically so the exact same thing
[107:59] is happening as what we have it's just
[108:01] it's a bit more efficient because all
[108:02] the heads are now treated as a batch
[108:04] Dimension as
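The equivalence between looping over separate heads and treating the heads as an extra batch dimension can be checked directly. This is a sketch assuming PyTorch, with made-up sizes and random tensors standing in for the query, key, and value projections; it is not nanoGPT's actual code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, n_head = 2, 8, 32, 4
hs = C // n_head                                   # head size
q, k, v = (torch.randn(B, T, C) for _ in range(3)) # stand-ins for projected q, k, v
mask = torch.tril(torch.ones(T, T)) == 0           # True above the diagonal

def attend(q, k, v):
    """Masked attention over the last two dims; any leading dims act as batch dims."""
    wei = q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5
    wei = wei.masked_fill(mask, float("-inf"))
    return F.softmax(wei, dim=-1) @ v

# Version 1: one head at a time on channel slices, then concatenate.
looped = torch.cat(
    [attend(q[..., i * hs:(i + 1) * hs],
            k[..., i * hs:(i + 1) * hs],
            v[..., i * hs:(i + 1) * hs]) for i in range(n_head)],
    dim=-1,
)

# Version 2: reshape to (B, n_head, T, hs) so the heads become a batch dimension.
def split_heads(t):
    return t.view(B, T, n_head, hs).transpose(1, 2)

batched = attend(split_heads(q), split_heads(k), split_heads(v))   # (B, n_head, T, hs)
batched = batched.transpose(1, 2).contiguous().view(B, T, C)       # back to (B, T, C)

print(torch.allclose(looped, batched, atol=1e-6))
```

The batched form does the same arithmetic in one set of tensor operations instead of a Python loop, which is where the efficiency gain comes from.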
[108:05] well then we have the multi-layer perceptron
[108:08] it's using the GELU nonlinearity which
[108:10] is defined here instead of ReLU and
[108:13] this is done just because OpenAI used
[108:14] it and I want to be able to load their
[108:17] checkpoints uh the blocks of the
[108:19] Transformer are identical the communicate
[108:21] and the compute phase as we saw and then
[108:23] the GPT will be identical we have the
[108:25] position encodings token encodings the
[108:27] blocks the layer Norm at the end uh the
[108:30] final linear layer and this should look
[108:33] all very recognizable and there's a bit
[108:35] more here because I'm loading
[108:36] checkpoints and stuff like that I'm
[108:38] separating out the parameters into those
[108:40] that should be weight decayed and those
[108:42] that
[108:42] shouldn't um but the generate function
[108:44] should also be very very similar so a
[108:47] few details are different but you should
[108:48] definitely be able to look at this uh
[108:51] file and be able to understand all
[108:52] the pieces now so let's now bring things
[108:55] back to chat GPT what would it look like
[108:57] if we wanted to train chat GPT ourselves
[108:59] and how does it relate to what we
[109:00] learned today well to train chat GPT
[109:03] there are roughly two stages first is
[109:05] the pre-training stage and then the
[109:07] fine-tuning stage in the pre-training
[109:09] stage uh we are training on a large
[109:12] chunk of internet and just trying to get
[109:14] a first decoder only Transformer to
[109:17] babble text so it's very very similar to
[109:20] what we've done ourselves except we've
[109:23] done like a tiny little baby
[109:24] pre-training step um and so in our case
[109:28] uh this is how you print a number of
[109:30] parameters I printed it and it's about
[109:32] 10 million so this Transformer that I
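The "about 10 million" figure can be cross-checked by counting the weights by hand. This is a sketch under stated assumptions: a vocabulary of 65 characters (taken from earlier in the lecture, not from this passage), bias-free key/query/value layers, biases on the other linear layers, and an output layer that does not share weights with the token embedding. With a PyTorch model in hand, the usual idiom is to sum `p.numel()` over `model.parameters()`.

```python
def count_parameters(vocab_size, block_size, n_embd, n_head, n_layer):
    """Parameter count for a decoder-only Transformer laid out as in this lecture."""
    head_size = n_embd // n_head

    tok_emb = vocab_size * n_embd                  # token embedding table
    pos_emb = block_size * n_embd                  # position embedding table

    attn_qkv = n_head * 3 * n_embd * head_size     # key, query, value (no bias)
    attn_proj = n_embd * n_embd + n_embd           # projection after concatenation
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)                 # two layer norms, gamma and beta each
    per_block = attn_qkv + attn_proj + ffwd + layer_norms

    final_ln = 2 * n_embd                          # layer norm before the output layer
    lm_head = n_embd * vocab_size + vocab_size     # final linear layer to the vocabulary

    return tok_emb + pos_emb + n_layer * per_block + final_ln + lm_head

total = count_parameters(vocab_size=65, block_size=256, n_embd=384, n_head=6, n_layer=6)
print(f"{total / 1e6:.2f}M parameters")
```

Nearly all of the weights sit inside the six blocks, and within each block the feed-forward layers account for roughly two thirds of the total.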
[109:35] created here to create little
[109:37] Shakespeare um Transformer was about 10
[109:40] million parameters our data set is
[109:42] roughly 1 million uh characters so
[109:45] roughly 1 million tokens but you have to
[109:47] remember that OpenAI uses a different
[109:48] vocabulary they're not on the Character
[109:50] level they use these um subword chunks
[109:53] of words and so they have a vocabulary
[109:55] of 50,000 roughly elements and so their
[109:58] sequences are a bit more condensed so
[110:01] our data set the Shakespeare data set
[110:03] would be probably around 300,000 uh
[110:05] tokens in the OpenAI vocabulary roughly
[110:09] so we trained about 10 million parameter
[110:11] model on roughly 300,000 tokens now when
[110:14] you go to the GPT-3
[110:16] paper and you look at the Transformers
[110:20] that they trained they trained a number
[110:22] of Transformers of different sizes
[110:24] but the biggest Transformer here has 175
[110:27] billion parameters uh so ours is again
[110:29] 10 million they used this number of
[110:31] layers in the Transformer this is the
[110:34] n_embd this is the number of heads and
[110:36] this is the head size and then this is
[110:39] the batch size uh so ours was
[110:43] 64 and the learning rate is similar now
[110:46] when they train this Transformer they
[110:47] trained on 300 billion tokens so again
[110:51] remember ours is about 300,000
[110:53] so this is uh about a millionfold
[110:56] increase and this number would not be
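The scale comparison is simple arithmetic on the round numbers quoted in the lecture. The variable names are illustrative, and all figures are the speaker's rough estimates rather than exact counts.

```python
# Round numbers quoted in the lecture.
our_chars = 1_000_000           # characters in the Shakespeare data set
our_tokens = 300_000            # the same text in a ~50,000-entry subword vocabulary
our_params = 10_000_000         # our model

gpt3_tokens = 300_000_000_000   # tokens GPT-3 was trained on
gpt3_params = 175_000_000_000   # largest GPT-3 model

chars_per_token = our_chars / our_tokens     # how much a subword token condenses text
token_ratio = gpt3_tokens // our_tokens      # data scale-up
param_ratio = gpt3_params // our_params      # model scale-up

print(f"{chars_per_token:.1f} characters per token")
print(f"{token_ratio:,}x more training tokens")
print(f"{param_ratio:,}x more parameters")
```

So the data grows by about a factor of a million while the model grows by a factor of about 17,500.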
[110:57] even that large by today's standards
[110:59] you'd be going up uh 1 trillion and
[111:01] above so they are training a
[111:04] significantly larger
[111:06] model on uh a good chunk of the internet
[111:10] and that is the pre-training stage but
[111:12] otherwise these hyper parameters should
[111:13] be fairly recognizable to you and the
[111:15] architecture is actually like nearly
[111:17] identical to what we implemented
[111:18] ourselves but of course it's a massive
[111:20] infrastructure challenge to train this
[111:22] you're talking about typically thousands
[111:24] of gpus having to you know talk to each
[111:27] other to train models of this size so
[111:29] that's just a pre-training stage now
[111:32] after you complete the pre-training
[111:33] stage uh you don't get something that
[111:35] responds to your questions with answers
[111:38] and it's not helpful etc you get a
[111:40] document
[111:41] completer right so it babbles but it
[111:44] doesn't Babble Shakespeare it babbles
[111:46] internet it will create arbitrary news
[111:48] articles and documents and it will try
[111:50] to complete documents because that's
[111:51] what it's trained for it's trying to
[111:52] complete the sequence so when you give
[111:54] it a question it would just uh
[111:56] potentially just give you more questions
[111:58] it would follow with more questions it
[112:00] will do whatever it looks like some
[112:02] close document would do in the training
[112:05] data on the internet and so who knows
[112:07] you're getting kind of like undefined
[112:08] Behavior it might basically answer
[112:11] questions with other questions it
[112:13] might ignore your question it might just
[112:15] try to complete some news article it's
[112:17] totally unaligned as we say so the second
[112:20] fine-tuning stage is to actually align
[112:22] it to be an assistant and uh this is the
[112:25] second stage and so this chat GPT blog
[112:28] post from OpenAI talks a little bit about
[112:30] how the stage is achieved we basically
[112:34] um there's roughly three steps to
[112:36] this stage uh so what they do here is
[112:39] they start to collect training data that
[112:41] looks specifically like what an
[112:42] assistant would do so these are
[112:44] documents that have the format where
[112:46] question is on top and then an answer is
[112:47] below and they have a large number of
[112:50] these but probably not on the order of
[112:51] the internet uh this is probably on the
[112:53] order of maybe thousands of examples and so
[112:58] they then fine-tune the model to
[113:00] basically only focus on documents that
[113:03] look like that and so you're starting to
[113:05] slowly align it so it's going to expect
[113:07] a question at the top and it's going to
[113:08] expect to complete the answer and uh
[113:11] these very very large models are very
[113:13] sample efficient during their
[113:14] fine-tuning so this actually somehow
[113:16] works but that's just step one that's
[113:19] just fine tuning so then they actually
[113:20] have more steps where okay the second
[113:23] step is you let the model respond and
[113:25] then different raters look at the
[113:27] different responses and rank them for
[113:29] their preference as to which one is
[113:30] better than the other they use that to
[113:32] train a reward model so they can predict
[113:35] uh basically using a different network
[113:37] how desirable any candidate
[113:39] response would be and then
[113:43] once they have a reward model they run
[113:45] PPO which is a form of policy
[113:47] gradient um reinforcement learning
[113:49] Optimizer to uh fine-tune this sampling
[113:53] policy uh so that the answers that the
[113:55] ChatGPT now generates are expected
[113:59] to score a high reward according to the
[114:02] reward model and so basically there's a
[114:04] whole aligning stage here or fine-tuning
[114:07] stage it's got multiple steps in between
[114:09] there as well and it takes the model
[114:11] from being a document completer to a
[114:14] question answerer and that's like a
[114:16] whole separate stage a lot of this data
[114:19] is not available publicly it is internal
[114:21] to OpenAI and uh it's much harder to
[114:24] replicate this stage um and so that's
[114:27] roughly what would give you a chat GPT
[114:29] and nanoGPT focuses on the
[114:31] pre-training stage okay and that's
[114:32] everything that I wanted to cover today
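A minimal sketch of the reward-model step described above, where raters rank several responses and the rankings are used to train a reward model. The lecture does not say which loss is used; the pairwise logistic loss below is one common choice and is an assumption here, as are the function names and the toy reward values. It is plain Python with no training loop, just the quantity a reward model would be trained to minimize.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn one rater's ranking (best first) into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the one the rater ranked higher.
    return list(combinations(ranked_responses, 2))


def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed in a stable way."""
    diff = reward_preferred - reward_rejected
    # softplus(-diff) is the same value as -log(sigmoid(diff))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical example: three responses to one prompt, ranked best first,
# with made-up scalar rewards standing in for a reward network's outputs.
ranked = ["helpful answer", "vague answer", "off-topic answer"]
toy_reward = {"helpful answer": 2.0, "vague answer": 0.5, "off-topic answer": -1.0}

pairs = ranking_to_pairs(ranked)
mean_loss = sum(
    pairwise_ranking_loss(toy_reward[a], toy_reward[b]) for a, b in pairs
) / len(pairs)
print(f"{len(pairs)} pairs, mean loss {mean_loss:.4f}")
```

The loss is small when the reward model scores the preferred response higher than the rejected one, and large when the order is reversed. The PPO step then fine-tunes the sampling policy so that its responses score highly under this learned reward.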
[114:35] so to summarize we trained a decoder
[114:38] only Transformer following this famous
[114:41] paper Attention Is All You Need from
[114:43] 2017 and so that's basically a GPT we
[114:47] trained it on Tiny Shakespeare and got
[114:50] sensible results
[114:52] all of the training code is
[114:54] roughly 200 lines of code I will be
[114:57] releasing this um code base so also it
[115:01] comes with all the git log commits along
[115:04] the way as we built it
[115:05] up in addition to this code I'm going to
[115:08] release the um notebook of course the
[115:10] Google Colab and I hope that gave you a
[115:13] sense for how you can train um these
[115:16] models like say GPT-3 that will be um
[115:19] architecturally basically identical to
[115:20] what we have but they are somewhere
[115:22] between 10,000 and 1 million times
[115:24] bigger depending on how you count and so
[115:27] uh that's all I have for now uh we did
[115:30] not talk about any of the fine-tuning
[115:32] stages that would typically go on top of
[115:33] this so if you're interested in
[115:35] something that's not just language
[115:36] modeling but you actually want to you
[115:38] know say perform tasks um or you want
[115:40] them to be aligned in a specific way or
[115:43] you want um to detect sentiment or
[115:45] anything like that basically anytime you
[115:47] don't want something that's just a
[115:48] document completer you have to complete
[115:50] further stages of fine-tuning which we did
[115:52] not cover uh and that could be simple
[115:55] supervised fine tuning or it can be
[115:57] something more fancy like we see in
[115:58] ChatGPT where we actually train a reward
[116:00] model and then do rounds of PPO to uh
[116:03] align it with respect to the reward
[116:04] model so there's a lot more that can be
[116:06] done on top of it I think for now we're
[116:08] starting to get to about the two-hour mark
[116:10] uh so I'm going to um kind of finish
[116:13] here uh I hope you enjoyed the lecture
[116:15] uh and uh yeah go forth and transform
[116:18] see you later
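A rough back-of-the-envelope check of the "between 10,000 and 1 million times bigger depending on how you count" remark. The figures are assumptions not stated in this part of the lecture: roughly 10 million parameters and roughly 300,000 training tokens for the Tiny Shakespeare model, and the published 175 billion parameters and 300 billion training tokens for GPT-3.

```python
# Assumed sizes (see the note above); all values are approximate.
our_params = 10_000_000            # ~10M parameters in the lecture's model
our_tokens = 300_000               # ~300K tokens in Tiny Shakespeare
gpt3_params = 175_000_000_000      # 175B parameters
gpt3_tokens = 300_000_000_000      # 300B training tokens

param_ratio = gpt3_params / our_params   # counting by model size
token_ratio = gpt3_tokens / our_tokens   # counting by training data

print(f"parameters: {param_ratio:,.0f}x bigger")
print(f"tokens:     {token_ratio:,.0f}x bigger")
```

Counting by parameters lands near the low end of the quoted range, and counting by training tokens lands at the high end.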