[0:00] hi everyone so by now you have probably [0:02] heard of chat GPT it has taken the world [0:04] and AI Community by storm and it is a [0:07] system that allows you to interact with [0:09] an AI and give it text based tasks so [0:12] for example we can ask chat GPT to write [0:15] us a small Hau about how important it is [0:16] that people understand Ai and then they [0:18] can use it to improve the world and make [0:20] it more prosperous so when we run this [0:23] AI knowledge brings prosperity for all [0:25] to see Embrace its [0:27] power okay not bad and so you could see [0:29] that chpt went from left to right and [0:32] generated all these words SE sort of [0:35] sequentially now I asked it already the [0:37] exact same prompt a little bit earlier [0:39] and it generated a slightly different [0:41] outcome ai's power to grow ignorance [0:44] holds us back learn Prosperity weights [0:47] so uh pretty good in both cases and [0:49] slightly different so you can see that [0:50] chat GPT is a probabilistic system and [0:52] for any one prompt it can give us [0:54] multiple answers sort of uh replying to [0:57] it now this is just one example of a [0:59] problem people have come up with many [1:01] many examples and there are entire [1:03] websites that index interactions with [1:06] chpt and so many of them are quite [1:08] humorous explain HTML to me like I'm a [1:10] dog uh write release notes for chess 2 [1:14] write a note about Elon Musk buying a [1:16] Twitter and so on so as an example uh [1:20] please write a breaking news article [1:21] about a leaf falling from a [1:23] tree uh and a shocking turn of events a [1:26] leaf has fallen from a tree in the local [1:28] park Witnesses report that the leaf [1:30] which was previously attached to a [1:31] branch of a tree attached itself and [1:33] fell to the ground very dramatic so you [1:36] can see that this is a pretty remarkable [1:37] system and it is what we call a language [1:40] model uh because it um it models the [1:43] sequence of words or characters or [1:46] tokens more generally and it knows how [1:49] sort of words follow each other in [1:50] English language and so from its [1:52] perspective what it is doing is it is [1:55] completing the sequence so I give it the [1:57] start of a sequence and it completes the [2:00] sequence with the outcome and so it's a [2:02] language model in that sense now I would [2:05] like to focus on the under the hood of [2:07] um under the hood components of what [2:09] makes CH GPT work so what is the neural [2:12] network under the hood that models the [2:14] sequence of these words and that comes [2:17] from this paper called attention is all [2:19] you need in 2017 a landmark paper a [2:23] landmark paper in AI that produced and [2:25] proposed the Transformer [2:27] architecture so GPT is uh short for [2:31] generally generatively pre-trained [2:33] Transformer so Transformer is the neuron [2:35] nut that actually does all the heavy [2:36] lifting under the hood it comes from [2:39] this paper in 2017 now if you read this [2:41] paper this uh reads like a pretty random [2:44] machine translation paper and that's [2:46] because I think the authors didn't fully [2:47] anticipate the impact that the [2:49] Transformer would have on the field and [2:51] this architecture that they produced in [2:52] the context of machine translation in [2:54] their case actually ended up taking over [2:57] uh the rest of AI in the next 5 years [3:00] after and so this architecture with [3:02] minor changes was copy pasted into a [3:05] huge amount of applications in AI in [3:07] more recent years and that includes at [3:10] the core of chat GPT now we are not [3:13] going to what I'd like to do now is I'd [3:15] like to build out something like chat [3:17] GPT but uh we're not going to be able to [3:19] of course reproduce chat GPT this is a [3:21] very serious production grade system it [3:23] is trained on uh a good chunk of [3:26] internet and then there's a lot of uh [3:29] pre-training and fine-tuning stages to [3:31] it and so it's very complicated what I'd [3:33] like to focus on is just to train a [3:36] Transformer based language model and in [3:38] our case it's going to be a character [3:40] level language model I still think that [3:43] is uh very educational with respect to [3:45] how these systems work so I don't want [3:47] to train on the chunk of Internet we [3:48] need a smaller data set in this case I [3:51] propose that we work with uh my favorite [3:53] toy data set it's called tiny [3:55] Shakespeare and um what it is is [3:57] basically it's a concatenation of all of [3:59] the works of sh Shakespeare in my [4:00] understanding and so this is all of [4:02] Shakespeare in a single file uh this [4:05] file is about 1 megab and it's just all [4:07] of [4:08] Shakespeare and what we are going to do [4:10] now is we're going to basically model [4:12] how these characters uh follow each [4:14] other so for example given a chunk of [4:16] these characters like this uh given some [4:19] context of characters in the past the [4:22] Transformer neural network will look at [4:24] the characters that I've highlighted and [4:26] is going to predict that g is likely to [4:28] come next in the sequence and it's going [4:30] to do that because we're going to train [4:31] that Transformer on Shakespeare and it's [4:34] just going to try to produce uh [4:36] character sequences that look like this [4:39] and in that process is going to model [4:40] all the patterns inside this data so [4:43] once we've trained the system i' just [4:45] like to give you a preview we can [4:47] generate infinite Shakespeare and of [4:49] course it's a fake thing that looks kind [4:51] of like [4:53] Shakespeare [4:55] um apologies for there's some Jank that [4:59] I'm not able to resolve in in here but [5:02] um you can see how this is going [5:05] character by character and it's kind of [5:07] like predicting Shakespeare like [5:09] language so verily my Lord the sites [5:12] have left the again the king coming with [5:15] my curses with precious pale and then [5:19] tranos say something else Etc and this [5:21] is just coming out of the Transformer in [5:23] a very similar manner as it would come [5:25] out in chat GPT in our case character by [5:27] character in chat GPT uh it's coming out [5:31] on the token by token level and tokens [5:33] are these sort of like little subword [5:35] pieces so they're not Word level they're [5:36] kind of like word chunk [5:38] level um and now I've already written [5:43] this entire code uh to train these [5:45] Transformers um and it is in a GitHub [5:48] repository that you can find and it's [5:50] called nanog [5:51] GPT so nanog GPT is a repository that [5:54] you can find in my GitHub and it's a [5:56] repository for training Transformers um [5:59] on any given text and what I think is [6:02] interesting about it because there's [6:03] many ways to train Transformers but this [6:05] is a very simple implementation so it's [6:06] just two files of 300 lines of code each [6:10] one file defines the GPT model the [6:12] Transformer and one file trains it on [6:14] some given Text data set and here I'm [6:17] showing that if you train it on a open [6:18] web Text data set which is a fairly [6:20] large data set of web pages then I [6:22] reproduce the the performance of [6:25] gpt2 so gpt2 is an early version of open [6:29] AI GPT uh from 2017 if I recall [6:32] correctly and I've only so far [6:34] reproduced the the smallest 124 million [6:36] parameter model uh but basically this is [6:38] just proving that the codebase is [6:39] correctly arranged and I'm able to load [6:42] the uh neural network weights that openi [6:45] has released later so you can take a [6:48] look at the finished code here in N GPT [6:50] but what I would like to do in this [6:51] lecture is I would like to basically uh [6:55] write this repository from scratch so [6:57] we're going to begin with an empty file [6:59] and we're we're going to define a [7:00] Transformer piece by piece we're going [7:03] to train it on the tiny Shakespeare data [7:05] set and we'll see how we can then uh [7:08] generate infinite Shakespeare and of [7:10] course this can copy paste to any [7:12] arbitrary Text data set uh that you like [7:14] uh but my goal really here is to just [7:16] make you understand and appreciate uh [7:18] how under the hood chat GPT works and um [7:22] really all that's required is a [7:24] Proficiency in Python and uh some basic [7:27] understanding of um calculus and [7:29] statistics [7:30] and it would help if you also see my [7:32] previous videos on the same YouTube [7:34] channel in particular my make more [7:35] series where I um Define smaller and [7:40] simpler neural network language models [7:42] uh so multi perceptrons and so on it [7:45] really introduces the language modeling [7:46] framework and then uh here in this video [7:49] we're going to focus on the Transformer [7:50] neural network itself okay so I created [7:53] a new Google collab uh jup notebook here [7:57] and this will allow me to later easily [7:58] share this code that we're going to [8:00] develop together uh with you so you can [8:01] follow along so this will be in a video [8:03] description uh later now here I've just [8:07] done some preliminaries I downloaded the [8:09] data set the tiny Shakespeare data set [8:10] at this URL and you can see that it's [8:12] about a 1 Megabyte file then here I open [8:15] the input.txt file and just read in all [8:17] the text of the string and we see that [8:20] we are working with 1 million characters [8:22] roughly and the first 1,000 characters [8:24] if we just print them out are basically [8:26] what you would expect this is the first [8:28] 1,000 characters of the tiny Shakespeare [8:30] data set roughly up to here so so far so [8:34] good next we're going to take this text [8:37] and the text is a sequence of characters [8:39] in Python so when I call the set [8:41] Constructor on it I'm just going to get [8:44] the set of all the characters that occur [8:46] in this text and then I call list on [8:49] that to create a list of those [8:51] characters instead of just a set so that [8:53] I have an ordering an arbitrary ordering [8:56] and then I sort that so basically we get [8:59] just all the characters that occur in [9:00] the entire data set and they're sorted [9:02] now the number of them is going to be [9:04] our vocabulary size these are the [9:06] possible elements of our sequences and [9:09] we see that when I print here the [9:11] characters there's 65 of them in total [9:14] there's a space character and then all [9:16] kinds of special characters and then U [9:19] capitals and lowercase letters so that's [9:21] our vocabulary and that's the sort of [9:23] like possible uh characters that the [9:25] model can see or emit okay so next we [9:29] will would like to develop some strategy [9:31] to tokenize the input text now when [9:35] people say tokenize they mean convert [9:36] the raw text as a string to some [9:39] sequence of integers According to some [9:41] uh notebook According to some vocabulary [9:43] of possible elements so as an example [9:46] here we are going to be building a [9:48] character level language model so we're [9:49] simply going to be translating [9:50] individual characters into integers so [9:53] let me show you uh a chunk of code that [9:55] sort of does that for us so we're [9:57] building both the encoder and the [9:58] decoder [10:00] and let me just talk through what's [10:01] happening [10:02] here when we encode an arbitrary text [10:05] like hi there we're going to receive a [10:08] list of integers that represents that [10:10] string so for example 46 47 Etc and then [10:14] we also have the reverse mapping so we [10:17] can take this list and decode it to get [10:20] back the exact same string so it's [10:22] really just like a translation to [10:24] integers and back for arbitrary string [10:26] and for us it is done on a character [10:28] level [10:30] now the way this was achieved is we just [10:31] iterate over all the characters here and [10:34] create a lookup table from the character [10:35] to the integer and vice versa and then [10:38] to encode some string we simply [10:40] translate all the characters [10:41] individually and to decode it back we [10:44] use the reverse mapping and concatenate [10:46] all of it now this is only one of many [10:49] possible encodings or many possible sort [10:51] of tokenizers and it's a very simple one [10:54] but there's many other schemas that [10:55] people have come up with in practice so [10:57] for example Google uses a sentence [10:59] piece uh so sentence piece will also [11:02] encode text into um integers but in a [11:05] different schema and using a different [11:08] vocabulary and sentence piece is a [11:10] subword uh sort of tokenizer and what [11:13] that means is that um you're not [11:15] encoding entire words but you're not [11:17] also encoding individual characters it's [11:19] it's a subword unit level and that's [11:22] usually what's adopted in practice for [11:24] example also openai has this Library [11:26] called tick token that uses a bite pair [11:28] encode [11:29] tokenizer um and that's what GPT uses [11:33] and you can also just encode words into [11:35] like hell world into a list of integers [11:38] so as an example I'm using the Tik token [11:40] Library here I'm getting the encoding [11:43] for gpt2 or that was used for gpt2 [11:46] instead of just having 65 possible [11:48] characters or tokens they have 50,000 [11:51] tokens and so when they encode the exact [11:54] same string High there we only get a [11:57] list of three integers but those [11:59] integers are not between 0 and 64 they [12:01] are between Z and 5, [12:05] 5,256 so basically you can trade off the [12:09] code book size and the sequence lengths [12:12] so you can have very long sequences of [12:13] integers with very small vocabularies or [12:16] we can have short um sequences of [12:20] integers with very large vocabularies [12:23] and so typically people use in practice [12:25] these subword encodings but I'd like to [12:28] keep our token ier very simple so we're [12:30] using character level tokenizer and that [12:33] means that we have very small code books [12:35] we have very simple encode and decode [12:37] functions uh but we do get very long [12:40] sequences as a result but that's the [12:42] level at which we're going to stick with [12:43] this lecture because it's the simplest [12:45] thing okay so now that we have an [12:46] encoder and a decoder effectively a [12:49] tokenizer we can tokenize the entire [12:51] training set of Shakespeare so here's a [12:53] chunk of code that does that and I'm [12:55] going to start to use the pytorch [12:56] library and specifically the torch. [12:58] tensor from the pytorch library so we're [13:01] going to take all of the text in tiny [13:03] Shakespeare encode it and then wrap it [13:05] into a torch. tensor to get the data [13:08] tensor so here's what the data tensor [13:10] looks like when I look at just the first [13:12] 1,000 characters or the 1,000 elements [13:14] of it so we see that we have a massive [13:16] sequence of integers and this sequence [13:18] of integers here is basically an [13:20] identical translation of the first [13:22] 10,000 characters [13:24] here so I believe for example that zero [13:27] is a new line character and maybe one [13:29] one is a space not 100% sure but from [13:32] now on the entire data set of text is [13:34] re-represented as just it's just [13:35] stretched out as a single very large uh [13:38] sequence of [13:39] integers let me do one more thing before [13:41] we move on here I'd like to separate out [13:43] our data set into a train and a [13:45] validation split so in particular we're [13:48] going to take the first 90% of the data [13:51] set and consider that to be the training [13:52] data for the Transformer and we're going [13:54] to withhold the last 10% at the end of [13:56] it to be the validation data and this [13:59] will help us understand to what extent [14:01] our model is overfitting so we're going [14:03] to basically hide and keep the [14:04] validation data on the side because we [14:06] don't want just a perfect memorization [14:08] of this exact Shakespeare we want a [14:11] neural network that sort of creates [14:12] Shakespeare like uh text and so it [14:15] should be fairly likely for it to [14:17] produce the actual like stowed away uh [14:21] true Shakespeare text um and so we're [14:24] going to use this to uh get a sense of [14:26] the overfitting okay so now we would [14:28] like to start plugging these text [14:30] sequences or integer sequences into the [14:32] Transformer so that it can train and [14:34] learn those patterns now the important [14:36] thing to realize is we're never going to [14:38] actually feed entire text into a [14:40] Transformer all at once that would be [14:42] computationally very expensive and [14:44] prohibitive so when we actually train a [14:46] Transformer on a lot of these data sets [14:48] we only work with chunks of the data set [14:50] and when we train the Transformer we [14:52] basically sample random little chunks [14:53] out of the training set and train on [14:55] just chunks at a time and these chunks [14:58] have basically some kind of a length and [15:01] some maximum length now the maximum [15:04] length typically at least in the code I [15:06] usually write is called block size you [15:08] can you can uh find it under different [15:10] names like context length or something [15:12] like that let's start with the block [15:14] size of just eight and let me look at [15:16] the first train data characters the [15:18] first block size plus one characters [15:20] I'll explain why plus one in a [15:22] second so this is the first nine [15:24] characters in the sequence in the [15:27] training set now what I'd like to point [15:30] out is that when you sample a chunk of [15:31] data like this so say the these nine [15:34] characters out of the training set this [15:36] actually has multiple examples packed [15:38] into it and uh that's because all of [15:41] these characters follow each other and [15:43] so what this thing is going to say when [15:47] we plug it into a Transformer is we're [15:49] going to actually simultaneously train [15:50] it to make prediction at every one of [15:52] these [15:53] positions now in the in a chunk of nine [15:56] characters there's actually eight indiv [15:58] ual examples packed in there so there's [16:01] the example that when 18 when in the [16:04] context of 18 47 likely comes next in a [16:08] context of 18 and 47 56 comes next in a [16:12] context of 18 47 56 57 can come next and [16:16] so on so that's the eight individual [16:18] examples let me actually spell it out [16:20] with [16:21] code so here's a chunk of code to [16:24] illustrate X are the inputs to the [16:26] Transformer it will just be the first [16:28] block size characters y will be the uh [16:32] next block size characters so it's [16:34] offset by one and that's because y are [16:37] the targets for each position in the [16:40] input and then here I'm iterating over [16:42] all the block size of eight and the [16:45] context is always all the characters in [16:47] x uh up to T and including T and the [16:51] target is always the teth character but [16:53] in the targets array y so let me just [16:56] run [16:57] this and basically it spells out what I [16:59] said in words uh these are the eight [17:02] examples hidden in a chunk of nine [17:04] characters that we uh sampled from the [17:08] training set I want to mention one more [17:11] thing we train on all the eight examples [17:14] here with context between one all the [17:16] way up to context of block size and we [17:19] train on that not just for computational [17:20] reasons because we happen to have the [17:22] sequence already or something like that [17:23] it's not just done for efficiency it's [17:26] also done um to make the Transformer [17:28] Network be used to seeing contexts all [17:32] the way from as little as one all the [17:33] way to block size and we'd like the [17:36] transform to be used to seeing [17:38] everything in between and that's going [17:39] to be useful later during inference [17:41] because while we're sampling we can [17:43] start the sampling generation with as [17:45] little as one character of context and [17:47] the Transformer knows how to predict the [17:49] next character with all the way up to [17:51] just context of one and so then it can [17:53] predict everything up to block size and [17:55] after block size we have to start [17:56] truncating because the Transformer will [17:58] will never um receive more than block [18:01] size inputs when it's predicting the [18:03] next [18:03] character Okay so we've looked at the [18:06] time dimension of the tensors that are [18:07] going to be feeding into the Transformer [18:09] there's one more Dimension to care about [18:11] and that is the batch Dimension and so [18:13] as we're sampling these chunks of text [18:15] we're going to be actually every time [18:17] we're going to feed them into a [18:18] Transformer we're going to have many [18:20] batches of multiple chunks of text that [18:22] are all like stacked up in a single [18:23] tensor and that's just done for [18:25] efficiency just so that we can keep the [18:27] gpus busy uh because they are very good [18:29] at parallel processing of um of data and [18:33] so we just want to process multiple [18:35] chunks all at the same time but those [18:37] chunks are processed completely [18:38] independently they don't talk to each [18:39] other and so on so let me basically just [18:42] generalize this and introduce a batch [18:44] Dimension here's a chunk of [18:46] code let me just run it and then I'm [18:48] going to explain what it [18:50] does so here because we're going to [18:52] start sampling random locations in the [18:54] data set to pull chunks from I am [18:57] setting the seed so that um in the [19:00] random number generator so that the [19:01] numbers I see here are going to be the [19:02] same numbers you see later if you try to [19:04] reproduce this now the batch size here [19:07] is how many independent sequences we are [19:09] processing every forward backward pass [19:11] of the [19:12] Transformer the block size as I [19:14] explained is the maximum context length [19:16] to make those predictions so let's say B [19:19] size four block size eight and then [19:21] here's how we get batch for any [19:23] arbitrary split if the split is a [19:25] training split then we're going to look [19:26] at train data otherwise at valid data [19:30] that gives us the data array and then [19:33] when I Generate random positions to grab [19:35] a chunk out of I actually grab I [19:38] actually generate batch size number of [19:41] Random offsets so because this is four [19:44] we are ex is going to be a uh four [19:47] numbers that are randomly generated [19:49] between zero and Len of data minus block [19:51] size so it's just random offsets into [19:53] the training [19:54] set and then X's as I explained are the [19:58] first first block size characters [20:00] starting at I the Y's are the offset by [20:05] one of that so just add plus one and [20:08] then we're going to get those chunks for [20:10] every one of integers I INX and use a [20:14] torch. stack to take all those uh uh [20:17] one-dimensional tensors as we saw here [20:20] and we're going to um stack them up at [20:24] rows and so they all become a row in a [20:27] 4x8 tensor [20:29] so here's where I'm printing then when I [20:32] sample a batch XB and YB the inputs to [20:35] the Transformer now are the input X is [20:39] the 4x8 tensor four uh rows of eight [20:44] columns and each one of these is a chunk [20:47] of the training [20:48] set and then the targets here are in the [20:52] associated array Y and they will come in [20:54] to the Transformer all the way at the [20:55] end uh to um create the loss function [20:59] uh so they will give us the correct [21:01] answer for every single position inside [21:03] X and then these are the four [21:06] independent [21:07] rows so spelled out as we did [21:11] before uh this 4x8 array contains a [21:14] total of 32 examples and they're [21:17] completely independent as far as the [21:19] Transformer is [21:20] concerned uh so when the input is 24 the [21:25] target is 43 or rather 43 here in the Y [21:28] array [21:29] when the input is 2443 the target is [21:31] 58 uh when the input is 24 43 58 the [21:34] target is 5 Etc or like when it is a 52 [21:38] 581 the target is [21:40] 58 right so you can sort of see this [21:43] spelled out these are the 32 independent [21:45] examples packed in to a single batch of [21:48] the input X and then the desired targets [21:51] are in y and so now this integer tensor [21:57] of um X is going to feed into the [22:00] Transformer and that Transformer is [22:02] going to simultaneously process all [22:04] these examples and then look up the [22:06] correct um integers to predict in every [22:08] one of these positions in the tensor y [22:11] okay so now that we have our batch of [22:13] input that we'd like to feed into a [22:15] Transformer let's start basically [22:16] feeding this into neural networks now [22:19] we're going to start off with the [22:20] simplest possible neural network which [22:22] in the case of language modeling in my [22:23] opinion is the Byram language model and [22:25] we've covered the Byram language model [22:26] in my make more series in a lot of depth [22:29] and so here I'm going to sort of go [22:31] faster and let's just Implement pytorch [22:33] module directly that implements the byr [22:36] language [22:36] model so I'm importing the pytorch um NN [22:41] module uh for [22:43] reproducibility and then here I'm [22:44] constructing a Byram language model [22:46] which is a subass of NN [22:48] module and then I'm calling it and I'm [22:51] passing it the inputs and the targets [22:53] and I'm just printing now when the [22:55] inputs on targets come here you see that [22:57] I'm just taking the index uh the inputs [23:00] X here which I rename to idx and I'm [23:03] just passing them into this token [23:04] embedding table so it's going on here is [23:07] that here in the Constructor we are [23:09] creating a token embedding table and it [23:12] is of size vocap size by vocap [23:15] size and we're using an. embedding which [23:18] is a very thin wrapper around basically [23:20] a tensor of shape voap size by vocab [23:23] size and what's happening here is that [23:25] when we pass idx here every single [23:28] integer in our input is going to refer [23:30] to this embedding table and it's going [23:32] to pluck out a row of that embedding [23:34] table corresponding to its index so 24 [23:37] here will go into the embedding table [23:39] and we'll pluck out the 24th row and [23:42] then 43 will go here and pluck out the [23:44] 43d row Etc and then pytorch is going to [23:47] arrange all of this into a batch by Time [23:50] by channel uh tensor in this case batch [23:53] is four time is eight and C which is the [23:57] channels is vocab size or 65 and so [24:01] we're just going to pluck out all those [24:02] rows arrange them in a b by T by C and [24:05] now we're going to interpret this as the [24:07] logits which are basically the scores [24:10] for the next character in the sequence [24:12] and so what's happening here is we are [24:14] predicting what comes next based on just [24:17] the individual identity of a single [24:19] token and you can do that because um I [24:22] mean currently the tokens are not [24:23] talking to each other and they're not [24:25] seeing any context except for they're [24:26] just seeing themselves so I'm a f I'm a [24:29] token number five and then I can [24:32] actually make pretty decent predictions [24:33] about what comes next just by knowing [24:35] that I'm token five because some [24:37] characters uh know um C follow other [24:39] characters in in typical scenarios so we [24:42] saw a lot of this in a lot more depth in [24:44] the make more series and here if I just [24:46] run this then we currently get the [24:49] predictions the scores the lits for [24:53] every one of the 4x8 positions now that [24:55] we've made predictions about what comes [24:57] next we'd like to evaluate the loss [24:58] function and so in make more series we [25:00] saw that a good way to measure a loss or [25:03] like a quality of the predictions is to [25:05] use the negative log likelihood loss [25:07] which is also implemented in pytorch [25:09] under the name cross entropy so what we' [25:12] like to do here is loss is the cross [25:15] entropy on the predictions and the [25:17] targets and so this measures the quality [25:20] of the logits with respect to the [25:21] Targets in other words we have the [25:24] identity of the next character so how [25:26] well are we predicting the next [25:28] character based on the lits and [25:30] intuitively the correct um the correct [25:33] dimension of low jits uh depending on [25:36] whatever the target is should have a [25:38] very high number and all the other [25:39] dimensions should be very low number [25:41] right now the issue is that this won't [25:44] actually this is what we want we want to [25:46] basically output the logits and the [25:50] loss this is what we want but [25:52] unfortunately uh this won't actually run [25:55] we get an error message but intuitively [25:57] we want to uh measure this now when we [26:01] go to the pytorch um cross entropy [26:04] documentation here um we're trying to [26:08] call the cross entropy in its functional [26:10] form uh so that means we don't have to [26:11] create like a module for it but here [26:14] when we go to the documentation you have [26:16] to look into the details of how pitor [26:18] expects these inputs and basically the [26:20] issue here is ptor expects if you have [26:24] multi-dimensional input which we do [26:25] because we have a b BYT by C tensor then [26:28] it actually really wants the channels to [26:31] be the second uh Dimension here so if [26:35] you um so basically it wants a b by C [26:38] BYT instead of a b by T by C and so it's [26:42] just the details of how P torch treats [26:45] um these kinds of inputs and so we don't [26:49] actually want to deal with that so what [26:51] we're going to do instead is we need to [26:52] basically reshape our logits so here's [26:54] what I like to do I like to take [26:56] basically give names to the dimensions [26:58] so lit. shape is B BYT by C and unpack [27:01] those numbers and then let's uh say that [27:04] logits equals lit. View and we want it [27:07] to be a b * c b * T by C so just a two- [27:11] dimensional [27:12] array right so we're going to take all [27:15] the we're going to take all of these um [27:18] positions here and we're going to uh [27:20] stretch them out in a onedimensional [27:22] sequence and uh preserve the channel [27:25] Dimension as the second [27:26] dimension so we're just kind of like [27:28] stretching out the array so it's two- [27:29] dimensional and in that case it's going [27:31] to better conform to what pytorch uh [27:33] sort of expects in its Dimensions now we [27:36] have to do the same to targets because [27:38] currently targets are um of shape B by T [27:44] and we want it to be just B * T so [27:47] onedimensional now alternatively you [27:49] could always still just do minus one [27:51] because pytor will guess what this [27:53] should be if you want to lay it out uh [27:55] but let me just be explicit and say p * [27:57] t once we've reshaped this it will match [28:00] the cross entropy case and then we [28:03] should be able to evaluate our [28:06] loss okay so that R now and we can do [28:10] loss and So currently we see that the [28:12] loss is [28:13] 4.87 now because our uh we have 65 [28:17] possible vocabulary elements we can [28:19] actually guess at what the loss should [28:20] be and in [28:22] particular we covered negative log [28:24] likelihood in a lot of detail we are [28:26] expecting log or lawn of um 1 over 65 [28:32] and negative of that so we're expecting [28:34] the loss to be about 4.1 17 but we're [28:37] getting 4.87 and so that's telling us [28:40] that the initial predictions are not uh [28:42] super diffuse they've got a little bit [28:43] of entropy and so we're guessing wrong [28:47] uh so uh yes but actually we're I a we [28:50] are able to evaluate the loss okay so [28:53] now that we can evaluate the quality of [28:54] the model on some data we'd like to also [28:57] be able to generate from the model so [28:59] let's do the generation now I'm going to [29:01] go again a little bit faster here [29:03] because I covered all this already in [29:04] previous [29:05] videos [29:07] so here's a generate function for the [29:11] model so we take some uh we take the the [29:15] same kind of input idx here and [29:18] basically this is the current uh context [29:22] of some characters in a batch in some [29:24] batch so it's also B BYT and the job of [29:28] generate is to basically take this B BYT [29:30] and extend it to be B BYT + 1 plus 2 [29:32] plus 3 and so it's just basically it [29:34] continues the generation in all the [29:36] batch dimensions in the time Dimension [29:39] So that's its job and it will do that [29:41] for Max new tokens so you can see here [29:43] on the bottom there's going to be some [29:45] stuff here but on the bottom whatever is [29:47] predicted is concatenated on top of the [29:50] previous idx along the First Dimension [29:53] which is the time Dimension to create a [29:54] b BYT + one so that becomes a new idx so [29:58] the job of generate is to take a b BYT [30:00] and make it a b BYT plus 1 plus 2 plus [30:02] three as many as we want Max new tokens [30:05] so this is the generation from the model [30:08] now inside the generation what what are [30:10] we doing we're taking the current [30:11] indices we're getting the predictions so [30:15] we get uh those are in the low jits and [30:18] then the loss here is going to be [30:19] ignored because um we're not we're not [30:21] using that and we have no targets that [30:23] are sort of ground truth targets that [30:25] we're going to be comparing with [30:28] then once we get the logits we are only [30:30] focusing on the last step so instead of [30:33] a b by T by C we're going to pluck out [30:36] the negative-1 the last element in the [30:38] time Dimension because those are the [30:40] predictions for what comes next so that [30:42] gives us the logits which we then [30:44] convert to probabilities via softmax and [30:47] then we use tor. multinomial to sample [30:49] from those probabilities and we ask [30:51] pytorch to give us one sample and so idx [30:54] next will become a b by one because in [30:57] each uh one of the batch Dimensions [31:00] we're going to have a single prediction [31:01] for what comes next so this num samples [31:03] equals one will make this be a [31:06] one and then we're going to take those [31:08] integers that come from the sampling [31:10] process according to the probability [31:11] distribution given here and those [31:13] integers got just concatenated on top of [31:15] the current sort of like running stream [31:17] of integers and this gives us a b BYT + [31:20] one and then we can return that now one [31:24] thing here is you see how I'm calling [31:26] self of idx which will end up going to [31:29] the forward function I'm not providing [31:31] any Targets So currently this would give [31:33] an error because targets is uh is uh [31:36] sort of like not given so targets has to [31:39] be optional so targets is none by [31:41] default and then if targets is none then [31:44] there's no loss to create so it's just [31:47] loss is none but else all of this [31:50] happens and we can create a loss so this [31:53] will make it so um if we have the [31:56] targets we provide them and get a loss [31:57] if we have no targets it will'll just [31:59] get the [32:00] loits so this here will generate from [32:02] the model um and let's take that for a [32:06] ride [32:08] now oops so I have another code chunk [32:11] here which will generate for the model [32:13] from the model and okay this is kind of [32:15] crazy so maybe let me let me break this [32:18] down so these are the idx [32:23] right I'm creating a batch will be just [32:26] one time will be just one so I'm [32:30] creating a little one by one tensor and [32:32] it's holding a zero and the D type the [32:35] data type is uh integer so zero is going [32:38] to be how we kick off the generation and [32:40] remember that zero is uh is the element [32:44] standing for a new line character so [32:45] it's kind of like a reasonable thing to [32:47] to feed in as the very first character [32:49] in a sequence to be the new [32:51] line um so it's going to be idx which [32:54] we're going to feed in here then we're [32:56] going to ask for 100 tokens [32:58] and then. generate will continue that [33:01] now because uh generate works on the [33:05] level of batches we we then have to [33:07] index into the zero throw to basically [33:09] unplug the um the single batch Dimension [33:13] that exists and then that gives us a um [33:18] time steps just a onedimensional array [33:20] of all the indices which we will convert [33:23] to simple python list from pytorch [33:26] tensor so that that can feed into our [33:28] decode function and uh convert those [33:32] integers into text so let me bring this [33:34] back and we're generating 100 tokens [33:37] let's [33:37] run and uh here's the generation that we [33:40] achieved so obviously it's garbage and [33:43] the reason it's garbage is because this [33:44] is a totally random model so next up [33:47] we're going to want to train this model [33:49] now one more thing I wanted to point out [33:50] here is this function is written to be [33:53] General but it's kind of like ridiculous [33:55] right now because [33:58] we're feeding in all this we're building [33:59] out this context and we're concatenating [34:02] it all and we're always feeding it all [34:05] into the model but that's kind of [34:07] ridiculous because this is just a simple [34:09] Byram model so to make for example this [34:11] prediction about K we only needed this W [34:14] but actually what we fed into the model [34:15] is we fed the entire sequence and then [34:18] we only looked at the very last piece [34:20] and predicted K so the only reason I'm [34:23] writing it in this way is because right [34:25] now this is a byr model but I'd like to [34:27] keep keep this function fixed and I'd [34:29] like it to work um later when our [34:32] characters actually um basically look [34:35] further in the history and so right now [34:37] the history is not used so this looks [34:39] silly uh but eventually the history will [34:42] be used and so that's why we want to uh [34:44] do it this way so just a quick comment [34:46] on that so now we see that this is um [34:49] random so let's train the model so it [34:51] becomes a bit less random okay let's Now [34:53] train the model so first what I'm going [34:55] to do is I'm going to create a pyour [34:57] optimization object so here we are using [35:00] the optimizer ATM W um now in a make [35:05] more series we've only ever use tastic [35:06] gradi in descent the simplest possible [35:08] Optimizer which you can get using the [35:10] SGD instead but I want to use Adam which [35:12] is a much more advanced and popular [35:14] Optimizer and it works extremely well [35:16] for uh typical good setting for the [35:19] learning rate is roughly 3 E4 uh but for [35:22] very very small networks like is the [35:23] case here you can get away with much [35:25] much higher learning rates R3 or even [35:28] higher probably but let me create the [35:30] optimizer object which will basically [35:33] take the gradients and uh update the [35:35] parameters using the [35:36] gradients and then here our batch size [35:40] up above was only four so let me [35:41] actually use something bigger let's say [35:43] 32 and then for some number of steps um [35:46] we are sampling a new batch of data [35:48] we're evaluating the loss uh we're [35:51] zeroing out all the gradients from the [35:52] previous step getting the gradients for [35:54] all the parameters and then using those [35:56] gradients to up update our parameters so [35:58] typical training loop as we saw in the [36:00] make more series so let me now uh run [36:04] this for say 100 iterations and let's [36:07] see what kind of losses we're going to [36:09] get so we started around [36:12] 4.7 and now we're getting to down to [36:14] like 4.6 4.5 Etc so the optimization is [36:18] definitely happening but um let's uh [36:22] sort of try to increase number of [36:23] iterations and only print at the [36:25] end because we probably want train for [36:29] longer okay so we're down to 3.6 [36:34] roughly roughly down to [36:40] three this is the most janky [36:46] optimization okay it's working let's [36:48] just do [36:50] 10,000 and then from here we want to [36:53] copy this and hopefully that we're going [36:56] to get something reason and of course [36:58] it's not going to be Shakespeare from a [37:00] byr model but at least we see that the [37:01] loss is improving and uh hopefully we're [37:05] expecting something a bit more [37:06] reasonable okay so we're down at about [37:08] 2.5 is let's see what we get okay [37:12] dramatic improvements certainly on what [37:14] we had here so let me just increase the [37:17] number of tokens okay so we see that [37:19] we're starting to get something at least [37:21] like reasonable is [37:25] um certainly not shakes spear but uh the [37:29] model is making progress so that is the [37:31] simplest possible [37:33] model so now what I'd like to do [37:36] is obviously this is a very simple model [37:39] because the tokens are not talking to [37:41] each other so given the previous context [37:43] of whatever was generated we're only [37:45] looking at the very last character to [37:46] make the predictions about what comes [37:48] next so now these uh now these tokens [37:50] have to start talking to each other and [37:53] figuring out what is in the context so [37:55] that they can make better predictions [37:56] for what comes next and this is how [37:57] we're going to kick off the uh [37:59] Transformer okay so next I took the code [38:02] that we developed in this juper notebook [38:03] and I converted it to be a script and [38:05] I'm doing this because I just want to [38:08] simplify our intermediate work into just [38:10] the final product that we have at this [38:12] point so in the top here I put all the [38:15] hyp parameters that we to find I [38:16] introduced a few and I'm going to speak [38:18] to that in a little bit otherwise a lot [38:20] of this should be recognizable uh [38:23] reproducibility read data get the [38:25] encoder and the decoder create the train [38:27] into splits uh use the uh kind of like [38:30] data loader um that gets a batch of the [38:34] inputs and Targets this is new and I'll [38:36] talk about it in a second now this is [38:39] the Byram language model that we [38:40] developed and it can forward and give us [38:43] a logits and loss and it can [38:45] generate and then here we are creating [38:48] the optimizer and this is the training [38:51] Loop so everything here should look [38:53] pretty familiar now some of the small [38:55] things that I added number one I added [38:57] the ability to run on a GPU if you have [39:00] it so if you have a GPU then you can [39:02] this will use Cuda instead of just CPU [39:04] and everything will be a lot more faster [39:07] now when device becomes Cuda then we [39:09] need to make sure that when we load the [39:11] data we move it to [39:13] device when we create the model we want [39:15] to move uh the model parameters to [39:18] device so as an example here we have the [39:21] N an embedding table and it's got a [39:23] weight inside it which stores the uh [39:26] sort of lookup table so so that would be [39:27] moved to the GPU so that all the [39:29] calculations here happen on the GPU and [39:32] they can be a lot faster and then [39:34] finally here when I'm creating the [39:35] context that feeds in to generate I have [39:37] to make sure that I create it on the [39:39] device number two what I introduced is [39:43] uh the fact that here in the training [39:46] Loop here I was just printing the um l. [39:50] item inside the training Loop but this [39:53] is a very noisy measurement of the [39:54] current loss because every batch will be [39:56] more or less lucky and so what I want to [39:59] do usually um is uh I have an estimate [40:02] loss function and the estimate loss [40:05] basically then um goes up here and it [40:10] averages up the loss over multiple [40:12] batches so in particular we're going to [40:15] iterate eval iter times and we're going [40:17] to basically get our loss and then we're [40:19] going to get the average loss for both [40:21] splits and so this will be a lot less [40:24] noisy so here when we call the estimate [40:26] loss we're we're going to report the uh [40:28] pretty accurate train and validation [40:31] loss now when we come back up you'll [40:33] notice a few things here I'm setting the [40:35] model to evaluation phase and down here [40:38] I'm resetting it back to training phase [40:40] now right now for our model as is this [40:42] doesn't actually do anything because the [40:44] only thing inside this model is this uh [40:46] nn. embedding and um this this um [40:51] Network would behave both would behave [40:53] the same in both evaluation mode and [40:55] training mode we have no drop off layers [40:57] we have no batm layers Etc but it is a [41:00] good practice to Think Through what mode [41:02] your neural network is in because some [41:04] layers will have different Behavior Uh [41:07] at inference time or training time and [41:11] there's also this context manager torch [41:12] up nograd and this is just telling [41:14] pytorch that everything that happens [41:16] inside this function we will not call do [41:18] backward on and so pytorch can be a lot [41:21] more efficient with its memory use [41:23] because it doesn't have to store all the [41:25] intermediate variables uh because we're [41:27] never going to call backward and so it [41:29] can it can be a lot more memory [41:30] efficient in that way so also a good [41:32] practice to tpy torch when we don't [41:35] intend to do back [41:36] propagation so right now this script is [41:39] about 120 lines of code of and that's [41:43] kind of our starter code I'm calling it [41:45] b.p and I'm going to release it later [41:48] now running this [41:50] script gives us output in the terminal [41:52] and it looks something like this it [41:54] basically as I ran this code uh it was [41:57] giving me the train loss and Val loss [41:59] and we see that we convert to somewhere [42:01] around [42:01] 2.5 with the pyr model and then here's [42:04] the sample that we produced at the [42:07] end and so we have everything packaged [42:09] up in the script and we're in a good [42:11] position now to iterate on this okay so [42:13] we are almost ready to start writing our [42:15] very first self attention block for [42:18] processing these uh tokens now before we [42:22] actually get there I want to get you [42:24] used to a mathematical trick that is [42:26] used in the self attention inside a [42:28] Transformer and is really just like at [42:30] the heart of an an efficient [42:32] implementation of self attention and so [42:34] I want to work with this toy example to [42:36] just get you used to this operation and [42:38] then it's going to make it much more [42:39] clear once we actually get to um to it [42:43] uh in the script [42:44] again so let's create a b BYT by C where [42:47] BT and C are just 48 and two in the toy [42:50] example and these are basically channels [42:53] and we have uh batches and we have the [42:55] time component and we have information [42:58] at each point in the sequence so [43:01] see now what we would like to do is we [43:03] would like these um tokens so we have up [43:06] to eight tokens here in a batch and [43:08] these eight tokens are currently not [43:10] talking to each other and we would like [43:11] them to talk to each other we'd like to [43:13] couple them and in particular we don't [43:17] we we want to couple them in a very [43:18] specific way so the token for example at [43:21] the fifth location it should not [43:23] communicate with tokens in the sixth [43:25] seventh and eighth location [43:27] because uh those are future tokens in [43:29] the sequence the token on the fifth [43:31] location should only talk to the one in [43:33] the fourth third second and first so [43:36] it's only so information only flows from [43:38] previous context to the current time [43:40] step and we cannot get any information [43:42] from the future because we are about to [43:44] try to predict the [43:45] future so what is the easiest way for [43:49] tokens to communicate okay the easiest [43:52] way I would say is okay if we're up to [43:54] if we're a fifth token and I'd like to [43:56] communicate with my past the simplest [43:58] way we can do that is to just do a [44:00] weight is to just do an average of all [44:03] the um of all the preceding elements so [44:06] for example if I'm the fif token I would [44:08] like to take the channels uh that make [44:10] up that are information at my step but [44:13] then also the channels from the fourth [44:15] step third step second step and the [44:17] first step I'd like to average those up [44:19] and then that would become sort of like [44:21] a feature Vector that summarizes me in [44:23] the context of my history now of course [44:26] just doing a sum or like an average is [44:28] an extremely weak form of interaction [44:30] like this communication is uh extremely [44:32] lossy we've lost a ton of information [44:34] about the spatial Arrangements of all [44:35] those tokens uh but that's okay for now [44:38] we'll see how we can bring that [44:39] information back later for now what we [44:41] would like to do is for every single [44:43] batch element independently for every [44:46] teeth token in that sequence we'd like [44:49] to now calculate the average of all the [44:53] vectors in all the previous tokens and [44:55] also at this token so let's write that [44:58] out um I have a small snippet here and [45:01] instead of just fumbling around let me [45:03] just copy paste it and talk to [45:05] it so in other words we're going to [45:08] create X and B is short for bag of words [45:12] because bag of words is um is kind of [45:15] like um a term that people use when you [45:17] are just averaging up things so this is [45:19] just a bag of words basically there's a [45:21] word stored on every one of these eight [45:23] locations and we're doing a bag of words [45:25] we're just averaging [45:27] so in the beginning we're going to say [45:28] that it's just initialized at Zero and [45:30] then I'm doing a for Loop here so we're [45:32] not being efficient yet that's coming [45:34] but for now we're just iterating over [45:36] all the batch Dimensions independently [45:38] iterating over time and then the [45:40] previous uh tokens are at this uh batch [45:45] Dimension and then everything up to and [45:47] including the teeth token okay so when [45:51] we slice out X in this way X prev [45:54] Becomes of shape um how many T elements [45:58] there were in the past and then of [46:00] course C so all the two-dimensional [46:02] information from these little tokens so [46:05] that's the previous uh sort of chunk of [46:08] um tokens from my current sequence and [46:12] then I'm just doing the average or the [46:13] mean over the zero Dimension so I'm [46:15] averaging out the time here and I'm just [46:19] going to get a little c one dimensional [46:21] Vector which I'm going to store in X bag [46:23] of words so I can run this and and uh [46:27] this is not going to be very informative [46:30] because let's see so this is X of Zer so [46:32] this is the zeroth batch element and [46:35] then expo at zero now you see how the at [46:40] the first location here you see that the [46:42] two are equal and that's because it's [46:45] we're just doing an average of this one [46:46] token but here this one is now an [46:49] average of these two and now this one is [46:53] an average of these [46:54] three and so on [46:57] so uh and this last one is the average [47:01] of all of these elements so vertical [47:03] average just averaging up all the tokens [47:05] now gives this outcome [47:07] here so this is all well and good uh but [47:10] this is very inefficient now the trick [47:12] is that we can be very very efficient [47:14] about doing this using matrix [47:16] multiplication so that's the [47:18] mathematical trick and let me show you [47:19] what I mean let's work with the toy [47:21] example here let me run it and I'll [47:24] explain I have a simple Matrix here that [47:27] is a 3X3 of all ones a matrix B of just [47:31] random numbers and it's a 3x2 and a [47:33] matrix C which will be 3x3 multip 3x2 [47:36] which will give out a 3x2 so here we're [47:39] just using um matrix multiplication so a [47:43] multiply B gives us [47:46] C okay so how are these numbers in C um [47:51] achieved right so this number in the top [47:54] left is the first row of a dot product [47:57] with the First Column of B and since all [48:00] the the row of a right now is all just [48:02] ones then the do product here with with [48:05] this column of B is just going to do a [48:07] sum of these of this column so 2 + 6 + 6 [48:11] is [48:12] 14 the element here in the output of C [48:15] is also the first column here the first [48:17] row of a multiplied now with the second [48:20] column of B so 7 + 4 + 5 is 16 now you [48:25] see that there's repeating elements here [48:26] so this 14 again is because this row is [48:28] again all ones and it's multiplying the [48:30] First Column of B so we get 14 and this [48:33] one is and so on so this last number [48:35] here is the last row do product last [48:39] column now the trick here is uh the [48:42] following this is just a boring number [48:44] of um it's just a boring array of all [48:48] ones but torch has this function called [48:50] Trail which is short for a [48:54] triangular uh something like that and [48:56] you can wrap it in torch up once and it [48:58] will just return the lower triangular [49:00] portion of this [49:03] okay so now it will basically zero out [49:06] uh these guys here so we just get the [49:08] lower triangular part well what happens [49:10] if we do [49:14] that so now we'll have a like this and B [49:17] like this and now what are we getting [49:18] here in C well what is this number well [49:22] this is the first row times the First [49:24] Column and because this is zeros [49:28] uh these elements here are now ignored [49:30] so we just get a two and then this [49:32] number here is the first row times the [49:35] second column and because these are [49:37] zeros they get ignored and it's just [49:39] seven this seven multiplies this one but [49:42] look what happened here because this is [49:43] one and then zeros we what ended up [49:46] happening is we're just plucking out the [49:48] row of this row of B and that's what we [49:51] got now here we have one 1 Z so here 110 [49:57] do product with these two columns will [49:59] now give us 2 + 6 which is 8 and 7 + 4 [50:02] which is 11 and because this is 111 we [50:05] ended up with the addition of all of [50:07] them and so basically depending on how [50:10] many ones and zeros we have here we are [50:12] basically doing a sum currently of a [50:16] variable number of these rows and that [50:18] gets deposited into [50:20] C So currently we're doing sums because [50:23] these are ones but we can also do [50:25] average right and you can start to see [50:27] how we could do average uh of the rows [50:29] of B uh sort of in an incremental [50:32] fashion because we don't have to we can [50:35] basically normalize these rows so that [50:37] they sum to one and then we're going to [50:39] get an average so if we took a and then [50:41] we did aals [50:43] aide torch. sum in the um of a in the um [50:51] oneth Dimension and then let's keep them [50:55] as true so so therefore the broadcasting [50:57] will work out so if I rerun this you see [51:00] now that these rows now sum to one so [51:04] this row is one this row is 0. 5.5 Z and [51:07] here we get 1/3 and now when we do a [51:09] multiply B what are we getting here we [51:12] are just getting the first row first row [51:15] here now we are getting the average of [51:18] the first two [51:20] rows okay so 2 and six average is four [51:23] and four and seven average is [51:25] 5.5 and on the bottom here we are now [51:27] getting the average of these three rows [51:31] so the average of all of elements of B [51:33] are now deposited here and so you can [51:36] see that by manipulating these uh [51:40] elements of this multiplying Matrix and [51:42] then multiplying it with any given [51:44] Matrix we can do these averages in this [51:47] incremental fashion because we just get [51:50] um and we can manipulate that based on [51:53] the elements of a okay so that's very [51:55] convenient so let's let's swing back up [51:57] here and see how we can vectorize this [51:59] and make it much more efficient using [52:00] what we've learned so in [52:03] particular we are going to produce an [52:05] array a but here I'm going to call it we [52:08] short for weights but this is our [52:11] a and this is how much of every row we [52:14] want to average up and it's going to be [52:17] an average because you can see that [52:18] these rows sum to [52:20] one so this is our a and then our B in [52:23] this example of course is X [52:27] so what's going to happen here now is [52:29] that we are going to have an expo [52:31] 2 and this Expo 2 is going to be way [52:36] multiplying [52:38] RX so let's think this true way is T BYT [52:42] and this is Matrix multiplying in [52:44] pytorch a b by T by [52:47] C and it's giving us uh different what [52:50] shape so pytorch will come here and it [52:52] will see that these shapes are not the [52:54] same so it will create a batch Dimension [52:57] here and this is a batched matrix [53:00] multiply and so it will apply this [53:02] matrix multiplication in all the batch [53:04] elements um in parallel and individually [53:08] and then for each batch element there [53:09] will be a t BYT multiplying T by C [53:12] exactly as we had [53:15] below so this will now create B by T by [53:20] C and Expo 2 will now become identical [53:24] to Expo [53:28] so we can see that torch. all close of [53:32] xbo and xbo 2 should be true [53:36] now so this kind of like convinces us [53:38] that uh these are in fact um the same so [53:43] xbo and xbo 2 if I just print [53:47] them uh okay we're not going to be able [53:49] to okay we're not going to be able to [53:51] just stare it down but [53:54] um well let me try Expo basically just [53:56] at the zeroth element and Expo two at [53:58] the zeroth element so just the first [53:59] batch and we should see that this and [54:02] that should be identical which they [54:04] are right so what happened here the [54:07] trick is we were able to use batched [54:09] Matrix multiply to do this uh [54:12] aggregation really and it's a weighted [54:15] aggregation and the weights are [54:17] specified in this um T BYT array and [54:21] we're basically doing weighted sums and [54:24] uh these weighted sums are are U [54:26] according to uh the weights inside here [54:28] they take on sort of this triangular [54:31] form and so that means that a token at [54:33] the teth dimension will only get uh sort [54:36] of um information from the um tokens [54:39] perceiving it so that's exactly what we [54:41] want and finally I would like to rewrite [54:43] it in one more way and we're going to [54:46] see why that's useful so this is the [54:48] third version and it's also identical to [54:50] the first and second but let me talk [54:53] through it it uses [54:54] softmax so Trill here is this Matrix [55:00] lower triangular [55:01] ones way begins as all [55:05] zero okay so if I just print way in the [55:07] beginning it's all zero then I [55:11] used masked fill so what this is doing [55:15] is we. masked fill it's all zeros and [55:18] I'm saying for all the elements where [55:20] Trill is equal equal Z make them be [55:23] negative Infinity so all the elements [55:26] where Trill is zero will become negative [55:28] Infinity now so this is what we get and [55:32] then the final line here is [55:36] softmax so if I take a softmax along [55:38] every single so dim is negative one so [55:40] along every single row if I do softmax [55:44] what is that going to [55:46] do well softmax is um is also like a [55:51] normalization operation right and so [55:54] spoiler alert you get the exact same [55:58] Matrix let me bring back to [56:00] softmax and recall that in softmax we're [56:02] going to exponentiate every single one [56:04] of these and then we're going to divide [56:06] by the sum and so if we exponentiate [56:10] every single element here we're going to [56:11] get a one and here we're going to get uh [56:14] basically zero 0 z0 Z everywhere else [56:17] and then when we normalize we just get [56:19] one here we're going to get one one and [56:21] then zeros and then softmax will again [56:24] divide and this will give us 5.5 and so [56:27] on and so this is also the uh the same [56:30] way to produce uh this mask now the [56:33] reason that this is a bit more [56:34] interesting and the reason we're going [56:36] to end up using it in self [56:37] attention is that these weights here [56:41] begin uh with zero and you can think of [56:44] this as like an interaction strength or [56:46] like an affinity so basically it's [56:49] telling us how much of each uh token [56:52] from the past do we want to Aggregate [56:54] and average up [56:57] and then this line is saying tokens from [56:59] the past cannot communicate by setting [57:02] them to negative Infinity we're saying [57:04] that we will not aggregate anything from [57:06] those [57:07] tokens and so basically this then goes [57:09] through softmax and through the weighted [57:11] and this is the aggregation through [57:12] matrix [57:14] multiplication and so what this is now [57:16] is you can think of these as um these [57:19] zeros are currently just set by us to be [57:21] zero but a quick preview is that these [57:25] affinities between the tokens are not [57:27] going to be just constant at zero [57:29] they're going to be data dependent these [57:31] tokens are going to start looking at [57:32] each other and some tokens will find [57:34] other tokens more or less interesting [57:37] and depending on what their values are [57:39] they're going to find each other [57:41] interesting to different amounts and I'm [57:42] going to call those affinities I think [57:45] and then here we are saying the future [57:47] cannot communicate with the past we're [57:49] we're going to clamp them and then when [57:51] we normalize and sum we're going to [57:53] aggregate uh sort of their values [57:56] depending on how interesting they find [57:57] each other and so that's the preview for [57:59] self attention and basically long story [58:03] short from this entire section is that [58:05] you can do weighted aggregations of your [58:07] past [58:08] Elements by having by using matrix [58:12] multiplication of a lower triangular [58:14] fashion and then the elements here in [58:17] the lower triangular part are telling [58:18] you how much of each element uh fuses [58:21] into this position so we're going to use [58:24] this trick now to develop the self [58:25] attention block block so first let's get [58:27] some quick preliminaries out of the way [58:30] first the thing I'm kind of bothered by [58:31] is that you see how we're passing in [58:33] vocap size into the Constructor there's [58:35] no need to do that because vocap size is [58:36] already defined uh up top as a global [58:38] variable so there's no need to pass this [58:40] stuff [58:41] around next what I want to do is I don't [58:44] want to actually create I want to create [58:46] like a level of indirection here where [58:47] we don't directly go to the embedding [58:49] for the um logits but instead we go [58:52] through this intermediate phase because [58:54] we're going to start making that bigger [58:57] so let me introduce a new variable n [58:59] embed it shorted for number of embedding [59:02] Dimensions so [59:04] nbed here will be say 32 that was a [59:09] suggestion from GitHub co-pilot by the [59:11] way um it also suest 32 which is a good [59:14] number so this is an embedding table and [59:16] only 32 dimensional [59:18] embeddings so then here this is not [59:21] going to give us logits directly instead [59:23] this is going to give us token [59:24] embeddings that's I'm going to call it [59:27] and then to go from the token Tings to [59:29] the logits we're going to need a linear [59:30] layer so self. LM head let's call it [59:34] short for language modeling head is n [59:36] and linear from n ined up to vocap size [59:39] and then when we swing over here we're [59:41] actually going to get the loits by [59:43] exactly what the co-pilot says now we [59:46] have to be careful here because this C [59:48] and this C are not equal um this is nmed [59:52] C and this is vocap size so let's just [59:55] say that n ined is equal to [59:57] C and then this just creates one spous [60:01] layer of interaction through a linear [60:02] layer but uh this should basically [60:11] run so we see that this runs and uh this [60:15] currently looks kind of spous but uh [60:17] we're going to build on top of this now [60:19] next up so far we've taken these indices [60:22] and we've encoded them based on the [60:23] identity of the uh tokens in inside idx [60:28] the next thing that people very often do [60:30] is that we're not just encoding the [60:31] identity of these tokens but also their [60:33] position so we're going to have a second [60:35] position uh embedding table here so [60:38] self. position embedding table is an an [60:41] embedding of block size by an embed and [60:44] so each position from zero to block size [60:46] minus one will also get its own [60:47] embedding vector and then here first let [60:50] me decode B BYT from idx do [60:54] shape and then here we're also going to [60:56] have a pause embedding which is the [60:58] positional embedding and these are this [61:00] is to arrange so this will be basically [61:03] just integers from Z to T minus one and [61:06] all of those integers from 0 to T minus [61:08] one get embedded through the table to [61:09] create a t by [61:11] C and then here this gets renamed to [61:14] just say x and x will be the addition of [61:18] the token embeddings with the positional [61:20] embeddings and here the broadcasting [61:22] note will work out so B by T by C plus T [61:25] by C [61:26] this gets right aligned a new dimension [61:28] of one gets added and it gets [61:30] broadcasted across [61:31] batch so at this point x holds not just [61:34] the token identities but the positions [61:37] at which these tokens occur and this is [61:39] currently not that useful because of [61:41] course we just have a simple byr model [61:43] so it doesn't matter if you're in the [61:44] fifth position the second position or [61:46] wherever it's all translation invariant [61:48] at this stage uh so this information [61:50] currently wouldn't help uh but as we [61:52] work on the self attention block we'll [61:54] see that this starts to matter [61:59] okay so now we get the Crux of self [62:01] attention so this is probably the most [62:03] important part of this video to [62:05] understand we're going to implement a [62:07] small self attention for a single [62:08] individual head as they're called so we [62:11] start off with where we were so all of [62:13] this code is familiar so right now I'm [62:16] working with an example where I Chang [62:17] the number of channels from 2 to 32 so [62:20] we have a 4x8 arrangement of tokens and [62:24] each to and the information each token [62:26] is currently 32 dimensional but we just [62:28] are working with random [62:30] numbers now we saw here that the code as [62:34] we had it before does a uh simple weight [62:37] simple average of all the past tokens [62:41] and the current token so it's just the [62:43] previous information and current [62:44] information is just being mixed together [62:45] in an average and that's what this code [62:48] currently achieves and it Doo by [62:50] creating this lower triangular structure [62:52] which allows us to mask out this uh we [62:55] uh Matrix that we create so we mask it [62:59] out and then we normalize it and [63:01] currently when we initialize the [63:03] affinities between all the different [63:05] sort of tokens or nodes I'm going to use [63:08] those terms [63:09] interchangeably so when we initialize [63:11] the affinities between all the different [63:13] tokens to be zero then we see that way [63:16] gives us this um structure where every [63:18] single row has these um uniform numbers [63:22] and so that's what that's what then uh [63:25] in this Matrix multiply makes it so that [63:27] we're doing a simple [63:28] average now we don't actually want this [63:32] to be all uniform because different uh [63:36] tokens will find different other tokens [63:38] more or less interesting and we want [63:40] that to be data dependent so for example [63:42] if I'm a vowel then maybe I'm looking [63:44] for consonants in my past and maybe I [63:46] want to know what those consonants are [63:48] and I want that information to flow to [63:50] me and so I want to now gather [63:52] information from the past but I want to [63:54] do it in the data dependent way and this [63:56] is the problem that self attention [63:58] solves now the way self attention solves [64:00] this is the following every single node [64:03] or every single token at each position [64:06] will emit two vectors it will emit a [64:09] query and it will emit a [64:12] key now the query Vector roughly [64:15] speaking is what am I looking for and [64:18] the key Vector roughly speaking is what [64:20] do I [64:21] contain and then the way we get [64:24] affinities between these uh tokens now [64:27] in a sequence is we basically just do a [64:29] do product between the keys and the [64:31] queries so my query dot products with [64:35] all the keys of all the other tokens and [64:37] that dot product now becomes [64:41] wayy and so um if the key and the query [64:45] are sort of aligned they will interact [64:47] to a very high amount and then I will [64:50] get to learn more about that specific [64:52] token as opposed to any other token in [64:55] the sequence [64:56] so let's implement this [65:00] now we're going to implement a [65:03] single what's called head of self [65:07] attention so this is just one head [65:09] there's a hyper parameter involved with [65:10] these heads which is the head size and [65:13] then here I'm initializing linear [65:15] modules and I'm using bias equals false [65:18] so these are just going to apply a [65:19] matrix multiply with some fixed [65:21] weights and now let me produce a key and [65:26] q k and Q by forwarding these modules on [65:29] X so the size of this will now [65:32] become B by T by 16 because that is the [65:36] head size and the same here B by T by [65:44] 16 so this being the head size so you [65:47] see here that when I forward this linear [65:49] on top of my X all the tokens in all the [65:52] positions in the B BYT Arrangement all [65:55] of them them in parallel and [65:57] independently produce a key and a query [65:59] so no communication has happened [66:01] yet but the communication comes now all [66:04] the queries will do product with all the [66:07] keys so basically what we want is we [66:09] want way now or the affinities between [66:12] these to be query multiplying key but we [66:16] have to be careful with uh we can't [66:18] Matrix multiply this we actually need to [66:20] transpose uh K but we have to be also [66:23] careful because these are when you have [66:25] The Bash Dimension so in particular we [66:27] want to transpose uh the last two [66:30] dimensions dimension1 and dimension -2 [66:33] so [66:36] -21 and so this Matrix multiply now will [66:40] basically do the following B by T by [66:44] 16 Matrix multiplies B by 16 by T to [66:49] give us B by T by [66:53] T right [66:56] so for every row of B we're now going to [66:58] have a t Square Matrix giving us the [67:01] affinities and these are now the way so [67:04] they're not zeros they are now coming [67:06] from this dot product between the keys [67:08] and the queries so this can now run I [67:11] can I can run this and the weighted [67:13] aggregation now is a function in a data [67:16] Bandon manner between the keys and [67:18] queries of these nodes so just [67:20] inspecting what happened [67:22] here the way takes on this form [67:26] and you see that before way was uh just [67:29] a constant so it was applied in the same [67:31] way to all the batch elements but now [67:33] every single batch elements will have [67:34] different sort of we because uh every [67:37] single batch element contains different [67:39] uh tokens at different positions and so [67:41] this is not data dependent so when we [67:44] look at just the zeroth uh Row for [67:47] example in the input these are the [67:49] weights that came out and so you can see [67:51] now that they're not just exactly [67:53] uniform um and in particular as an [67:55] example here for the last row this was [67:58] the eighth token and the eighth token [68:00] knows what content it has and it knows [68:02] at what position it's in and now the E [68:04] token based on that uh creates a query [68:08] hey I'm looking for this kind of stuff [68:10] um I'm a vowel I'm on the E position I'm [68:12] looking for any consonant at positions [68:14] up to four and then all the nodes get to [68:18] emit keys and maybe one of the channels [68:20] could be I am a I am a consonant and I [68:23] am in a position up to four and that [68:25] that key would have a high number in [68:27] that specific Channel and that's how the [68:29] query and the key when they do product [68:31] they can find each other and create a [68:33] high affinity and when they have a high [68:35] Affinity like say uh this token was [68:38] pretty interesting to uh to this eighth [68:41] token when they have a high Affinity [68:43] then through the softmax I will end up [68:45] aggregating a lot of its information [68:47] into my position and so I'll get to [68:49] learn a lot about [68:51] it now just this we're looking at way [68:55] after this has already happened um let [68:59] me erase this operation as well so let [69:01] me erase the masking and the softmax [69:03] just to show you the under the hood [69:04] internals and how that works so without [69:07] the masking in the softmax Whey comes [69:09] out like this right this is the outputs [69:11] of the do products um and these are the [69:14] raw outputs and they take on values from [69:15] negative you know two to positive two [69:18] Etc so that's the raw interactions and [69:21] raw affinities between all the nodes but [69:24] now if I'm going if I'm a fifth node I [69:26] will not want to aggregate anything from [69:28] the sixth node seventh node and the [69:30] eighth node so actually we use the upper [69:32] triangular masking so those are not [69:35] allowed to [69:37] communicate and now we actually want to [69:40] have a nice uh distribution uh so we [69:42] don't want to aggregate negative .11 of [69:45] this node that's crazy so instead we [69:47] exponentiate and normalize and now we [69:49] get a nice distribution that sums to one [69:51] and this is telling us now in the data [69:52] dependent manner how much of information [69:54] to aggregate from any of these tokens in [69:56] the [69:58] past so that's way and it's not zeros [70:01] anymore but but it's calculated in this [70:04] way now there's one more uh part to a [70:08] single self attention head and that is [70:10] that when we do the aggregation we don't [70:12] actually aggregate the tokens exactly we [70:15] aggregate we produce one more value here [70:17] and we call that the [70:20] value so in the same way that we [70:22] produced p and query we're also going to [70:23] create a value [70:26] and [70:26] then here we don't [70:30] aggregate X we calculate a v which is [70:34] just achieved by uh propagating this [70:37] linear on top of X again and then we [70:40] output way multiplied by V so V is the [70:44] elements that we aggregate or the the [70:46] vectors that we aggregate instead of the [70:47] raw [70:48] X and now of course uh this will make it [70:51] so that the output here of this single [70:53] head will be 16 dimensional because that [70:55] is the head [70:57] size so you can think of X as kind of [70:59] like private information to this token [71:01] if you if you think about it that way so [71:03] X is kind of private to this token so [71:06] I'm a fifth token at some and I have [71:08] some identity and uh my information is [71:11] kept in Vector X and now for the [71:14] purposes of the single head here's what [71:16] I'm interested in here's what I have and [71:20] if you find me interesting here's what I [71:21] will communicate to you and that's [71:23] stored in v and so V is the thing that [71:26] gets aggregated for the purposes of this [71:28] single head between the different [71:30] notes and that's uh basically the self [71:34] attention mechanism this is this is what [71:36] it does there are a few notes that I [71:39] would make like to make about attention [71:41] number one attention is a communication [71:44] mechanism you can really think about it [71:46] as a communication mechanism where you [71:48] have a number of nodes in a directed [71:50] graph where basically you have edges [71:52] pointed between noes like [71:53] this and what happens is every node has [71:56] some Vector of information and it gets [71:58] to aggregate information via a weighted [72:01] sum from all of the nodes that point to [72:03] it and this is done in a data dependent [72:06] manner so depending on whatever data is [72:08] actually stored that you should not at [72:09] any point in time now our graph doesn't [72:13] look like this our graph has a different [72:15] structure we have eight nodes because [72:17] the block size is eight and there's [72:18] always eight to [72:20] tokens and uh the first node is only [72:23] pointed to by itself the second node is [72:25] pointed to by the first node and itself [72:27] all the way up to the eighth node which [72:29] is pointed to by all the previous nodes [72:32] and itself and so that's the structure [72:34] that our directed graph has or happens [72:37] happens to have in Auto regressive sort [72:38] of scenario like language modeling but [72:41] in principle attention can be applied to [72:42] any arbitrary directed graph and it's [72:44] just a communication mechanism between [72:46] the nodes the second note is that notice [72:48] that there is no notion of space so [72:51] attention simply acts over like a set of [72:53] vectors in this graph and so by default [72:56] these nodes have no idea where they are [72:58] positioned in the space and that's why [72:59] we need to encode them positionally and [73:02] sort of give them some information that [73:03] is anchored to a specific position so [73:05] that they sort of know where they are [73:08] and this is different than for example [73:09] from convolution because if you're run [73:11] for example a convolution operation over [73:13] some input there's a very specific sort [73:15] of layout of the information in space [73:18] and the convolutional filters sort of [73:20] act in space and so it's it's not like [73:23] an attention in ATT ention is just a set [73:26] of vectors out there in space they [73:27] communicate and if you want them to have [73:29] a notion of space you need to [73:31] specifically add it which is what we've [73:33] done when we calculated the um relative [73:36] the positional encode encodings and [73:38] added that information to the vectors [73:40] the next thing that I hope is very clear [73:41] is that the elements across the batch [73:43] Dimension which are independent examples [73:45] never talk to each other they're always [73:47] processed independently and this is a [73:49] batched matrix multiply that applies [73:51] basically a matrix multiplication uh [73:53] kind of in parallel across the batch [73:54] dimension so maybe it would be more [73:56] accurate to say that in this analogy of [73:58] a directed graph we really have because [74:00] the back size is four we really have [74:03] four separate pools of eight nodes and [74:05] those eight nodes only talk to each [74:07] other but in total there's like 32 nodes [74:08] that are being processed uh but there's [74:11] um sort of four separate pools of eight [74:13] you can look at it that way the next [74:15] note is that here in the case of [74:18] language modeling uh we have this [74:20] specific uh structure of directed graph [74:22] where the future tokens will not [74:24] communicate to the Past tokens but this [74:27] doesn't necessarily have to be the [74:28] constraint in the general case and in [74:30] fact in many cases you may want to have [74:32] all of the uh noes talk to each other uh [74:35] fully so as an example if you're doing [74:37] sentiment analysis or something like [74:38] that with a Transformer you might have a [74:40] number of tokens and you may want to [74:42] have them all talk to each other fully [74:45] because later you are predicting for [74:46] example the sentiment of the sentence [74:49] and so it's okay for these NOS to talk [74:50] to each other and so in those cases you [74:53] will use an encoder block of self [74:55] attention and uh all it means that it's [74:58] an encoder block is that you will delete [75:00] this line of code allowing all the noes [75:02] to completely talk to each other what [75:04] we're implementing here is sometimes [75:06] called a decoder block and it's called a [75:09] decoder because it is sort of like a [75:12] decoding language and it's got this [75:15] autor regressive format where you have [75:17] to mask with the Triangular Matrix so [75:19] that uh nodes from the future never talk [75:22] to the Past because they would give away [75:24] the answer [75:25] and so basically in encoder blocks you [75:27] would delete this allow all the noes to [75:29] talk in decoder blocks this will always [75:31] be present so that you have this [75:33] triangular structure uh but both are [75:35] allowed and attention doesn't care [75:36] attention supports arbitrary [75:38] connectivity between nodes the next [75:40] thing I wanted to comment on is you keep [75:41] me you keep hearing me say attention [75:43] self attention Etc there's actually also [75:45] something called cross attention what is [75:47] the [75:47] difference [75:49] so basically the reason this attention [75:52] is self attention is because because the [75:55] keys queries and the values are all [75:57] coming from the same Source from X so [76:01] the same Source X produces Keys queries [76:03] and values so these nodes are self [76:05] attending but in principle attention is [76:08] much more General than that so for [76:10] example an encoder decoder Transformers [76:12] uh you can have a case where the queries [76:15] are produced from X but the keys and the [76:17] values come from a whole separate [76:18] external source and sometimes from uh [76:21] encoder blocks that encode some context [76:23] that we'd like to condition on [76:25] and so the keys and the values will [76:26] actually come from a whole separate [76:28] Source those are nodes on the side and [76:31] here we're just producing queries and [76:32] we're reading off information from the [76:34] side so cross attention is used when [76:37] there's a separate source of nodes we'd [76:40] like to pull information from into our [76:42] nodes and it's self attention if we just [76:45] have nodes that would like to look at [76:46] each other and talk to each other so [76:48] this attention here happens to be self [76:51] attention but in principle um attention [76:55] is a lot more General okay and the last [76:57] note at this stage is if we come to the [76:59] attention is all need paper here we've [77:01] already implemented attention so given [77:03] query key and value we've U multiplied [77:06] the query and a key we've soft maxed it [77:09] and then we are aggregating the values [77:11] there's one more thing that we're [77:12] missing here which is the dividing by [77:13] one / square root of the head size the [77:16] DK here is the head size why are they [77:18] doing this finds this important so they [77:21] call it the scaled attention and it's [77:24] kind of like an important normalization [77:25] to basically [77:26] have the problem is if you have unit gsh [77:29] and inputs so zero mean unit variance K [77:32] and Q are unit gashin then if you just [77:34] do we naively then you see that your we [77:37] actually will be uh the variance will be [77:38] on the order of head size which in our [77:40] case is 16 but if you multiply by one [77:43] over head size square root so this is [77:45] square root and this is one [77:47] over then the variance of we will be one [77:50] so it will be [77:52] preserved now why is this important [77:54] you'll not notice that way [77:56] here will feed into [77:58] softmax and so it's really important [78:00] especially at initialization that we be [78:03] fairly diffuse so in our case here we [78:06] sort of locked out here and we had a [78:10] fairly diffuse numbers here so um like [78:13] this now the problem is that because of [78:15] softmax if weight takes on very positive [78:18] and very negative numbers inside it [78:20] softmax will actually converge towards [78:22] one hot vectors and so I can illustrate [78:25] that here um say we are applying softmax [78:29] to a tensor of values that are very [78:31] close to zero then we're going to get a [78:33] diffuse thing out of [78:34] softmax but the moment I take the exact [78:36] same thing and I start sharpening it [78:38] making it bigger by multiplying these [78:40] numbers by eight for example you'll see [78:42] that the softmax will start to sharpen [78:44] and in fact it will sharpen towards the [78:46] max so it will sharpen towards whatever [78:48] number here is the highest and so um [78:51] basically we don't want these values to [78:52] be too extreme especially at [78:53] initialization otherwise softmax will be [78:55] way too peaky and um you're basically [78:58] aggregating um information from like a [79:01] single node every node just agregates [79:03] information from a single other node [79:04] that's not what we want especially at [79:06] initialization and so the scaling is [79:08] used just to control the variance at [79:11] initialization okay so having said all [79:13] that let's now take our self attention [79:15] knowledge and let's uh take it for a [79:17] spin so here in the code I created this [79:19] head module and it implements a single [79:22] head of self attention so you give it a [79:24] head size and then here it creates the [79:26] key query and the value linear layers [79:29] typically people don't use biases in [79:31] these uh so those are the linear [79:33] projections that we're going to apply to [79:34] all of our nodes now here I'm creating [79:37] this Trill variable Trill is not a [79:40] parameter of the module so in sort of [79:41] pytorch naming conventions uh this is [79:43] called a buffer it's not a parameter and [79:46] you have to call it you have to assign [79:47] it to the module using a register buffer [79:49] so that creates the trill uh the triang [79:52] lower triangular Matrix and we're given [79:55] the input X this should look very [79:56] familiar now we calculate the keys the [79:58] queries we C calculate the attention [80:00] scores inside way uh we normalize it so [80:03] we're using scaled attention here then [80:06] we make sure that uh future doesn't [80:08] communicate with the past so this makes [80:10] it a decoder block and then softmax and [80:13] then aggregate the value and [80:15] output then here in the language model [80:17] I'm creating a head in the Constructor [80:20] and I'm calling it self attention head [80:22] and the head size I'm going to keep as [80:24] the same and embed just for [80:27] now and then here once we've encoded the [80:31] information with the token embeddings [80:32] and the position embeddings we're simply [80:34] going to feed it into the self attention [80:36] head and then the output of that is [80:38] going to go into uh the decoder language [80:42] modeling head and create the logits so [80:44] this the sort of the simplest way to [80:46] plug in a self attention component uh [80:49] into our Network right now I had to make [80:51] one more change which is that here in [80:55] the generate uh we have to make sure [80:57] that our idx that we feed into the model [81:01] because now we're using positional [81:02] embeddings we can never have more than [81:04] block size coming in because if idx is [81:07] more than block size then our position [81:09] embedding table is going to run out of [81:11] scope because it only has embeddings for [81:12] up to block size and so therefore I [81:15] added some uh code here to crop the [81:17] context that we're going to feed into [81:20] self um so that uh we never pass in more [81:23] than block siiz elements [81:25] so those are the changes and let's Now [81:27] train the network okay so I also came up [81:29] to the script here and I decreased the [81:30] learning rate because uh the self [81:32] attention can't tolerate very very high [81:34] learning rates and then I also increased [81:36] number of iterations because the [81:37] learning rate is lower and then I [81:39] trained it and previously we were only [81:41] able to get to up to 2.5 and now we are [81:43] down to 2.4 so we definitely see a [81:46] little bit of an improvement from 2.5 to [81:48] 2.4 roughly uh but the text is still not [81:51] amazing so clearly the self attention [81:53] head is doing some useful communication [81:56] but um we still have a long way to go [81:59] okay so now we've implemented the scale. [82:01] product attention now next up and the [82:02] attention is all you need paper there's [82:05] something called multi-head attention [82:07] and what is multi-head attention it's [82:09] just applying multiple attentions in [82:11] parallel and concatenating their results [82:13] so they have a little bit of diagram [82:15] here I don't know if this is super clear [82:18] it's really just multiple attentions in [82:20] parallel so let's Implement that fairly [82:23] straightforward [82:25] if we want a multi-head attention then [82:27] we want multiple heads of self attention [82:28] running in parallel so in pytorch we can [82:32] do this by simply creating multiple [82:35] heads so however heads how however many [82:38] heads you want and then what is the head [82:39] size of each and then we run all of them [82:43] in parallel into a list and simply [82:46] concatenate all of the outputs and we're [82:48] concatenating over the channel [82:50] Dimension so the way this looks now is [82:53] we don't have just a single ATT [82:56] that uh has a hit size of 32 because [82:59] remember n Ed is [83:00] 32 instead of having one Communication [83:03] channel we now have four communication [83:06] channels in parallel and each one of [83:08] these communication channels typically [83:10] will be uh smaller uh correspondingly so [83:14] because we have four communication [83:15] channels we want eight dimensional self [83:18] attention and so from each Communication [83:20] channel we're going to together eight [83:22] dimensional vectors and then we have [83:23] four of them and that concatenates to [83:25] give us 32 which is the original and [83:28] embed and so this is kind of similar to [83:30] um if you're familiar with convolutions [83:32] this is kind of like a group convolution [83:34] uh because basically instead of having [83:36] one large convolution we do convolution [83:38] in groups and uh that's multi-headed [83:40] self [83:41] attention and so then here we just use [83:44] essay heads self attention heads instead [83:47] now I actually ran it and uh scrolling [83:51] down I ran the same thing and then we [83:53] now get this down to 2.28 roughly and [83:57] the output is still the generation is [83:58] still not amazing but clearly the [84:00] validation loss is improving because we [84:02] were at 2.4 just now and so it helps to [84:05] have multiple communication channels [84:07] because obviously these tokens have a [84:09] lot to talk about they want to find the [84:11] consonants the vowels they want to find [84:13] the vowels just from certain positions [84:15] uh they want to find any kinds of [84:17] different things and so it helps to [84:19] create multiple independent channels of [84:20] communication gather lots of different [84:22] types of data and then uh decode the [84:25] output now going back to the paper for a [84:27] second of course I didn't explain this [84:28] figure in full detail but we are [84:30] starting to see some components of what [84:32] we've already implemented we have the [84:33] positional encodings the token encodings [84:35] that add we have the masked multi-headed [84:37] attention implemented now here's another [84:41] multi-headed attention which is a cross [84:42] attention to an encoder which we haven't [84:45] we're not going to implement in this [84:46] case I'm going to come back to that [84:48] later but I want you to notice that [84:50] there's a feed forward part here and [84:52] then this is grouped into a block that [84:53] gets repeat it again and again now the [84:56] feedforward part here is just a simple [84:57] uh multi-layer perceptron [85:00] um so the multi-headed so here position [85:04] wise feed forward networks is just a [85:06] simple little MLP so I want to start [85:08] basically in a similar fashion also [85:10] adding computation into the network and [85:13] this computation is on a per node level [85:16] so I've already implemented it and you [85:18] can see the diff highlighted on the left [85:20] here when I've added or changed things [85:22] now before we had the self multi-headed [85:25] self attention that did the [85:26] communication but we went way too fast [85:28] to calculate the logits so the tokens [85:31] looked at each other but didn't really [85:32] have a lot of time to think on what they [85:35] found from the other tokens and so what [85:38] I've implemented here is a little feet [85:40] forward single layer and this little [85:42] layer is just a linear followed by a Rel [85:45] nonlinearity and that's that's it so [85:48] it's just a little layer and then I call [85:50] it feed [85:52] forward um and embed [85:54] and then this feed forward is just [85:56] called sequentially right after the self [85:58] attention so we self attend then we feed [86:01] forward and you'll notice that the feet [86:02] forward here when it's applying linear [86:04] this is on a per token level all the [86:06] tokens do this independently so the self [86:09] attention is the communication and then [86:11] once they've gathered all the data now [86:13] they need to think on that data [86:15] individually and so that's what feed [86:16] forward is doing and that's why I've [86:18] added it here now when I train this the [86:21] validation LW actually continues to go [86:23] down now to 2. 24 which is down from [86:26] 2.28 uh the output still look kind of [86:28] terrible but at least we've improved the [86:31] situation and so as a preview we're [86:34] going to now start to intersperse the [86:37] communication with the computation and [86:39] that's also what the Transformer does [86:42] when it has blocks that communicate and [86:44] then compute and it groups them and [86:46] replicates them okay so let me show you [86:49] what we'd like to do we'd like to do [86:51] something like this we have a block and [86:53] this block is is basically this part [86:55] here except for the cross [86:57] attention now the block basically [86:59] intersperses communication and then [87:01] computation the computation the [87:03] communication is done using multi-headed [87:05] selfelf attention and then the [87:07] computation is done using a feed forward [87:08] Network on all the tokens [87:11] independently now what I've added here [87:14] also is you'll [87:16] notice this takes the number of [87:18] embeddings in the embedding Dimension [87:19] and number of heads that we would like [87:21] which is kind of like group size in [87:22] group convolution and and I'm saying [87:24] that number of heads we'd like is four [87:26] and so because this is 32 we calculate [87:29] that because this is 32 the number of [87:31] heads should be four um the head size [87:34] should be eight so that everything sort [87:36] of works out Channel wise um so this is [87:39] how the Transformer structures uh sort [87:41] of the uh the sizes typically so the [87:44] head size will become eight and then [87:45] this is how we want to intersperse them [87:47] and then here I'm trying to create [87:49] blocks which is just a sequential [87:51] application of block block block so that [87:53] we're interspersing communication feed [87:55] forward many many times and then finally [87:57] we decode now I actually tried to run [88:01] this and the problem is this doesn't [88:02] actually give a very good uh answer and [88:05] very good result and the reason for that [88:07] is we're start starting to actually get [88:09] like a pretty deep neural net and deep [88:11] neural Nets uh suffer from optimization [88:13] issues and I think that's what we're [88:14] kind of like slightly starting to run [88:16] into so we need one more idea that we [88:18] can borrow from the um Transformer paper [88:21] to resolve those difficulties now there [88:23] are two optimizations that dramatically [88:25] help with the depth of these networks [88:27] and make sure that the networks remain [88:29] optimizable let's talk about the first [88:31] one the first one in this diagram is you [88:33] see this Arrow here and then this arrow [88:36] and this Arrow those are skip [88:38] connections or sometimes called residual [88:40] connections they come from this paper uh [88:43] the presidual learning for image [88:44] recognition from about [88:46] 2015 uh that introduced the concept now [88:51] these are basically what it means is you [88:53] transform data but then you have a skip [88:55] connection with addition from the [88:57] previous features now the way I like to [89:00] visualize it uh that I prefer is the [89:03] following here the computation happens [89:05] from the top to bottom and basically you [89:08] have this uh residual pathway and you [89:11] are free to Fork off from the residual [89:13] pathway perform some computation and [89:15] then project back to the residual [89:16] pathway via addition and so you go from [89:19] the the uh inputs to the targets only [89:22] via plus and plus plus and the reason [89:25] this is useful is because during back [89:27] propagation remember from our microG [89:29] grad video earlier addition distributes [89:32] gradients equally to both of its [89:34] branches that that fed as the input and [89:37] so the supervision or the gradients from [89:40] the loss basically hop through every [89:43] addition node all the way to the input [89:46] and then also Fork off into the residual [89:50] blocks but basically you have this [89:52] gradient Super Highway that goes [89:53] directly from the supervision all the [89:55] way to the input unimpeded and then [89:58] these viral blocks are usually [89:59] initialized in the beginning so they [90:01] contribute very very little if anything [90:03] to the residual pathway they they are [90:05] initialized that way so in the beginning [90:07] they are sort of almost kind of like not [90:09] there but then during the optimization [90:11] they come online over time and they uh [90:14] start to contribute but at least at the [90:17] initialization you can go from directly [90:19] supervision to the input gradient is [90:21] unimpeded and just flows and then the [90:23] blocks over time [90:24] kick in and so that dramatically helps [90:27] with the optimization so let's implement [90:29] this so coming back to our block here [90:31] basically what we want to do is we want [90:33] to do xal [90:35] X+ self attention and xal X+ self. feed [90:39] forward so this is X and then we Fork [90:43] off and do some communication and come [90:45] back and we Fork off and we do some [90:46] computation and come back so those are [90:49] residual connections and then swinging [90:51] back up here we also have to introd use [90:54] this projection so nn. [90:57] linear and uh this is going to be [91:00] from after we concatenate this this is [91:03] the prze and embed so this is the output [91:05] of the self tension itself but then we [91:08] actually want the uh to apply the [91:11] projection and that's the [91:13] result so the projection is just a [91:15] linear transformation of the outcome of [91:16] this [91:17] layer so that's the projection back into [91:20] the virual pathway and then here in a [91:22] feet forward it's going to be the same [91:23] same thing I could have a a self doot [91:26] projection here as well but let me just [91:28] simplify it and let me uh couple it [91:32] inside the same sequential container and [91:34] so this is the projection layer going [91:36] back into the residual [91:38] pathway and [91:40] so that's uh well that's it so now we [91:43] can train this so I implemented one more [91:44] small change when you look into the [91:47] paper again you see that the [91:49] dimensionality of input and output is [91:51] 512 for them and they're saying that the [91:53] inner layer here in the feet forward has [91:55] dimensionality of 248 so there's a [91:57] multiplier of four and so the inner [92:00] layer of the feet forward Network should [92:02] be multiplied by four in terms of [92:04] Channel sizes so I came here and I [92:06] multiplied four times embed here for the [92:08] feed forward and then from four times [92:10] nmed coming back down to nmed when we go [92:13] back to the pro uh to the projection so [92:15] adding a bit of computation here and [92:17] growing that layer that is in the [92:19] residual block on the side of the [92:21] residual [92:22] pathway and then I train this and we [92:24] actually get down all the way to uh 2.08 [92:27] validation loss and we also see that [92:29] network is starting to get big enough [92:30] that our train loss is getting ahead of [92:32] validation loss so we're starting to see [92:33] like a little bit of [92:35] overfitting and um our our [92:38] um uh Generations here are still not [92:41] amazing but at least you see that we can [92:42] see like is here this now grief syn like [92:46] this starts to almost look like English [92:48] so um yeah we're starting to really get [92:50] there okay and the second Innovation [92:52] that is very helpful for optimizing very [92:54] deep neural networks is right here so we [92:57] have this addition now that's the [92:58] residual part but this Norm is referring [93:00] to something called layer Norm so layer [93:03] Norm is implemented in pytorch it's a [93:04] paper that came out a while back here [93:09] um and layer Norm is very very similar [93:11] to bash Norm so remember back to our [93:14] make more series part three we [93:16] implemented bash [93:17] normalization and uh bash normalization [93:19] basically just made sure that um Across [93:22] The Bash dimension any individual neuron [93:25] had unit uh Gan um distribution so it [93:30] was zero mean and unit standard [93:32] deviation one standard deviation output [93:35] so what I did here is I'm copy pasting [93:37] the bashor 1D that we developed in our [93:39] make more series and see here we can [93:42] initialize for example this module and [93:44] we can have a batch of 32 100 [93:47] dimensional vectors feeding through the [93:48] bachor layer so what this does is it [93:52] guarantees that when we look at just the [93:54] zeroth column it's a zero mean one [93:58] standard deviation so it's normalizing [94:00] every single column of this uh input now [94:04] the rows are not uh going to be [94:06] normalized by default because we're just [94:08] normalizing columns so let's now [94:10] Implement layer Norm uh it's very [94:12] complicated look we come here we change [94:15] this from zero to one so we don't [94:18] normalize The Columns we normalize the [94:20] rows and now we've implemented layer [94:23] Norm [94:25] so now the columns are not going to be [94:28] normalized um but the rows are going to [94:31] be normalized for every individual [94:33] example it's 100 dimensional Vector is [94:35] normalized uh in this way and because [94:38] our computation Now does not span across [94:40] examples we can delete all of this [94:43] buffers stuff uh because uh we can [94:45] always apply this operation and don't [94:48] need to maintain any running buffers so [94:50] we don't need the [94:52] buffers uh we [94:54] don't There's no distinction between [94:56] training and test [94:58] time uh and we don't need these running [95:00] buffers we do keep gamma and beta we [95:03] don't need the momentum we don't care if [95:05] it's training or not and this is now a [95:08] layer [95:09] norm and it normalizes the rows instead [95:12] of the columns and this here is [95:15] identical to basically this here so [95:19] let's now Implement layer Norm in our [95:21] Transformer before I incorporate the [95:23] layer Norm I just wanted to note that as [95:25] I said very few details about the [95:27] Transformer have changed in the last 5 [95:28] years but this is actually something [95:30] that slightly departs from the original [95:31] paper you see that the ADD and Norm is [95:34] applied after the [95:36] transformation but um in now it is a bit [95:40] more uh basically common to apply the [95:42] layer Norm before the transformation so [95:44] there's a reshuffling of the layer Norms [95:46] uh so this is called the prorm [95:48] formulation and that's the one that [95:49] we're going to implement as well so [95:50] select deviation from the original paper [95:53] basically we need two layer Norms layer [95:55] Norm one is uh NN do layer norm and we [95:59] tell it how many um what is the [96:01] embedding Dimension and we need the [96:03] second layer norm and then here the [96:06] layer Norms are applied immediately on X [96:09] so self. layer Norm one applied on X and [96:13] self. layer Norm two applied on X before [96:15] it goes into self attention and feed [96:18] forward and uh the size of the layer [96:20] Norm here is an ed so 32 so when the [96:23] layer Norm is normalizing our features [96:26] it is uh the normalization here uh [96:30] happens the mean and the variance are [96:32] taken over 32 numbers so the batch and [96:34] the time act as batch Dimensions both of [96:37] them so this is kind of like a per token [96:40] um transformation that just normalizes [96:42] the features and makes them a unit mean [96:46] uh unit Gan at [96:48] initialization but of course because [96:50] these layer Norms inside it have these [96:52] gamma and beta training [96:54] parameters uh the layer Norm will U [96:57] eventually create outputs that might not [96:59] be unit gion but the optimization will [97:01] determine that so for now this is the uh [97:05] this is incorporating the layer norms [97:06] and let's train them on okay so I let it [97:09] run and we see that we get down to 2.06 [97:12] which is better than the previous 2.08 [97:14] so a slight Improvement by adding the [97:15] layer norms and I'd expect that they [97:17] help uh even more if we had bigger and [97:19] deeper Network one more thing I forgot [97:21] to add is that there should be a layer [97:23] Norm here also typically as at the end [97:26] of the Transformer and right before the [97:28] final uh linear layer that decodes into [97:31] vocabulary so I added that as well so at [97:35] this stage we actually have a pretty [97:36] complete uh Transformer according to the [97:38] original paper and it's a decoder only [97:40] Transformer I'll I'll talk about that in [97:42] a second uh but at this stage uh the [97:44] major pieces are in place so we can try [97:46] to scale this up and see how well we can [97:47] push this number now in order to scale [97:50] out the model I had to perform some [97:51] cosmetic changes here to make it nicer [97:54] so I introduced this variable called n [97:56] layer which just specifies how many [97:57] layers of the blocks we're going to have [98:01] I created a bunch of blocks and we have [98:02] a new variable number of heads as well I [98:05] pulled out the layer Norm here and uh so [98:07] this is identical now one thing that I [98:10] did briefly change is I added a Dropout [98:13] so Dropout is something that you can add [98:15] right before the residual connection [98:17] back right before the connection back [98:19] into the residual pathway so we can drop [98:22] out that as l layer here we can drop out [98:26] uh here at the end of the multi-headed [98:27] exension as well and we can also drop [98:30] out here uh when we calculate the um [98:34] basically affinities and after the [98:36] softmax we can drop out some of those so [98:38] we can randomly prevent some of the [98:40] nodes from [98:41] communicating and so Dropout uh comes [98:43] from this paper from 2014 or so and [98:49] basically it takes your neural [98:50] nut and it randomly every forward [98:53] backward pass shuts off some subset of [98:56] uh neurons so randomly drops them to [98:59] zero and trains without them and what [99:02] this does effectively is because the [99:04] mask of what's being dropped out is [99:06] changed every single forward backward [99:07] pass it ends up kind of uh training an [99:11] ensemble of sub networks and then at [99:13] test time everything is fully enabled [99:15] and kind of all of those sub networks [99:16] are merged into a single Ensemble if you [99:18] can if you want to think about it that [99:20] way so I would read the paper to get the [99:22] full detail for now we're just going to [99:24] stay on the level of this is a [99:25] regularization technique and I added it [99:28] because I'm about to scale up the model [99:30] quite a bit and I was concerned about [99:32] overfitting so now when we scroll up to [99:34] the top uh we'll see that I changed a [99:36] number of hyper parameters here about [99:38] our neural nut so I made the batch size [99:40] be much larger now it's 64 I changed the [99:43] block size to be 256 so previously it [99:46] was just eight eight characters of [99:47] context now it is 256 characters of [99:50] context to predict the 257th [99:54] uh I brought down the learning rate a [99:55] little bit because the neural net is now [99:57] much bigger so I brought down the [99:58] learning rate the embedding Dimension is [100:01] now 384 and there are six heads so 384 [100:05] divide 6 means that every head is 64 [100:08] dimensional as it as a standard and then [100:11] there's going to be six layers of that [100:13] and the Dropout will be at 02 so every [100:15] forward backward pass 20% of all of [100:18] these um intermediate calculations are [100:21] disabled and dropped to zero [100:24] and then I already trained this and I [100:25] ran it so uh drum roll how well does it [100:28] perform so let me just scroll up [100:31] here we get a validation loss of [100:34] 1.48 which is actually quite a bit of an [100:37] improvement on what we had before which [100:38] I think was 2.07 so it went from 2.07 [100:41] all the way down to 1.48 just by scaling [100:43] up this neural nut with the code that we [100:45] have and this of course ran for a lot [100:47] longer this maybe trained for I want to [100:49] say about 15 minutes on my a100 GPU so [100:52] that's a pretty a GPU and if you don't [100:54] have a GPU you're not going to be able [100:56] to reproduce this uh on a CPU this would [100:59] be um I would not run this on a CPU or [101:01] MacBook or something like that you'll [101:03] have to Brak down the number of uh [101:04] layers and the embedding Dimension and [101:06] so on uh but in about 15 minutes we can [101:09] get this kind of a result and um I'm [101:12] printing some of the Shakespeare here [101:15] but what I did also is I printed 10,000 [101:17] characters so a lot more and I wrote [101:18] them to a file and so here we see some [101:21] of the outputs [101:24] so it's a lot more recognizable as the [101:26] input text file so the input text file [101:29] just for reference looked like this so [101:31] there's always like someone speaking in [101:33] this manner and uh our predictions now [101:37] take on that form except of course [101:40] they're they're nonsensical when you [101:41] actually read them [101:43] so it is every crimp tap be a house oh [101:47] those [101:48] prepation we give [101:51] heed um you know [101:56] Oho sent me you mighty [101:59] Lord anyway so you can read through this [102:02] um it's nonsensical of course but this [102:04] is just a Transformer trained on a [102:06] character level for 1 million characters [102:09] that come from Shakespeare so there's [102:10] sort of like blabbers on in Shakespeare [102:12] like manner but it doesn't of course [102:14] make sense at this scale uh but I think [102:18] I think still a pretty good [102:19] demonstration of what's [102:20] possible so now [102:24] I think uh that kind of like concludes [102:26] the programming section of this video we [102:28] basically kind of uh did a pretty good [102:30] job and um of implementing this [102:32] Transformer uh but the picture doesn't [102:35] exactly match up to what we've done so [102:37] what's going on with all these digital [102:38] Parts here so let me finish explaining [102:41] this architecture and why it looks so [102:43] funky basically what's happening here is [102:45] what we implemented here is a decoder [102:47] only Transformer so there's no component [102:50] here this part is called the encoder and [102:52] there's no cross attention block here [102:55] our block only has a self attention and [102:58] the feet forward so it is missing this [103:00] third in between piece here this piece [103:03] does cross attention so we don't have it [103:05] and we don't have the encoder we just [103:07] have the decoder and the reason we have [103:08] a decoder only uh is because we are just [103:12] uh generating text and it's [103:13] unconditioned on anything we're just [103:15] we're just blabbering on according to a [103:16] given data set what makes it a decoder [103:19] is that we are using the Triangular mask [103:21] in our uh trans former so it has this [103:24] Auto regressive property where we can [103:26] just uh go and sample from it so the [103:28] fact that it's using the Triangular [103:30] triangular mask to mask out the [103:32] attention makes it a decoder and it can [103:34] be used for language modeling now the [103:37] reason that the original paper had an [103:39] incoder decoder architecture is because [103:41] it is a machine translation paper so it [103:43] is concerned with a different setting in [103:45] particular it expects some uh tokens [103:49] that encode say for example French and [103:52] then it is expecting to decode the [103:54] translation in English so so you [103:56] typically these here are special tokens [103:59] so you are expected to read in this and [104:02] condition on it and then you start off [104:04] the generation with a special token [104:05] called start so this is a special new [104:08] token um that you introduce and always [104:10] place in the beginning and then the [104:12] network is expected to Output neural [104:15] networks are awesome and then a special [104:17] end token to finish the [104:20] generation so this part here will be [104:23] decoded exactly as we we've done it [104:25] neural networks are awesome will be [104:27] identical to what we did but unlike what [104:29] we did they wanton to condition the [104:32] generation on some additional [104:34] information and in that case this [104:36] additional information is the French [104:38] sentence that they should be [104:39] translating so what they do now is they [104:42] bring in the encoder now the encoder [104:45] reads this part here so we're only going [104:48] to take the part of French and we're [104:50] going to uh create tokens from it [104:52] exactly as we've seen in our video and [104:54] we're going to put a Transformer on it [104:57] but there's going to be no triangular [104:58] mask and so all the tokens are allowed [105:00] to talk to each other as much as they [105:02] want and they're just encoding [105:04] whatever's the content of this French uh [105:07] sentence once they've encoded it they [105:10] they basically come out in the top here [105:13] and then what happens here is in our [105:14] decoder which does the uh language [105:17] modeling there's an additional [105:20] connection here to the outputs of the [105:22] encoder [105:23] and that is brought in through a cross [105:26] attention so the queries are still [105:28] generated from X but now the keys and [105:30] the values are coming from the side the [105:32] keys and the values are coming from the [105:34] top generated by the nodes that came [105:36] outside of the de the encoder and those [105:40] tops the keys and the values there the [105:42] top of it feed in on a side into every [105:45] single block of the decoder and so [105:47] that's why there's an additional cross [105:49] attention and really what it's doing is [105:51] it's conditioning the decoding [105:53] not just on the past of this current [105:55] decoding but also on having seen the [105:59] full fully encoded French um prompt sort [106:04] of and so it's an encoder decoder model [106:06] which is why we have those two [106:07] Transformers an additional block and so [106:09] on so we did not do this because we have [106:12] no we have nothing to encode there's no [106:13] conditioning we just have a text file [106:15] and we just want to imitate it and [106:16] that's why we are using a decoder only [106:19] Transformer exactly as done in [106:21] GPT okay okay so now I wanted to do a [106:24] very brief walkthrough of nanog GPT [106:26] which you can find in my GitHub and uh [106:28] nanog GPT is basically two files of [106:30] Interest there's train.py and model.py [106:33] train.py is all the boilerplate code for [106:35] training the network it is basically all [106:38] the stuff that we had here it's the [106:40] training loop it's just that it's a lot [106:42] more complicated because we're saving [106:44] and loading checkpoints and pre-trained [106:46] weights and we are uh decaying the [106:48] learning rate and compiling the model [106:50] and using distributed training across [106:51] multiple nodes or GP use so the training [106:54] Pi gets a little bit more hairy [106:56] complicated uh there's more options Etc [106:59] but the model.py should look very very [107:01] um similar to what we've done here in [107:04] fact the model is is almost identical so [107:08] first here we have the causal self [107:09] attention block and all of this should [107:11] look very very recognizable to you we're [107:13] producing queries Keys values we're [107:16] doing Dot products we're masking [107:18] applying soft Maxs optionally dropping [107:20] out and here we are pulling the wi the [107:23] values what is different here is that in [107:25] our code I have separated out the [107:30] multi-headed detention into just a [107:31] single individual head and then here I [107:34] have multiple heads and I explicitly [107:36] concatenate them whereas here uh all of [107:39] it is implemented in a batched manner [107:41] inside a single causal self attention [107:43] and so we don't just have a b and a T [107:45] and A C Dimension we also end up with a [107:47] fourth dimension which is the heads and [107:50] so it just gets a lot more sort of hairy [107:52] because we have four dimensional array [107:54] um tensors now but it is um equivalent [107:57] mathematically so the exact same thing [107:59] is happening as what we have it's just [108:01] it's a bit more efficient because all [108:02] the heads are now treated as a batch [108:04] Dimension as [108:05] well then we have the multier perceptron [108:08] it's using the Galu nonlinearity which [108:10] is defined here except instead of Ru and [108:13] this is done just because opening I used [108:14] it and I want to be able to load their [108:17] checkpoints uh the blocks of the [108:19] Transformer are identical to communicate [108:21] in the compute phase as we saw and then [108:23] the GPT will be identical we have the [108:25] position encodings token encodings the [108:27] blocks the layer Norm at the end uh the [108:30] final linear layer and this should look [108:33] all very recognizable and there's a bit [108:35] more here because I'm loading [108:36] checkpoints and stuff like that I'm [108:38] separating out the parameters into those [108:40] that should be weight decayed and those [108:42] that [108:42] shouldn't um but the generate function [108:44] should also be very very similar so a [108:47] few details are different but you should [108:48] definitely be able to look at this uh [108:51] file and be able to understand little [108:52] the pieces now so let's now bring things [108:55] back to chat GPT what would it look like [108:57] if we wanted to train chat GPT ourselves [108:59] and how does it relate to what we [109:00] learned today well to train in chat GPT [109:03] there are roughly two stages first is [109:05] the pre-training stage and then the [109:07] fine-tuning stage in the pre-training [109:09] stage uh we are training on a large [109:12] chunk of internet and just trying to get [109:14] a first decoder only Transformer to [109:17] babble text so it's very very similar to [109:20] what we've done ourselves except we've [109:23] done like a tiny little baby [109:24] pre-training step um and so in our case [109:28] uh this is how you print a number of [109:30] parameters I printed it and it's about [109:32] 10 million so this Transformer that I [109:35] created here to create little [109:37] Shakespeare um Transformer was about 10 [109:40] million parameters our data set is [109:42] roughly 1 million uh characters so [109:45] roughly 1 million tokens but you have to [109:47] remember that opening I is different [109:48] vocabulary they're not on the Character [109:50] level they use these um subword chunks [109:53] of words and so they have a vocabulary [109:55] of 50,000 roughly elements and so their [109:58] sequences are a bit more condensed so [110:01] our data set the Shakespeare data set [110:03] would be probably around 300,000 uh [110:05] tokens in the open AI vocabulary roughly [110:09] so we trained about 10 million parameter [110:11] model on roughly 300,000 tokens now when [110:14] you go to the gpt3 [110:16] paper and you look at the Transformers [110:20] that they trained they trained a number [110:22] of trans Transformers of different sizes [110:24] but the biggest Transformer here has 175 [110:27] billion parameters uh so ours is again [110:29] 10 million they used this number of [110:31] layers in the Transformer this is the [110:34] nmed this is the number of heads and [110:36] this is the head size and then this is [110:39] the batch size uh so ours was [110:43] 65 and the learning rate is similar now [110:46] when they train this Transformer they [110:47] trained on 300 billion tokens so again [110:51] remember ours is about 300,000 [110:53] so this is uh about a millionfold [110:56] increase and this number would not be [110:57] even that large by today's standards [110:59] you'd be going up uh 1 trillion and [111:01] above so they are training a [111:04] significantly larger [111:06] model on uh a good chunk of the internet [111:10] and that is the pre-training stage but [111:12] otherwise these hyper parameters should [111:13] be fairly recognizable to you and the [111:15] architecture is actually like nearly [111:17] identical to what we implemented [111:18] ourselves but of course it's a massive [111:20] infrastructure challenge to train this [111:22] you're talking about typically thousands [111:24] of gpus having to you know talk to each [111:27] other to train models of this size so [111:29] that's just a pre-training stage now [111:32] after you complete the pre-training [111:33] stage uh you don't get something that [111:35] responds to your questions with answers [111:38] and is not helpful and Etc you get a [111:40] document [111:41] completer right so it babbles but it [111:44] doesn't Babble Shakespeare it babbles [111:46] internet it will create arbitrary news [111:48] articles and documents and it will try [111:50] to complete documents because that's [111:51] what it's trained for it's trying to [111:52] complete the sequence so when you give [111:54] it a question it would just uh [111:56] potentially just give you more questions [111:58] it would follow with more questions it [112:00] will do whatever it looks like the some [112:02] close document would do in the training [112:05] data on the internet and so who knows [112:07] you're getting kind of like undefined [112:08] Behavior it might basically answer with [112:11] to questions with other questions it [112:13] might ignore your question it might just [112:15] try to complete some news article it's [112:17] totally unineed as we say so the second [112:20] fine-tuning stage is to actually align [112:22] it to be an assistant and uh this is the [112:25] second stage and so this chat GPT block [112:28] post from openi talks a little bit about [112:30] how the stage is achieved we basically [112:34] um there's roughly three steps to to [112:36] this stage uh so what they do here is [112:39] they start to collect training data that [112:41] looks specifically like what an [112:42] assistant would do so these are [112:44] documents that have to format where the [112:46] question is on top and then an answer is [112:47] below and they have a large number of [112:50] these but probably not on the order of [112:51] the internet uh this is probably on the [112:53] of maybe thousands of examples and so [112:58] they they then fine-tune the model to [113:00] basically only focus on documents that [113:03] look like that and so you're starting to [113:05] slowly align it so it's going to expect [113:07] a question at the top and it's going to [113:08] expect to complete the answer and uh [113:11] these very very large models are very [113:13] sample efficient during their [113:14] fine-tuning so this actually somehow [113:16] works but that's just step one that's [113:19] just fine tuning so then they actually [113:20] have more steps where okay the second [113:23] step is you let the model respond and [113:25] then different Raiders look at the [113:27] different responses and rank them for [113:29] their preference as to which one is [113:30] better than the other they use that to [113:32] train a reward model so they can predict [113:35] uh basically using a different network [113:37] how much of any candidate [113:39] response would be desirable and then [113:43] once they have a reward model they run [113:45] po which is a form of polic policy [113:47] gradient um reinforcement learning [113:49] Optimizer to uh fine-tune this sampling [113:53] policy uh so that the answers that the [113:55] GP chat GPT now generates are expected [113:59] to score a high reward according to the [114:02] reward model and so basically there's a [114:04] whole aligning stage here or fine-tuning [114:07] stage it's got multiple steps in between [114:09] there as well and it takes the model [114:11] from being a document completer to a [114:14] question answerer and that's like a [114:16] whole separate stage a lot of this data [114:19] is not available publicly it is internal [114:21] to open AI and uh it's much harder to [114:24] replicate this stage um and so that's [114:27] roughly what would give you a chat GPT [114:29] and nanog GPT focuses on the [114:31] pre-training stage okay and that's [114:32] everything that I wanted to cover today [114:35] so we trained to summarize a decoder [114:38] only Transformer following this famous [114:41] paper attention is all you need from [114:43] 2017 and so that's basically a GPT we [114:47] trained it on Tiny Shakespeare and got [114:50] sensible results [114:52] all of the training code is [114:54] roughly 200 lines of code I will be [114:57] releasing this um code base so also it [115:01] comes with all the git log commits along [115:04] the way as we built it [115:05] up in addition to this code I'm going to [115:08] release the um notebook of course the [115:10] Google collab and I hope that gave you a [115:13] sense for how you can train um these [115:16] models like say gpt3 that will be um [115:19] architecturally basically identical to [115:20] what we have but they are somewhere [115:22] between 10,000 and 1 million times [115:24] bigger depending on how you count and so [115:27] uh that's all I have for now uh we did [115:30] not talk about any of the fine-tuning [115:32] stages that would typically go on top of [115:33] this so if you're interested in [115:35] something that's not just language [115:36] modeling but you actually want to you [115:38] know say perform tasks um or you want [115:40] them to be aligned in a specific way or [115:43] you want um to detect sentiment or [115:45] anything like that basically anytime you [115:47] don't want something that's just a [115:48] document completer you have to complete [115:50] further stages of fine tuning which did [115:52] not cover uh and that could be simple [115:55] supervised fine tuning or it can be [115:57] something more fancy like we see in chat [115:58] jpt where we actually train a reward [116:00] model and then do rounds of Po to uh [116:03] align it with respect to the reward [116:04] model so there's a lot more that can be [116:06] done on top of it I think for now we're [116:08] starting to get to about two hours Mark [116:10] uh so I'm going to um kind of finish [116:13] here uh I hope you enjoyed the lecture [116:15] uh and uh yeah go forth and transform [116:18] see you later