[0:00] ChatGPT is powered by a model called [0:02] GPT, which is based on a deep learning [0:05] architecture called the Transformer. [0:07] Transformers are the reason behind the modern [0:09] day AI boom. As an AI enthusiast, when you [0:12] start learning Transformers you will [0:14] come across this complex diagram, which [0:16] will start giving you a headache [0:18] immediately. My goal for today's video is [0:21] to explain Transformers in the most [0:23] simplified and intuitive manner. We need [0:26] to cover many different topics, so this [0:28] is going to be a long video. Attention [0:31] and patience is all I need from you [0:33] today. When you type a sentence in [0:35] Gmail, it tries to predict the next word or [0:37] next set of words. This is possible [0:39] because of a machine learning model [0:41] called a language model. Google, for example, [0:44] has this popular language model called [0:46] BERT, which is powering hundreds of AI [0:49] applications throughout the world. GPT, [0:52] which is the model behind ChatGPT, is a [0:55] large language model. The reason it is [0:57] called a large language model is that [0:59] it has billions of parameters. It is much [1:02] more capable and powerful compared to [1:05] BERT, and it is trained on a humongous [1:07] amount of data. Fundamentally, though, it [1:10] is doing the same thing: [1:12] when you type a question in ChatGPT, [1:15] it will predict the next word in that [1:18] sentence, and then it will take the [1:20] original question and the next predicted [1:23] word as input and predict the [1:25] next word, and then the next word, and so [1:28] on. In the end it produces a complete [1:31] answer, which almost sounds like magic. [1:33] To summarize, the goal of a language [1:36] model is to predict the next word in a [1:39] sentence. Now that we have understood [1:41] this fundamental, let's look into some of [1:44] the topics which need to be clarified [1:46] before we dig into the actual [1:48] 
architecture. The first concept we need [1:50] to understand is word embedding. Machine [1:53] learning models do not understand text; [1:56] they understand numbers, so we need to [1:58] represent text as numbers. Let's say you [2:01] have this word "king" and you want to [2:03] represent it numerically. How would you [2:06] do that? Well, you can assign a fixed [2:08] number: you can have a vocabulary and you [2:10] can assign just a fixed static number, but [2:13] that will not capture its meaning. [2:15] When you're building a language model [2:18] you have to represent words in such a [2:21] way that they capture the meaning of [2:24] that word. One way to capture the meaning [2:27] of this word "king" and represent it [2:30] numerically is to ask a bunch of questions. [2:33] For example: does this person have [2:35] authority? Yes, 1. Do they have a tail? No; [2:38] a horse has a tail, a king doesn't have a [2:40] tail. Are they rich? Yes. Gender: minus one [2:42] is male, one is female, and so on. What we [2:46] just did is create this list [2:50] of numbers, which is a vector, to [2:53] represent this word "king". Similarly we [2:55] can represent the word "queen" as well. Not [2:58] only that, we can represent a bunch of [3:01] words such as battle, horse, king and so [3:04] on by asking a set of questions. Okay, so [3:07] for battle: it doesn't have authority. Is [3:09] it an event? Yes. Does a battle have a tail? No, [3:13] a battle is an event, it doesn't have a [3:15] tail, and so on. Similarly, horse: does it [3:17] have authority? Well, we'll just say 0.1; [3:21] if it is the king's horse maybe it has some [3:23] authority, or maybe it has authority [3:25] over its horse kids, and so on. [3:29] Similarly we can represent all these [3:31] words in a numeric format, and then we [3:34] can take the vector of this word "king" [3:39] and maybe do some mathematics [3:41] with it. We can say king minus man, so [3:45] here I'm taking the vector of "man", right, [3:49] which is this particular vector, plus [3:51] 
woman, which is this particular vector, [3:54] and when you do the math element by element, like 1 [3:56] - 2 + 2 and so on, [4:00] you get a vector which looks [4:04] similar to "queen". Now this sounds like [4:06] magic: we can do math with words, which is [4:09] king minus man plus woman is equal to [4:13] queen. Here "king" was represented in five [4:16] dimensions. When you look at real [4:19] life, for example Google's word2vec [4:22] model, it has 300 dimensions. And what are [4:26] all these questions, by the way? Well, we [4:28] actually don't know. This has been [4:30] trained through a neural network: we [4:34] have processed a huge amount of text, such [4:36] as all the Wikipedia articles, all the [4:39] books and text on the internet, to understand [4:42] the relationship between these words, and [4:45] through that neural network training with [4:47] backpropagation we came up with this [4:49] vector. The example that I gave for "king", [4:53] where we asked these questions, authority [4:55] and so on, was just a made-up example [4:58] for building intuition for word [5:01] embeddings. In real life we do not know [5:05] what all these numbers mean. All we can [5:08] say is these numbers are the features [5:11] for this word and they capture the [5:14] meaning of this word "king". Let's say king [5:16] is a three-dimensional embedding. If you [5:20] have to represent that in this 3D [5:23] embedding space you can represent it [5:25] like this, where the x-axis has [5:28] this number three, the y-axis has this number [5:30] eight, the z-axis has this number two, and so [5:33] on. I use three dimensions because as [5:35] humans we can view only three dimensions; [5:38] we can't possibly view these 300 [5:40] dimensions, okay, but mathematically those [5:43] 300 dimensions are possible. Models like [5:46] GPT use an embedding vector which has [5:49] even 12,000 dimensions, okay, so it's a [5:53] very rich, high-dimensional space that we [5:55] are working with. In three-dimensional [5:57] space you 
can have vectors for king [5:59] and queen that look something like this, [6:02] and if you look at this vector which is [6:05] joining king and queen, you can think of [6:08] that as a gender direction. The benefit [6:11] of this gender direction vector is that [6:14] when you have another embedding for [6:16] "uncle" you can add that gender direction [6:19] and get the embedding for the word "aunt". [6:23] Similarly, if you have father you can get [6:25] mother, if you have man you can get [6:28] woman, and so on, and that allows you to [6:30] do this amazing math such as queen minus [6:33] king plus uncle is equal to aunt. [6:35] Another example is you can have a country [6:38] to capital city direction vector, which [6:41] you can add to the embedding [6:45] of Russia to get the embedding of Moscow. [6:48] You can do Russia minus Moscow plus [6:50] Delhi equal to India. Now the embeddings [6:53] that we're talking about are static [6:55] embeddings. word2vec and GloVe are two popular [6:59] models [7:00] which help you get the static embedding; [7:02] static means a fixed embedding for all [7:05] these words. You may ask how these [7:07] embeddings are generated. Well, I already [7:09] answered that question: you train [7:11] a neural network model on a humongous [7:14] amount of text, Wikipedia, books and so on, [7:16] to understand the relationship between [7:19] the words. I'm not going to go into the [7:21] mathematical details of word2vec; you can [7:24] refer to some other material on the internet, [7:27] and I have YouTube videos for that. I'm not [7:30] going to go into the math of that, but [7:33] just think that the neural network [7:35] tries to understand the relationship [7:37] between these words and creates these [7:39] static embeddings. Now the problem with [7:41] static embedding is that you can have a [7:44] static embedding for this word "track", but [7:48] based on the sentence that "track" can [7:50] mean different things, right? Like here [7:52] I'm saying the train 
will run on the [7:54] track, and my package is late, help me [7:56] track it, so the meaning of "track" is a [7:59] little different, and when you have a [8:00] static embedding you get into this [8:03] problem where you're not able to [8:04] represent this word properly based on [8:07] the context of the sentence. You will [8:10] see the same issue here for "dish": you can [8:13] have a fixed embedding, but in this [8:16] sentence I'm talking about a rice dish. If [8:19] I had a cheese dish then the embedding [8:21] of "dish" should be a little different, [8:23] because the meaning of the word "dish" is a [8:26] little different when I say rice dish [8:27] versus cheese dish. When you are working [8:30] on predicting the next word for this [8:33] sentence you can have words such as [8:35] risotto, idli, Mexican rice, but when I say [8:39] "I made an Indian rice dish called", all of a [8:42] sudden the probabilities of my next [8:44] words will change: I will have words such [8:47] as idli, biryani, kheer. If I add one more [8:51] adjective and say "I made a sweet Indian [8:53] rice dish", in that case again it will [8:56] change: I will not have biryani as a next [8:59] word prediction; I will probably have kheer [9:01] or pongal. To summarize, to build an [9:04] application like ChatGPT, just the static [9:08] word embeddings are not enough; what you [9:10] need is contextual embedding. Let's [9:13] understand contextual embedding in a bit [9:15] more detail. When you represent this [9:18] word "dish" in your embedding space, it is [9:21] a static embedding. When you say rice [9:25] dish, maybe there is another vector in [9:28] the same space which can accurately [9:31] describe "rice dish", or which can correctly [9:36] capture the meaning of "rice dish", [9:38] which can be risotto, biryani and so on. The [9:41] direction from "dish" to "rice dish" we can [9:43] call riceness, and when you add that riceness [9:46] vector to "dish" what you get is the [9:49] embedding for a rice dish. There is [9:52] another vector or embedding for Indian 
[9:55] rice dish, and to go from rice dish to [9:57] Indian rice dish you probably need to add [9:59] this vector or direction called [10:01] Indianness, and the same thing for sweet [10:04] Indian rice dish. In order to generate [10:07] the contextual embedding, what we need to do [10:10] is take the original static embedding [10:13] for the word "dish" and have all these [10:16] other words influence that static [10:19] embedding, or change that static [10:22] embedding, so that it can capture the [10:25] meaning of all these adjectives. Once you [10:27] have done that, and once you have a [10:29] contextual embedding for "dish", it won't [10:31] be hard to predict the next word, which [10:33] is kheer. Look at this other sentence, where [10:37] I'm saying Dhaval loves dosa, dhokla and millet [10:40] and so on, Bhavin loves pasta and so on, they [10:43] both went out for dinner, and here Bhavin [10:47] said, "Bro, we'll go to a restaurant that [10:49] you like", and after some time they were [10:52] in an Indian restaurant. Now the way you [10:55] predicted this word "Indian" was based on [10:58] this context, [11:00] such as Dhaval loves all these items which are [11:03] part of Indian cuisine. Also, Bhavin said [11:08] to Dhaval, "Bro, we'll go to a restaurant you [11:11] like". If instead of "you" [11:14] he had said "I", this word would become [11:18] Italian instead of Indian, right? Also, [11:22] instead of Bhavin, if it was Dhaval here, [11:25] then also this would become Italian. [11:29] So you can understand that this [11:31] prediction, "Indian", is influenced by [11:35] not just the few words which are prior [11:38] to that word; it can be influenced by [11:41] some words which are far out in that [11:43] paragraph. Okay, to summarize, the [11:45] objective here [11:48] is to generate a contextual [11:52] embedding, and if you think about this [11:53] embedding space, mathematically speaking [11:56] what you're doing is taking the static [11:59] embedding for the word "dish" and then [12:01] 
adding the embeddings for all these [12:05] vectors, riceness, Indianness, all these [12:07] adjectives, and getting your final [12:10] contextual embedding. Let's now dig into [12:13] the Transformer architecture. The [12:14] architecture has two components: encoder [12:18] and decoder. The purpose of the encoder is to [12:21] take the input sentence and generate the [12:24] contextual embedding for each of the [12:26] words, or each of the tokens, in that [12:29] sentence. Once the contextual embedding [12:32] is generated we feed that to a decoder [12:36] here and we try to predict the next word. [12:39] So if you're working on next word [12:42] prediction you will predict the next [12:44] word here; for example it will be kheer. [12:46] When you talk about natural language [12:48] processing there are multiple tasks: [12:50] one task is to predict the next word, the [12:53] other task would be to translate a [12:55] sentence. Here I'm translating the [12:58] sentence from English to Hindi. [13:00] In that case you will still produce the [13:03] contextual embedding from the encoder, you [13:06] feed that to the decoder, and the decoder will [13:10] start predicting the next word. So here [13:12] it will start with this fixed start [13:15] token and then it will produce the [13:19] probability of the next word. Here the [13:22] probability of this word "maine" is highest, [13:25] and here you can have the entire [13:27] vocabulary; for example in the case of [13:31] BERT you have some 30,000 words. So [13:34] you'll have all the words in your [13:36] language and you will ask, okay, which word has [13:38] the highest probability of being my next word? [13:41] Then you put that word "maine" into this [13:44] input. Okay, so there are two inputs [13:47] actually: one is the contextual embedding [13:49] which is coming from here, from the [13:52] encoder, and the other one is whatever [13:55] output you have produced so far from the [13:57] previous step; you feed that as an [14:00] input here, and then it 
will produce the [14:03] next word, which is kheer. Once again you [14:06] provide kheer here and it produces "banai". [14:10] So eventually it produces the entire [14:13] translated sentence. All right, so that's [14:15] the objective of your encoder and [14:18] decoder parts. In whatever I talked about so [14:21] far, I was referring to the inference [14:25] stage of neural networks. Whenever you [14:28] have these deep learning models you [14:31] have two stages. In one stage the model [14:33] is not trained; it's like a baby. A baby is [14:35] not trained yet and you train them, right? [14:38] You send them to school, you train them [14:41] in your home, and at some point they [14:43] become adults and can figure things [14:45] out on their own. Similarly, a machine [14:48] learning model goes through a training [14:49] phase, and when it is ready it starts [14:53] working on real-world problems, and [14:55] that is called inference. So whatever I [14:57] talked about for predicting the next [15:00] word or translating a sentence, I was [15:03] referring to the inference stage. Throughout [15:05] the discussion we'll be referring to two [15:07] specific models called BERT and GPT. If [15:11] you look at this architecture, that's a [15:13] generic architecture for a Transformer [15:15] model. The Transformer model is a general [15:18] concept, whereas BERT and GPT are [15:21] specific models, or specific [15:23] implementations, based on the Transformer [15:26] architecture. If you look at the BERT [15:27] architecture, it has only the encoder [15:31] part, okay, so only this part; the decoder part [15:34] is missing. So BERT will take the input [15:36] sentence, it will produce the contextual [15:40] embedding, and that's it, whereas GPT has [15:42] only the decoder. It still takes the input, [15:45] it will produce the contextual embedding [15:47] and so on, and then here it will predict [15:51] the next word. I mean, it sounds like it [15:53] has both encoder and decoder, but [15:56] fundamentally the architecture looks a [15:58] 
little different for GPT, but it is still [16:00] based on the Transformer architecture. [16:04] The way they're trained is you take all [16:06] the text from Wikipedia, crawled text from the [16:08] internet, book corpora, and you train these [16:12] models. When you have this article, for [16:14] example, and you have this sentence, [16:16] "developing an advanced crewed", see, if I [16:19] give you this sentence, most likely you [16:20] will say "spacecraft" or "vehicle" as the next [16:24] word. You'll not say "banana", right? So the [16:26] probability of having "banana" as the next [16:28] word in this sentence is very low, [16:30] whereas these two words have a high [16:32] probability. We as humans have read so [16:35] much text that we have learned this [16:38] art of predicting the next word, and the same [16:40] thing goes for BERT and GPT, where they [16:45] understand the relationship between the [16:47] words and the context in which they appear. [16:49] So let's say BERT during training [16:52] has encountered so much text, and every [16:55] time after this word "crewed" it has [16:59] seen the word "spacecraft" or "vehicle"; [17:03] it would not have seen phrases like [17:05] "crewed chair" or "crewed banana", right? That [17:08] kind of phrase it will usually not see [17:10] during training, so [17:12] it will learn to predict the high [17:14] probability words, right? So for a word like [17:16] banana the probability is going to be lower. [17:19] Same thing for this article: when you go [17:21] through this kind of sentence, right, see, [17:24] engaging both alliances and hostilities, [17:27] and there will be many more articles on [17:29] battles and warriors, and everywhere [17:32] after "alliances" there will be either [17:34] "hostilities" or "negotiations"; there will [17:37] not be a word like "chair", so the probability [17:39] of that will be very, very low. Now when [17:43] you go through the training you are [17:45] going through all these words, right, so [17:47] all these words will form something 
[17:49] called a vocabulary. So for a model, a [17:53] vocabulary will look something like this. [17:55] Now there is a difference between a word [17:57] and a token. For example, here "playing" [18:02] is a word, and one of the ways to tokenize it [18:05] is to have two tokens: one token is [18:08] "play", the second token is "ing". Okay, so in terms of [18:12] tokens there are two, but the word is [18:14] just one. But just for understanding [18:19] purposes, just for easy explanation, you [18:21] can think of words as tokens. Technically [18:23] they're different, but you can think of [18:25] words as tokens, okay. So let's say [18:28] you have a vocabulary of all these [18:30] tokens, let's say [18:32] 30,000 words. What happens here is, for [18:36] each of these words, during the training [18:39] it will create those static embeddings. [18:42] So for the word "made", or let's say for the word [18:45] "and", seven is the index, and let's say [18:49] this is the static embedding vector, and [18:52] the dimension or the size, see, there is a [18:54] dot dot dot, so what is the size of this? [18:56] Well, for BERT it is 768, for GPT it's [19:02] 12,288, right? So based on the model the [19:04] dimension of your embedding vector can [19:07] vary. When you go through this training, [19:11] for every token in your vocabulary you [19:14] will have this static embedding, and this [19:17] whole table is called the static embedding [19:21] matrix. During the training it will also [19:24] learn a few other things such as WQ, WK, WV, [19:28] and you are like, what the hell is this? [19:30] Well, we will talk about this later, but [19:33] for now just remember, when you train [19:35] these models they have this static [19:39] word embedding matrix, which is the [19:41] static embedding for every token in your [19:44] entire vocabulary, and they also [19:46] have the special matrices WQ, WK, WV, [19:50] which we'll talk about later. Let's have [19:52] a look inside the encoder and review [19:55] these two specific steps. So you give a [19:59] 
sentence to your Transformer and it will [20:02] first tokenize it. Tokens are kind of [20:05] like words, but for the word "called" there [20:08] will be two tokens, "call" and "ed". So it [20:11] will tokenize it, and there are various [20:13] ways to tokenize your sentence; this is [20:16] one of the ways. It will also add special [20:19] tokens at the beginning and at the end, [20:22] CLS and SEP. SEP is for separator, so if [20:25] you have two sentences, between the two [20:27] sentences there will be a separator, and [20:29] CLS will be added at the beginning, and [20:32] here I'm talking about BERT. Then it will [20:35] also generate token IDs, so for each of [20:38] these words there will be an index into [20:41] your vocabulary. For example "made" is [20:44] 2532, which means in your entire [20:47] vocabulary, which is just like a list, the [20:49] word "made" is at position [20:54] 2532. If you talk about BERT, it has a total of [20:58] 30,522 tokens, and GPT has around [21:03] 50,000 tokens. From these token IDs, so [21:07] step number one was to generate tokens and [21:09] token IDs, you then get the static [21:14] embedding for each of these tokens. And [21:16] from where do you get it? Well, we just [21:18] saw, right, during the training you are [21:21] generating this static word embedding [21:23] matrix, so for each of the words or [21:26] tokens you have [21:29] the static word embedding. In the case [21:32] of BERT the size of this will be 768; if [21:36] it is GPT it will be 12,000, you know, [21:38] that long embedding vector. So you [21:41] produce that for each of the tokens, and [21:44] then you will also create something [21:46] called a positional embedding. Now in [21:49] language the word order matters, okay, so [21:52] if I put "made" before "I" it will change [21:55] the meaning of the sentence. So the [21:56] order matters, and the way the Transformer [21:59] works is it will process the entire [22:02] input sequence in parallel. It is not [22:05] like an RNN, where it will process these [22:08] 
words one by one; it will process this [22:11] sequence all in parallel. Now it needs to [22:15] have knowledge of the order, okay, so for [22:18] that it uses a special technique called [22:21] positional embedding, where it will add a [22:26] small vector to each of these embeddings. [22:29] Okay, so let's say this is the vector for [22:32] position number one, this is the vector [22:34] for position number two, and so on, and [22:37] when you get this resulting vector, [22:41] this vector will embed the knowledge of [22:45] position. So this vector will know [22:48] that this is the first word, [22:50] this vector will know that [22:52] this is the second word. Now how exactly is [22:55] that done? Well, there is math behind [22:57] it. I'm not going to go into the math, but [22:59] I'm showing you the formula from the [23:01] original Transformer paper, so using this [23:04] formula you are essentially deriving [23:07] all these positional embeddings. All [23:10] right, so that was step two. The first [23:13] step was to produce the static embedding [23:16] for each of the tokens, and then the [23:19] second step is to add the positional [23:21] embedding, like this is a plus sign. So [23:23] here at this point what you get is this [23:27] kind of position-aware [23:29] embedding. Just like how my nephew needs [23:31] my attention, words also need the attention [23:34] of surrounding words in order to produce [23:38] the contextual embedding. In [23:42] 2017 groundbreaking research was done [23:46] when this paper, "Attention Is All You [23:48] Need", was published by a bunch of Google [23:52] researchers, and that completely [23:54] transformed the landscape of AI, okay, and [23:58] this is the architecture that we are [24:00] talking about; the architecture is taken [24:02] from this "Attention Is All You Need" [24:05] paper. So the way it works is, when you [24:07] have this sentence, the word "Indian" needs [24:11] attention from dosa, dhokla, etc. You can say [24:15] that all these 
words are attending to [24:19] this word "Indian". Even the word Bhavin: instead of [24:22] Bhavin, let's say here, if I had Dhaval, this [24:26] would become Italian instead of Indian. So [24:29] this word Bhavin is attending to this word [24:32] "Indian"; dosa, dhokla, etc. are also attending [24:34] to this word "Indian". Similarly the words [24:38] sweet, Indian, rice, etc. are attending to [24:43] this word "dish". Now how much are they [24:46] attending to this word? Well, that [24:48] attention weight or attention score [24:50] might be different. For example "sweet" [24:53] might be attending to this word "dish" by [24:55] 36 percent, let's say "Indian" is attending [24:58] to it by 14 percent, "rice" is attending to it [25:01] by 18 percent. These are the adjectives [25:04] which will enrich the meaning of the word [25:06] "dish". On the other hand the words "I", "made", [25:11] etc. are not enriching the meaning of the [25:14] word "dish" that much, because instead of "I", [25:17] if I had Rahul or Mohan or David, the [25:22] meaning of this word would not change [25:24] that much. But instead of "sweet", if I have [25:27] "spicy", all of a sudden the embedding or [25:30] the meaning of this dish changes, because [25:32] as the next word I will immediately have [25:34] biryani instead of kheer. The goal here is to [25:38] build this kind of attention weight or [25:41] attention score, okay, for each of these [25:43] words. It's a matrix, because for "dish" all [25:47] the other words in that same sentence [25:50] are enriching the meaning of [25:53] that word. Okay, so for the word "dish", let's [25:56] say "sweet" is attending to it by 36 percent, [26:00] "Indian" is attending to it by 11 percent, and [26:03] so on, and by the way I have just made up [26:04] these numbers for explanation [26:06] purposes. The word "dish" also attends to [26:11] itself, right, because "dish" [26:13] itself has some meaning: dish means dish, [26:15] right, so it will also attend to itself. [26:19] So for every word, see, right now for "dish" [26:21] I have all these scores; for "rice" you will 
[26:24] have scores, for "Indian" too; for every word [26:27] you will try to compute these attention [26:30] scores, and then you will use this [26:33] concept of query, key and value to [26:38] come up with the contextual embedding. [26:40] Now let me explain query, key and [26:42] value by going over an analogy. Let's say [26:44] you're going to a library looking for a [26:47] book on quantum physics, especially [26:49] quantum computation. You might have this [26:52] query: hey, I'm looking for this [26:55] quantum physics book. And this particular [26:57] person, who is a librarian, will use the [27:02] book index, so he'll go to his computer and [27:04] try to search for that book, or maybe he [27:06] will go and locate a [27:08] specific rack which has a label "quantum [27:11] mechanics". Okay, so for him the key or the [27:15] index to locate that book is the label [27:19] on the rack. You know, in a library you see [27:21] labels like history, drama, science, those kinds of [27:24] labels, or you have a book description. Okay, [27:28] so based on the book description and the rack [27:31] label you will figure out the [27:34] appropriate book. So the book rack, book [27:37] description, etc. are called the key, and then [27:41] the actual book content is your value. So [27:44] let's say you pull this book, okay, and [27:46] whatever content, the actual content of [27:48] that book, is the value. Let me give you [27:51] another example. Let's say there is a [27:53] college professor who wants to write an [27:55] essay on quantum computing and he needs the [27:57] help of a bunch of students. So when [28:00] he talks to these students, Mohan says, [28:03] "I know linear algebra", Meera says, [28:06] "I know quantum mechanics", Bob will [28:09] say, "Hey, I know philosophy", and the same way Kathy [28:12] knows computer science. So here whatever [28:16] Mohan, Meera, Bob and Kathy are claiming about [28:20] their knowledge is called the key, and what [28:24] happens after that is each of these [28:26] students will start writing an essay, so 
[28:29] the teacher will say, okay, just go and write [28:33] a bunch of paragraphs. So Meera, Mohan, [28:36] Kathy and Bob wrote all these paragraphs, [28:39] which are called values, and then the teacher [28:42] knows that Meera knows the most about quantum [28:45] mechanics, okay, so he will take 60% of [28:49] Meera's content, or Meera's value. He will [28:53] take 29% of Kathy's value, because [28:56] computer science and quantum computing [28:58] are kind of related. So he will [29:01] use 60% of Meera's content and 29% of Kathy's [29:05] content to formulate that final essay. On [29:09] the other hand, Bob's content he will use [29:12] only one percent, because the query and [29:15] key are not matching that much. See, Bob [29:17] has knowledge of philosophy but our [29:19] query requires quantum computing, so the [29:22] query and key, we can say, are not [29:24] matching. In terms of math you can think [29:26] about the dot product: let's say the dot product [29:29] between the query and key vectors is low, [29:32] let's say only one percent, okay, but in [29:35] the other case, Meera, the query and key dot [29:38] product is higher, let's say 60%, so you [29:40] will take 60% of Meera's value, which is [29:44] the essay written by Meera on quantum [29:48] computing. Now the same way, for our sentence [29:51] the query for "dish" is "I want to know [29:54] about my modifiers". Okay, I'm just giving [29:56] you an analogy, by the way; the real [29:59] working is a little different. But let's [30:01] say you are generating the contextual [30:03] embedding for the word "dish", and the [30:05] query may look something like "I want to [30:08] know about my modifiers", right, like my [30:10] adjectives, all these adjectives which [30:12] modify my meaning, and the key will be [30:16] the description that each of these [30:18] words gives about itself. For [30:20] example "I" will say, I'm the subject of [30:22] the sentence; "made" will say, I indicate an [30:25] action or a verb; similarly "sweet" will [30:27] say, I am an adjective describing [30:30] 
taste, and so on. So these are called keys, [30:34] and based on the dot product between the [30:37] query and key you're trying to find [30:39] out which things are matching. [30:42] So if "dish" wants to know about its [30:44] modifiers, these are the [30:46] adjectives which modify the meaning of the [30:49] word "dish", so the attention score [30:52] for these will be higher, whereas the [30:55] attention score for these will be lower. [30:58] Now once you get all these attention [30:59] scores you need the values. So each of these [31:02] words will now state its value; value means [31:05] the component that it is contributing [31:09] to that query. So "Indian" will [31:12] say the style or origin is Indian, "sweet" [31:14] will say the taste is sweet; similarly [31:17] all these words will have a specific value, [31:21] and then let's consider the values of [31:24] only these four words. I mean, as such it [31:25] will use the values of all the words, but for [31:27] simplicity let's say only these four words. [31:30] These values, by the way, will be some [31:33] kind of vector; we'll look into how [31:35] exactly those vectors are derived, but [31:37] let's say these values are all these [31:39] vectors, and the query, like "dish", [31:42] also has its own vector, right, like this [31:43] is the static embedding, so this is its [31:46] own vector. And now what you do is, to the [31:48] static embedding you add all these [31:50] vectors, and all these vectors you can [31:52] think of as riceness, Indianness, okay. So [31:56] see, this is how you add all of them, okay; [32:00] you add all of them, actually the vectors [32:03] of all the other words, and you get the [32:06] final context-aware embedding. In [32:08] terms of the embedding space it is like [32:11] going from dish to riceness, Indianness and [32:14] so on. So these vectors, right, riceness, [32:16] Indianness, sweetness, are these vectors, [32:20] okay; this is just a mathematical [32:22] representation. Now let's look at how [32:24] those 
vectors are built. So here you have [32:27] a query for "dish", okay, so let me just [32:30] represent it horizontally, right; this [32:32] was a vertical format, this is a horizontal [32:34] format of the same thing. For each of these [32:37] words or tokens you will first get their [32:40] embedding from our static embedding [32:42] matrix, okay, so these are the static [32:44] embeddings for each of these words. In [32:46] the case of BERT the dimension is 768, [32:48] for GPT it is 12,000-something. Let's say [32:51] for the word "dish" this is my embedding; let's [32:54] call it E7. That E7 you will multiply [32:58] with a special matrix called WQ, which [33:03] will have a [33:05] dimension of 64 by 768, so 768 is the number of [33:10] columns. In order to perform matrix [33:12] multiplication, the columns in the first [33:14] matrix should be equal to the rows in the [33:16] second matrix, so this is 768 and this is [33:19] 768. The number of rows in the first [33:22] matrix is 64 for BERT; for GPT it is [33:26] different. And when you do [33:28] this kind of matrix multiplication [33:31] you will get this query vector. Okay, so [33:35] you will multiply this row with this [33:38] column, okay, so you multiply 50 with this [33:42] 0.9, [33:44] minus 5 with [33:46] 1.07, 65 with this, and then you add them [33:50] all up and put the result here. Then you take [33:53] the second row, multiply 23 with this, [33:56] minus 71 with this, 1.58 with this, and [33:59] you put that here, and so on. Okay, so this [34:03] is how you build a query vector. Now WQ [34:06] here knows how to encode the query of a [34:10] token for attention computation. When we [34:14] trained the model we already got WQ, [34:17] and after the training is done WQ [34:22] doesn't change. Okay, after you do that [34:24] training, which is sometimes referred to as [34:26] pre-training, on a huge amount of data, you [34:30] have this WQ matrix which doesn't [34:32] change. Okay, so for a trained model this [34:35] WQ will not change; you multiply that [34:37] with the specific 
embedding E7 (let's say [34:41] this is a positional embedding) and you get [34:43] Q7, which is the query vector for the [34:47] word dish, and you repeat the same [34:49] process for all the words. So just as [34:51] you have Q7 for dish, for rice you will [34:54] have Q6, for Indian you have Q5, and so on. [34:58] To summarize, WQ here knows how to encode [35:03] the query of a token for attention [35:05] computation. And remember, in one of the [35:07] previous slides I said that when the [35:10] model is trained it will have a static [35:12] embedding matrix but it will also have [35:14] this WQ, WK, WV, and that is what I was [35:18] referring to. So we just talked [35:20] about WQ here. The question now is, during [35:23] training, how exactly do we get WQ, WK, WV? [35:27] Well, we take this Transformer [35:29] architecture and we train it on a huge [35:32] amount of data. We take all the [35:33] Wikipedia text and we generate these kinds [35:36] of X and Y pairs. You don't have [35:39] to manually label it; this is called a [35:42] self-supervised data set. You don't [35:45] need a person to label it because you [35:47] can just split a sentence: you have part of a [35:48] sentence, and the next word is your Y. [35:51] So this is your X, this is your Y. [35:53] You feed X as an input, and when the [35:56] model is not trained it will not [35:58] predict the right things; it will make errors. [36:00] So let's say for this it produces Mexican, [36:03] which is your y-hat, the [36:05] predicted value. Your actual value is [36:07] Indian, so you calculate the [36:09] error and then you propagate that [36:12] error backward through backpropagation, the chain [36:15] rule, partial derivatives, and so on. Folks, [36:17] you need to have an understanding of how [36:20] backpropagation works and what the chain [36:22] rule is; you need to know all those deep [36:24] learning fundamentals. I have [36:26] covered that in other modules. If you're [36:29] part of my courses or boot camp, you [36:32] 
would have seen those. If you're watching [36:33] it from YouTube, again YouTube has [36:36] these kinds of tutorials; my channel has [36:38] these tutorials. So you need to know how [36:40] back propagation works. Essentially [36:44] you are feeding this data set, you're [36:45] computing the error, and you're back [36:47] propagating it throughout this [36:49] architecture, and during that back [36:51] propagation, when let's say you train [36:53] this on millions and millions of [36:55] sentences, that is the time when this [36:59] WQ, WK, WV will be finalized inside this [37:04] model architecture. Now going back: for the [37:08] dish query we computed this particular [37:11] query vector. The next step is to compute the [37:16] key vectors. I gave this kind of [37:18] analogy description to give you an [37:21] intuitive understanding, but in reality [37:24] these will be vectors, so let's see [37:27] how those vectors are formed. Here I'm [37:30] taking the first token, I, and the keys [37:33] look something like this. So here [37:36] you will take the positional embedding, [37:38] built from the static embedding, for the word I, and [37:41] you will multiply that with another [37:44] magical matrix, WK. Once again, WK after [37:48] your pre-training, after that model is [37:50] trained, is fixed. So you take that [37:53] matrix and you figure out your K1. [37:57] Here WK knows how to encode the key of [38:01] a token for the attention computation. [38:05] Then you go to the next word and compute K2, [38:08] the next word and compute K3; you do that for all [38:11] the words. So now for all these words we [38:15] have these key vectors. You have the [38:18] Q7 query vector, you have key vectors, [38:22] and you take the dot product between [38:24] these two, so K1 dot Q7. [38:29] If you take the dot product between [38:31] these two vectors you'll get some number, [38:34] like [38:36] 3.33, 57, 101, whatever that number is; [38:40] it's a single number. You will get 
that [38:43] for all the tokens, and then you let [38:48] it pass through a softmax function. From [38:51] deep learning fundamentals you should [38:52] know about softmax: softmax will convert [38:55] a bunch of values into a probability [38:57] distribution, so that when you add all these [38:59] values it will be one. So softmax is [39:02] converting all these raw scores [39:05] into a probability distribution so that [39:07] you can express them as percentages, and [39:10] the sum of all these percentages will be [39:12] one. Mathematically you can represent [39:15] this operation as softmax of Q times [39:18] KT, where KT is K transpose. Here Q [39:24] was a vector, but if you talk about, let's [39:26] say, this K, K is K1, K2, K3, so [39:30] it's not just one vector; actually it's [39:32] a bunch of vectors, so this can be [39:34] thought of as a matrix, and to multiply [39:37] that you need to do a transpose. See, if [39:39] you are multiplying Q7 with K1, a [39:42] single vector, you don't need to do a [39:44] transpose, but when you have a matrix you [39:46] need to do a transpose. We'll use [39:48] this formula later on in the final [39:50] attention formula, but for now just [39:52] remember that there is this kind of [39:54] formula. As a next step, once you have [39:57] computed these attention scores or [39:59] attention weights, you need to find the [40:03] value vectors. So this was a [40:05] descriptive understanding of the value [40:07] vector, but the way value vectors are [40:09] derived is similar: for each of the [40:12] tokens you get the positional embedding (from the [40:15] static embedding), then you multiply that [40:18] with another matrix called WV and you get V1, [40:22] and here WV knows how to encode the value of [40:26] a token for attention computation. [40:29] You do that for all the words, so V1, [40:32] V2, V3, V4, V7, and so on. So for all [40:37] these words you will get their values, [40:41] and you multiply each with its weight, so 
[40:45] you will take a larger component from this [40:47] V4 vector because its weight is 36 percent, [40:50] but the component that you will use from [40:52] V1 will be much smaller, 7 percent. So [40:56] see, of the sweetness you're taking [41:00] 36%. Here I don't have things in order, [41:03] but you essentially add all the weighted vectors. [41:06] You just add all of this, [41:09] everything. Here I'm not showing [41:11] everything, but you get the idea: [41:13] from the static embedding you go all the [41:16] way to the context-aware embedding. Here's [41:19] the mathematical formula for attention: [41:21] attention(Q, K, V) = softmax(Q KT / sqrt(dk)) V, where dk is the dimension of a key [41:26] vector. [41:27] In the case of GPT this is 128. What they [41:31] do is take the entire [41:36] 12,288 dimension (for GPT the [41:40] dimension of the contextual embeddings [41:42] is 12,288) and divide it by the [41:46] number of attention heads, which I think for [41:48] GPT is 96, and that's how you get 128. I [41:52] will explain this 96 a little later, but [41:54] that is the way to derive this number 128. [41:58] You divide by the square root of [42:02] that just for numerical stability; you [42:04] don't want these dot products to become [42:06] very large, so to bring down that [42:08] number we do this kind of scaling here. Then [42:11] you do softmax and you multiply that [42:14] with this value V. So far what we talked [42:17] about is a single attention block; [42:20] actually there are multiple attention [42:23] blocks, so that's what we'll cover next. [42:25] Let's understand what multi-head [42:28] attention is. So far we have seen this [42:31] picture where you take the positional [42:34] embedding for each of the words in your [42:35] input sequence and you let it go through an [42:39] attention head, which is basically taking [42:42] this WQ, WK, [42:45] WV and coming up with a context-aware [42:49] embedding. That whole portion is [42:51] called one attention head. In reality you [42:55] have multiple attention heads. 
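The single attention head walked through above (project each embedding with WQ, WK, WV, take scaled dot products, apply softmax, then form a weighted sum of the values) can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy embeddings, the tiny sizes (embedding size 4, head size 2), the hand-picked matrices, and the helper names `matvec`, `softmax`, and `attend` are all made up here; in a real model WQ, WK, and WV are learned during training and are far larger.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query_index, embeddings, w_q, w_k, w_v):
    """Context-aware vector for one token from a single attention head."""
    q = matvec(w_q, embeddings[query_index])      # query for the chosen token
    keys = [matvec(w_k, e) for e in embeddings]   # one key per token
    values = [matvec(w_v, e) for e in embeddings] # one value per token
    d_k = len(q)
    # Scaled dot product between the query and every key.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output component at a time.
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context


# Hypothetical toy data: 3 tokens, embedding size 4, head size 2.
embeddings = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
w_q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_v = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

weights, context = attend(2, embeddings, w_q, w_k, w_v)
print("attention weights:", weights)
print("context vector:", context)

# Head size as derived in the talk: embedding dimension / number of heads.
head_size = 12288 // 96
print("head size:", head_size)
```

The weights always sum to one, so they can be read as the percentages mentioned above, and the last lines reproduce the 12,288 / 96 = 128 arithmetic for the key dimension.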
Okay, so [42:57] you have multiple attention heads; each [43:00] of these heads is producing its own [43:02] context-aware embedding, and you will [43:05] add them all up together to get the [43:09] final context-aware embedding. Now what [43:12] is the purpose of these multiple [43:13] attention heads? One attention head will [43:16] be working on adjectives; for the [43:19] word dish, sweet, Indian, rice, etc. are [43:22] adjectives. The second attention head [43:26] might be working on a verb, so [43:29] how this verb made affects the [43:32] contextual embedding of the word dish. [43:34] The third attention head might be [43:36] looking at pronouns. So you can think of [43:40] this as looking at different aspects of [43:43] a language or different aspects of that [43:47] context. For the other sentence, the [43:49] first attention head might be looking at [43:51] cultural context, such as dosa, dhokla, [43:54] millet bread being all Indian delicacies, [43:57] whereas the second attention head might [43:58] be looking at the pronoun, where instead [44:02] of the and B if I exchange the order of [44:05] these two, here you will have Italian. [44:07] Similarly, instead of you, if I say I, [44:11] again here you will have a different [44:13] word, so there is a pronoun context. The [44:15] third attention head might be looking [44:17] at action and timing: you're [44:20] driving, 20 minutes' drive, etc. So the [44:23] purpose of multiple attention heads is to [44:26] allow the model [44:27] to focus on different aspects or [44:32] different types of relationships between [44:35] tokens in a language (when you have [44:37] multiple tokens there are different [44:39] types of relationships between these [44:41] tokens, such as semantic, positional, [44:45] syntactic) [44:47] simultaneously, enriching the [44:49] contextual understanding of each [44:53] token. I want you to read this [44:55] sentence again. I hope you get the [44:57] idea: it is 
basically looking at [44:59] different aspects or different [45:01] relationships between the tokens to [45:04] enrich the [45:05] contextual understanding of each token. [45:09] So here in this particular architecture [45:12] diagram, see, first we produced this [45:15] static embedding, then we added this [45:17] positional encoding, so you got the [45:19] positional encoding here. Ignore [45:22] this normalization part for now; [45:23] normalization is simple actually, it's [45:25] normalizing the values so that they have [45:28] zero mean and a standard deviation of one. [45:31] Then you look at the V, K, Q matrices [45:36] to use multi-headed attention to [45:40] derive EC1, EC2, these [45:44] individual contextual embeddings, and [45:46] you add all of them up to produce your [45:51] final context-aware embedding, which [45:53] will come here. And by the way, this is a [45:56] residual connection; if you know about [45:58] deep learning you will have seen this [46:03] residual connection, which helps you [46:06] with a smooth gradient flow. After this [46:09] block the next block is the feed-forward [46:12] network. So you'll ask me: okay, I already [46:15] have a context-aware embedding, now why do I [46:18] need this feed-forward network? Well, the [46:21] thing is you don't have your final [46:23] context-aware embedding yet. Here at [46:26] this point [46:27] the embeddings are enriched but they are [46:30] still not fully furnished; you have [46:33] to let them go through this feed-forward [46:36] network. So what happens is you passed [46:40] your positional embedding through a bunch [46:41] of attention heads and you got this [46:44] enriched contextual embedding; that will [46:47] go through a fully connected neural [46:49] network layer. Here the number of input [46:53] neurons will be the same as the number of [46:56] elements in this embedding, so in the case of [46:59] BERT let's say this will be [47:01] 768, for GPT it will be [47:05] 12,288, and then in the hidden layer you 
can [47:08] have N number of neurons, and in the [47:11] output layer again you'll have the same as [47:13] this one, because this input and this [47:16] output will have the same size: if this [47:18] is 768, this will also be 768. So you [47:21] let it go through this feed-forward [47:24] network and the resulting embedding that [47:28] you get is even more enriched; it's like a [47:30] more furnished product. Now this neural [47:34] network will have [47:36] a lot of weights and parameters. Those [47:38] weights and parameters are set, once [47:40] again, during that training process. So [47:43] when you're going through these X, Y pairs, [47:45] your training pairs (you might have [47:48] hundreds of thousands of these [47:50] sentences when you're training that [47:51] network), during that training, look at [47:54] this feed-forward network: [47:55] during back propagation [47:57] those weights are getting adjusted, and [48:00] it will help you refine your sentence [48:03] further. Now once you get the enriched [48:06] embedding, you will add that to your [48:08] original embedding and you get the final one; [48:11] now it is final, now it's a final [48:14] contextually rich embedding. So the [48:16] purpose of the feed-forward network is that it [48:19] will enrich each token embedding by [48:22] applying a nonlinear transformation. [48:25] Because in the attention head you are [48:27] applying linear transformations, here you [48:30] get an opportunity to apply a nonlinear [48:32] transformation independently, enabling [48:35] the model to learn complex patterns and [48:38] higher-order features beyond just the [48:41] contextual relationships. See, multi-head [48:43] attention is just capturing those [48:44] contextual relationships, how these words [48:47] are related to each other, but language [48:49] is nonlinear; it's not just the [48:51] relationships, there are [48:52] nuances, nonlinearity, complexity. All of [48:56] that can be captured by this fully 
[48:58] connected layer or feed-forward network. [49:01] So to better visualize: for each of the words [49:04] in your sentence, let's say you have I [49:06] made dish, every word will go through [49:09] positional embedding, and every [49:11] positional embedding goes through [49:13] multiple attention heads. So the embedding [49:15] for I will go through all the heads; [49:18] in GPT, if you have 96 heads, it will [49:21] go through all those 96 heads. Similarly [49:24] made will also go through 96 heads, and [49:26] this is happening in parallel; it's [49:28] not like you process I first and then made. No, [49:31] all of these things are happening in [49:33] parallel, and each of these vectors [49:36] will also go through the feed-forward [49:38] network in parallel, at the same time. So [49:41] the same network is available for each [49:42] of these words and you get all these [49:45] contextually enriched embeddings. [49:48] That comes here: after the feed-forward [49:51] network, at this point, you get all [49:55] these embeddings, and then you have [49:59] this plus sign and normalization. [50:02] The normalization layer, by the way, this Norm, [50:05] is just ensuring that you have [50:08] stable learning, improving the gradient [50:10] flow. If you have deep learning [50:12] fundamentals you will understand what I [50:14] mean: in machine learning generally, [50:16] when you have a wide range of [50:19] values, if you normalize them, let's say [50:21] you normalize them to zero and one, you [50:23] get better control over your training. [50:26] Now you will also notice this Nx layers. [50:29] Nx layers basically means, for BERT let's say, [50:33] if you have a BERT base model, you have 12 [50:36] such layers. So this is a [50:37] Transformer block and you repeat it: [50:40] you have one block, then after that [50:42] you have another block. In the case of the BERT [50:44] base model you have 12 layers, for BERT large [50:47] you have 24 layers, and in the case of GPT again 
[50:51] there will be a different number of layers, [50:52] so that's what this Nx layers means. All [50:56] right, finally we are done with [50:58] understanding the encoder. I just want to [51:00] summarize: we had an input sequence, we [51:03] generated a static embedding here, then [51:06] here we generated a positional embedding, [51:08] then we have one Transformer block or Nx [51:11] layer where we first normalize, then we use [51:14] Q, K, V to compute the attention scores or [51:17] attention weights. We have multiple such [51:20] heads, and then you go through [51:22] normalization, you have the feed-forward [51:24] network, and you add: remember, [51:27] you have the original embedding and then you [51:28] add that output and you get the final [51:32] contextual embedding. You normalize it, [51:34] and here at this point you are getting a [51:37] final [51:38] contextual, very enriched embedding. We [51:41] have covered most of the Transformer [51:43] architecture. The decoder is not going to [51:45] take much time, so let's spend a few [51:48] minutes understanding the decoder. The [51:51] output of the encoder is a contextual [51:54] embedding or context-rich embedding, [51:57] which you give as an input to the decoder, [52:00] and the decoder will produce the next word [52:03] if you're working on next-word [52:04] prediction. If you're working on language [52:07] translation, it will start with this [52:10] special token called start and then it [52:12] will produce maine, then another word, kheer, [52:15] banai, and so on. So that's the goal of [52:18] a decoder. Now here you will notice one [52:22] thing which is called [52:24] multi-headed cross attention. So [52:27] let's understand what exactly cross [52:29] attention is. Let's say you have this [52:31] sentence, I made kheer, which you want to [52:33] translate into Hindi. Here you will have [52:36] key vectors and value vectors as we have [52:39] discussed before, but the query vector [52:41] will be a little different. The query vector [52:42] will be 
start, and it will be like: I'm [52:45] starting to generate the translation, what [52:47] part of the input should I focus on? And [52:49] then when you have the next word, which is [52:51] maine, which is the first word in your [52:54] translation, it will be like: I generated [52:57] the subject, what is the object? [53:00] Then you will have here: I generated the [53:02] subject and object, help me complete the [53:05] sentence with a verb form. So the query [53:08] part is a little different. See, in the [53:11] previous example you had only one [53:13] sentence, I made kheer, so you'll [53:15] generate a query from made, let's say, and [53:18] you will have key and value from the same [53:20] sentence. In the case of language translation [53:23] it's a little different, so here we need [53:25] to use cross attention. Why cross [53:28] attention? Because the query is taken [53:30] from the translated sentence in Hindi, [53:34] whereas the keys and values are taken [53:37] from the original sentence in English. [53:40] That is why it's called cross attention. [53:43] In the diagram you can see here the V [53:45] and K values are coming from your [53:47] encoder; the encoder has processed this I made [53:50] kheer. So V and K are coming from the [53:54] encoder, see, this is the arrow, whereas Q is [53:58] coming from the decoder itself. [54:01] You know that the Hindi sequence will be [54:03] produced here, maine kheer banai, etc., so that [54:07] query part is coming from here. That is [54:10] why it is called cross attention. In the [54:13] case of BERT, we all know that there is only the [54:15] encoder part; the decoder part is not there. [54:18] In the case of GPT the Transformer [54:20] architecture is a little different. Okay, so [54:22] that's the decoder part; the remaining part [54:24] we have already understood. Now let me [54:27] show one nice tool which can visually [54:30] show you this architecture. Someone has [54:32] built this nice visualization tool; you [54:34] can go to poloclub.github.io/transformer-explainer
[54:36] and here you can [54:38] look into different examples. [54:41] For example, let's look at this sentence: [54:43] as the spaceship was approaching the; it [54:47] will try to autocomplete that word and [54:50] say station. Now for each of these [54:53] words, as, the, spaceship, first you have [54:57] this dropout layer. We're talking about a [55:00] Transformer here, so it will have a [55:02] dropout layer; the architecture is a little [55:05] customized compared to the base [55:07] Transformer architecture. Then you have [55:09] this residual connection; the [55:11] residual connection will take you from [55:13] here to here. If you talk about [55:15] embeddings, you have token embeddings for [55:18] each of these, see, 768 is the size. [55:21] Then you have the positional encoding; [55:24] you add the position information and you get the [55:25] final vector. This is the [55:27] positional embedding of the sentence, and [55:31] you have the residual connection. Then you [55:33] have the QKV computation: if you look at [55:37] this particular block here, QKV, it will [55:41] visually show you how you [55:43] compute Q, K, V and get all those three [55:47] vectors, the query vector, key [55:50] vector, and value vector. Then you [55:54] have this output, which you feed to the [55:57] MLP, which is the feed-forward neural network [56:00] that we talked about. So feed- [56:01] forward neural network, then you again [56:03] have a residual block, and then you have [56:05] layer normalization, and this is one [56:08] Transformer block. You have multiple of [56:10] them, like 11. So that Nx layer [56:13] that I was referring to is one block, and [56:16] you have repeated blocks, and in the end [56:18] you get this kind of softmax [56:20] probability; see, the highest probability here is for [56:23] station. So just play with this [56:26] particular tool to get a better [56:29] understanding of this thing. And I want [56:31] to give credit to 
this amazing [56:34] channel called 3Blue1Brown. If [56:37] you go to YouTube and type in [56:40] Transformer explained 3Blue1Brown you [56:43] will find all these videos. I want you [56:45] to watch from video number 5, 6, 7, 8 onwards; [56:50] he will have more videos as well, [56:52] but especially these three videos, DL5, DL6, DL7, [56:57] these three videos you must watch. They [56:59] will enhance your understanding further. [57:02] I myself learned a lot from this channel, [57:06] so due credit to 3Blue1Brown. [57:09] All right, that's it folks, that [57:11] was about Transformers. I know it was a [57:12] long discussion and there were many topics [57:15] that we covered, but hopefully your [57:18] understanding is clear. If you have any [57:20] questions please feel free to ask. [57:23] [Music]