[0:00] ChatGPT is powered by a model called
[0:02] GPT which is based on a deep learning
[0:05] architecture called Transformers
[0:07] Transformers is the reason behind modern
[0:09] day AI boom as an AI Enthusiast when you
[0:12] start learning Transformers you will
[0:14] come across this complex diagram which
[0:16] will start giving you a headache
[0:18] immediately my goal for today's video is
[0:21] to explain Transformers to you in the most
[0:23] simplified and intuitive manner we need
[0:26] to cover many different topics so this
[0:28] is going to be a long video attention
[0:31] and patience is all I need from you
[0:33] today when you type in a sentence in
[0:35] Gmail it tries to predict next word or
[0:37] next set of words this is possible
[0:39] because of a machine learning model
[0:41] called language model Google for example
[0:44] has this popular language model called
[0:46] BERT which is powering hundreds of AI
[0:49] applications throughout the world GPT
[0:52] which is the model behind ChatGPT is a
[0:55] large language model the reason it is
[0:57] called large language model is because
[0:59] it has billions of parameters it is much
[1:02] more capable and Powerful compared to
[1:05] BERT and it is trained on a humongous
[1:07] amount of data fundamentally though it
[1:10] is also doing the same thing which is
[1:12] when you type in a question in ChatGPT
[1:15] it will predict the next word in that
[1:18] sentence and then it will take the
[1:20] original question and the next predicted
[1:23] word as an input and then predict the
[1:25] next word and then the next word and so
[1:28] on in the end it produces a complete
[1:31] answer which almost sounds like magic
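The predict-then-feed-back loop described above can be sketched in a few lines. This is only an illustration: `predict_next` is a hypothetical stand-in for a real language model, and its lookup table and the `<end>` marker are invented for the example. A real model scores every word in its vocabulary using the whole text so far, while this toy only looks at the last word.

```python
def predict_next(tokens):
    """Hypothetical stand-in for a language model.

    A real model would use all of `tokens`; this toy looks only at the last word.
    """
    table = {
        "the": "train",
        "train": "runs",
        "runs": "on",
        "on": "tracks",
        "tracks": "<end>",
    }
    return table.get(tokens[-1], "<end>")


def generate(prompt, max_new_tokens=10):
    """Repeatedly predict one word and append it to the input."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_word = predict_next(tokens)
        if next_word == "<end>":
            break
        # The predicted word becomes part of the input for the next step.
        tokens.append(next_word)
    return " ".join(tokens)


print(generate("the"))
```

The important part is the loop in `generate`: each new word is produced from the original prompt plus everything generated so far.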
[1:33] to summarize the goal of a language
[1:36] model is to predict the next word in a
[1:39] sentence now that we have understood
[1:41] this fundamental let's look into some of
[1:44] the topics which need to be clarified
[1:46] before we dig into the actual
[1:48] architecture the first concept we need
[1:50] to understand is word embedding machine
[1:53] learning models do not understand text
[1:56] they understand numbers so we need to
[1:58] represent text as numbers let's say you
[2:01] have this word King you want to
[2:03] numerically represent it how would you
[2:06] do that well you can assign a fixed
[2:08] number you can have a vocabulary and you
[2:10] can assign just fixed static number but
[2:13] that will not capture the meaning of it
[2:15] when you're building a language model
[2:18] you have to represent words in such a
[2:21] way that they capture the meaning of
[2:24] that word one way to capture the meaning
[2:27] of this word King and represent it
[2:30] numerically is to ask a bunch of questions
[2:33] for example does this person have
[2:35] Authority yes one do they have a tail no
[2:38] horse has a tail King doesn't have a
[2:40] tail are they Rich yes gender minus one
[2:42] is male one is female and so on what we
[2:46] just did is we created this Vector list
[2:50] of numbers which is a vector to
[2:53] represent this word King similarly we
[2:55] can represent the word queen as well not
[2:58] only that we can represent a bunch of
[3:01] words such as battle horse King and so
[3:04] on by asking set of questions okay so
[3:07] for battle they don't have authority are
[3:09] they an event yes does a battle have a tail no
[3:13] battle is an event it doesn't have a
[3:15] tail and so on similarly horse do they
[3:17] have authority well we'll just say 0.1 if
[3:21] it is King's horse maybe they have some
[3:23] Authority or maybe they have authority
[3:25] over their horse kids and so on
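The question-and-answer idea above can be written down as a small table of vectors. The sketch below is illustrative only: the five questions and every number in it are made up (a few follow the spoken example, the rest are assumptions), and `cosine_similarity` is just one common way to compare two vectors.

```python
import math

# Made-up questions; each answer becomes one number in the vector.
FEATURES = ["authority", "is_event", "has_tail", "rich", "gender"]

# Illustrative values only (gender: -1 male, 1 female, 0 not applicable).
WORD_VECTORS = {
    "king":   [1.0, 0.0, 0.0, 1.0, -1.0],
    "queen":  [1.0, 0.0, 0.0, 1.0, 1.0],
    "horse":  [0.1, 0.0, 1.0, 0.0, 0.0],
    "battle": [0.0, 1.0, 0.0, 0.0, 0.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for word in ("queen", "horse", "battle"):
    score = cosine_similarity(WORD_VECTORS["king"], WORD_VECTORS[word])
    print(f"king vs {word}: {score:.2f}")
```

With these toy numbers, king scores closer to queen than to horse or battle, which is the kind of meaning-aware comparison a plain vocabulary index cannot give.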
[3:29] similarly we can represent all these
[3:31] words in a numeric format and then we
[3:34] can take the vector of this word King
[3:39] and maybe we can do some mathematics
[3:41] with it we can say King minus man so
[3:45] here I'm taking the vector of man right
[3:49] which is uh this particular Vector plus
[3:51] woman which is this particular vector
[3:54] and when you do the math which is like 1
[3:56] - 2 + 2 will be
[4:00] Z and so on you get a vector which looks
[4:04] similar to Queen now this sounds like a
[4:06] magic we can do math with Words which is
[4:09] King minus man plus woman is equal to
[4:13] Queen here King was represented in five
[4:16] Dimensions when you look at the real
[4:19] life for example Google's word2vec
[4:22] model it has 300 dimensions and what are
[4:26] all these questions by the way well we
[4:28] actually don't know this has been
[4:30] trained through a neural network and we
[4:34] have processed huge amount of text such
[4:36] as all the Wikipedia articles all the
[4:39] books and text on internet to understand
[4:42] the relationship between these words and
[4:45] through that neural network training
[4:47] back propagation we came up with this
[4:49] Vector the example that I gave for King
[4:53] where we asked these questions Authority
[4:55] and so on that was just a madeup example
[4:58] for building intuition for word
[5:01] embeddings in real life we do not know
[5:05] what all these numbers mean all we can
[5:08] say is these numbers are the features
[5:11] for this word and they capture the
[5:14] meaning of this word King let's say king
[5:16] is a three-dimensional embedding if you
[5:20] have to represent that in this 3D
[5:23] embedding space you can represent it
[5:25] like this where x axis has three like
[5:28] this three number y-axis has this eight
[5:30] number z-axis has this two number and so
[5:33] on I use three dimensions because as
[5:35] humans we can view only three dimensions
[5:38] we can't possibly view this 300
[5:40] Dimension okay but mathematically those
[5:43] 300 dimensions are possible models like
[5:46] GPT use an embedding vector which has
[5:49] even 12,000 dimensions okay so it's a
[5:53] very rich High dimensional space that we
[5:55] are working with in three-dimensional
[5:57] space you can have vectors for king
[5:59] and queen that look something like this
[6:02] and if you look at this Vector which is
[6:05] joining king and queen you can think of
[6:08] that as a gender Direction the benefit
[6:11] of this gender Direction Vector is that
[6:14] when you have another embedding for
[6:16] Uncle you can add that gender Direction
[6:19] and get the embedding for word Aunt
[6:23] similarly if you have father you can get
[6:25] mother if you have man you can uh get
[6:28] woman and so on and that allows you to
[6:30] do this amazing math such as queen minus
[6:33] king plus uncle is equal to aunt
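The vector arithmetic can be tried out on toy numbers. In the sketch below every embedding is invented and only three-dimensional, and `nearest_word` is a hypothetical helper that picks the closest known word. The gender direction is taken as the vector from king to queen (queen minus king), and adding it to uncle should land on aunt.

```python
import math

# Toy 3-dimensional embeddings (all numbers invented for illustration).
EMBEDDINGS = {
    "king":  [0.9, 0.1, -1.0],
    "queen": [0.9, 0.1, 1.0],
    "man":   [0.1, 0.2, -1.0],
    "woman": [0.1, 0.2, 1.0],
    "uncle": [0.1, 0.9, -1.0],
    "aunt":  [0.1, 0.9, 1.0],
}


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def nearest_word(vector, exclude=()):
    """Return the known word whose embedding is closest to `vector`."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return min(candidates, key=lambda w: math.dist(vector, EMBEDDINGS[w]))


# king - man + woman
result = add(subtract(EMBEDDINGS["king"], EMBEDDINGS["man"]), EMBEDDINGS["woman"])
print(nearest_word(result, exclude=("king", "man", "woman")))

# Gender direction: the vector that goes from king to queen.
gender_direction = subtract(EMBEDDINGS["queen"], EMBEDDINGS["king"])
print(nearest_word(add(EMBEDDINGS["uncle"], gender_direction), exclude=("uncle",)))
```

Real embeddings are learned rather than hand-written, so the match is approximate there; that is why the lookup searches for the nearest word instead of testing for exact equality.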
[6:35] another example is you can have country
[6:38] to capital city direction vector which
[6:41] you can add to the embedding
[6:45] of Russia to get the embedding of Moscow
[6:48] you can do Russia minus Moscow plus
[6:50] Delhi equal to India now the embedding
[6:53] that we're talking about are static
[6:55] embeddings word2vec and GloVe are two popular
[6:59] models
[7:00] which help you get the static embedding
[7:02] static means fixed embedding for all
[7:05] these words you may ask how these
[7:07] embeddings are generated well I already
[7:09] answered the question which is you train
[7:11] a neural network model on humongous
[7:14] amount of text Wikipedia books and so on
[7:16] to understand the relationship between
[7:19] the words I'm not going to go into the
[7:21] mathematical details of word2vec you can
[7:24] refer to some other material on internet
[7:27] I have YouTube videos for that I'm not
[7:30] going to go into the math of that but
[7:33] just think that uh the neural network
[7:35] tries to understand the relationship
[7:37] between these words and creates these
[7:39] static embeddings now the problem with
[7:41] static embedding is that you can have a
[7:44] static embedding for this word track but
[7:48] based on the sentence that track can
[7:50] mean different things right like here
[7:52] I'm saying the train will run on the
[7:54] track and my package is late help me
[7:56] track it so the meaning of track is
[7:59] little different and when you have
[8:00] static embedding you get into this
[8:03] problem where you're not able to
[8:04] represent this word properly based on
[8:07] the context of this sentence you will
[8:10] see same issue here for Dish you can
[8:13] have a fixed embedding but in this
[8:16] sentence I'm talking about rice dish if
[8:19] I had a cheese dish then the embedding
[8:21] of dish should be a little different
[8:23] because the meaning of that dish word is
[8:26] little different when I say rice dish
[8:27] versus cheese dish when you are working
[8:30] on predicting the next word for this
[8:33] sentence you can have words such as
[8:35] risotto idli Mexican rice but when I say
[8:39] I made an Indian rice dish called all of a
[8:42] sudden the probabilities of my next
[8:44] words will change I will have words such
[8:47] as idli biryani kheer if I add one more
[8:51] adjective and say I made a sweet Indian
[8:53] rice dish in that case again it will
[8:56] change I will not have Biryani as a next
[8:59] word prediction I will probably have kheer
[9:01] or Pongal to summarize to build an
[9:04] application like ChatGPT just the static
[9:08] word embeddings are not enough what you
[9:10] need is contextual embedding let's
[9:13] understand contextual embedding a bit
[9:15] more in detail when you represent this
[9:18] word dish in your embedding space it is
[9:21] a static embedding when you say rice
[9:25] dish maybe there is another vector in
[9:28] the same space which can accurately
[9:31] describe rice dish or which can correctly
[9:36] capture the meaning of the words rice dish
[9:38] which can be risotto biryani and so on the
[9:41] direction from dish to rice dish we can
[9:43] call it riceness and when you add that riceness
[9:46] vector to dish what you get is the
[9:49] embedding for a rice dish there is
[9:52] another vector or embedding for Indian
[9:55] rice dish and to go from rice dish to
[9:57] Indian rice dish you need to probably add
[9:59] this vector or a direction called
[10:01] indianness and same thing for sweet
[10:04] Indian rice dish in order to generate
[10:07] contextual embedding what we need to do
[10:10] is take the original static embedding
[10:13] for the word dish and have all these
[10:16] other words influence that static
[10:19] embedding or change that static
[10:22] embedding so that it can capture the
[10:25] meaning of all these adjectives once you
[10:27] have done that and once you have a
[10:29] contextual embedding for dish it won't
[10:31] be hard to predict the next word which
[10:33] is kheer look at this other sentence where
[10:37] I'm saying Dhaval loves dosa dhokla and millet
[10:40] and so on Bin loves pasta and so on they
[10:43] both went out for a dinner and here Bin
[10:47] said bro we'll go to a restaurant that
[10:49] you like and after some time they were
[10:52] in Indian restaurant now the way you
[10:55] predicted this word Indian was based on
[10:58] this context
[11:00] such as Dhaval loves all these items which are
[11:03] part of the Indian cuisine also Bin said
[11:08] to Dhaval bro we'll go to a restaurant you
[11:11] like if Bin said here instead of you if
[11:14] he had said I this word will become
[11:18] Italian instead of Indian right also
[11:22] instead of Bin if there was Dhaval here
[11:25] then also this thing will become Italian
[11:29] so you can understand that this uh
[11:31] prediction Indian uh is influenced by
[11:35] not just the few words which are prior
[11:38] to that word but it can be influenced by
[11:41] some words which are far out in that
[11:43] paragraph okay to summarize the
[11:45] objective for this intelligent teacher
[11:48] cat is to generate a contextual
[11:52] embedding and if you think about this
[11:53] embedding space mathematically speaking
[11:56] what you're doing is taking the static
[11:59] embedding for the word dish and then
[12:01] adding the embeddings for all these
[12:05] vectors riceness indianness all these
[12:07] adjectives and getting your final
[12:10] contextual embedding let's now dig into
[12:13] the Transformer architecture the
[12:14] architecture has two components encoder
[12:18] and decoder the purpose of encoder is to
[12:21] take the input sentence and generate the
[12:24] contextual embedding for each of the
[12:26] words or each of the tokens in that
[12:29] sentence once the contextual embedding
[12:32] is generated we feed that to a decoder
[12:36] here and we try to predict the next word
[12:39] so if you're working on the next word
[12:42] prediction you will predict the next
[12:44] word here for example it will be here
[12:46] when you talk about natural language
[12:48] processing there are multiple tasks so
[12:50] one task is to predict the next word the
[12:53] other task would be to translate the
[12:55] sentence here I'm translating the
[12:58] sentence from English English to Hindi
[13:00] in that case you will still produce the
[13:03] contextual embedding from encoder you
[13:06] feed that to decoder and decoder will
[13:10] start predicting the next word so here
[13:12] it will start with this fixed start
[13:15] token and then it will uh produce the
[13:19] probability of the next word so here the
[13:22] probability of this word maine is highest
[13:25] and here you can have the entire
[13:27] vocabulary for example in the case of
[13:31] BERT you have some 30,000 words so
[13:34] you'll have all the words in your
[13:36] language and you will say okay what is
[13:38] the highest probability of my next word
[13:41] then you put that word man into this
[13:44] input okay so there are two inputs
[13:47] actually one is the contextual embedding
[13:49] which is coming uh from here from the
[13:52] encoder and the other one is whatever
[13:55] output you have produced so far from the
[13:57] previous step you feed that okay as an
[14:00] input here and then it will produce the
[14:03] next word which is kheer once again you
[14:06] provide kheer here and it produces banai
[14:10] So eventually it produces the entire
[14:13] translated sentence all right so that's
[14:15] the objective of your encoder and
[14:18] decoder part whatever I talked about so
[14:21] far I was referring to an inference
[14:25] stage of uh neural networks whenever you
[14:28] have these uh deep learning models you
[14:31] have two stages one stage is the model
[14:33] is not trained it's like a baby baby is
[14:35] not trained yet and you train them right
[14:38] you send them to school you train them
[14:41] uh in your home at some point they
[14:43] become adult and they can figure things
[14:45] out on their own similarly a machine
[14:48] learning model goes through a training
[14:49] phase and when it is ready it starts
[14:53] working on the real world problem and
[14:55] that is called inference so whatever I
[14:57] talked about for for predicting the next
[15:00] word or translating sentence I was
[15:03] referring to inference stage throughout
[15:05] the discussion we'll be referring to two
[15:07] specific models called BERT and GPT if
[15:11] you look at this architecture that's a
[15:13] generic architecture for a Transformer
[15:15] model Transformer model is a general
[15:18] concept whereas BERT and GPT are
[15:21] specific models or specific
[15:23] implementations based on Transformer
[15:26] architecture if you look at BERT
[15:27] architecture it has only the encoder
[15:31] part okay so only this part decoder part
[15:34] is missing so BERT will take the input
[15:36] sentence it will produce the contextual
[15:40] embedding and that's it whereas GPT has
[15:42] only decoder so it still takes the input
[15:45] it will produce the contextual embedding
[15:47] and so on and then here it will predict
[15:51] the next word I mean it sounds like it
[15:53] has encoder decoder both but
[15:56] fundamentally the architecture looks
[15:58] little different for GPT but it is still
[16:00] based on the Transformer architecture
[16:04] the way they're trained is you take all
[16:06] the text from Wikipedia crawl text from
[16:08] internet book Corpus and you train these
[16:12] models when you have this article for
[16:14] example and you are having this sentence
[16:16] developing an advanced crewed see if I
[16:19] give you this sentence most likely you
[16:20] will say spacecraft or vehicle as a next
[16:24] word you'll not say banana right like so
[16:26] probability of having banana as a next
[16:28] word in this sentence is very low
[16:30] whereas these two words have a high
[16:32] probability so we as humans have read so
[16:35] much text so now we have learned this
[16:38] art of predicting next word and same
[16:40] thing goes on for BERT and GPT where they
[16:45] understand the relationship between the
[16:47] words the context in which they appear
[16:49] so let's say if BERT during the training
[16:52] has encountered so much text and every
[16:55] time after this word crewed if it has
[16:59] seen this word spacecraft or vehicle
[17:03] uh it would not have seen words like
[17:05] crewed chair or crewed banana right that
[17:08] kind of words usually when it is going
[17:10] through training it will not see it so
[17:12] it will learn to predict the high
[17:14] probability words right so for word like
[17:16] banana probability is going to be lower
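A very rough way to see how next-word probabilities can come out of text is to count which word follows which. The sketch below does exactly that on a tiny invented corpus, assuming the example sentence is about a crewed spacecraft. This counting scheme is a deliberate simplification and not how BERT or GPT are trained; they learn such regularities with a neural network that looks at much more context than one previous word.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for Wikipedia, books, and web text.
CORPUS = [
    "they are developing an advanced crewed spacecraft",
    "the agency tested a crewed spacecraft",
    "engineers built a crewed vehicle",
    "she ate a banana",
]


def count_next_words(sentences):
    """For every word, count the words seen immediately after it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def next_word_probabilities(following, word):
    """Turn the raw counts for `word` into probabilities."""
    counts = following[word]
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


FOLLOWING = count_next_words(CORPUS)
PROBS = next_word_probabilities(FOLLOWING, "crewed")
print(PROBS)
print(PROBS.get("banana", 0.0))
```

Because banana never follows crewed in this corpus, it gets no probability at all, while spacecraft and vehicle share it between them.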
[17:19] same thing for this article when you go
[17:21] through this kind of sentences right SI
[17:24] engaging both alliances and hostilities
[17:27] and there will be many more articles on
[17:29] battles and Warriors and everywhere
[17:32] after alliances there will be either
[17:34] hostilities or negotiations there will
[17:37] not be a word like chair so probability
[17:39] of that will be very very low now when
[17:43] you go through the training you are
[17:45] going through all these words right so
[17:47] all these words will form something
[17:49] called a vocabulary so for a model a
[17:53] vocabulary will look something like this
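As a rough picture, a vocabulary can be thought of as a lookup table from each distinct word to an index. The sketch below builds one from two invented sentences. The function names are hypothetical, splitting on spaces is a simplification, and the alphabetical ordering is an arbitrary choice made so the indices are reproducible.

```python
def build_vocabulary(sentences):
    """Map every distinct word to an integer index (alphabetical order)."""
    tokens = sorted({word for sentence in sentences for word in sentence.split()})
    return {token: index for index, token in enumerate(tokens)}


def encode(sentence, vocabulary):
    """Replace each word in the sentence by its index in the vocabulary."""
    return [vocabulary[word] for word in sentence.split()]


SENTENCES = [
    "i made a sweet indian rice dish",
    "i made a rice dish",
]
VOCAB = build_vocabulary(SENTENCES)
print(VOCAB)
print(encode("i made a rice dish", VOCAB))
```

Production models ship with a fixed vocabulary that was built once before training, so the index for a given token never changes between runs.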
[17:55] now there is a difference between a word
[17:57] and a token so for example here playing
[18:02] is a word and one of the ways to tokenize
[18:05] is to have two tokens so one token is
[18:08] play second token is ing okay so token
[18:12] wise there are two tokens but word is
[18:14] just one but just for uh understanding
[18:19] purpose just for easy explanation you
[18:21] can think of word as tokens technically
[18:23] they're different but you can think of
[18:25] words as token only okay so let's say
[18:28] you have a vocabulary of all these
[18:30] tokens let's say
[18:32] 30,000 words what happens here is for
[18:36] each of these words during the training
[18:39] it will create those static embeddings
[18:42] so for word made or let's say for word
[18:45] and seven is the index and let's say
[18:49] this is the static embedding vector and
[18:52] the dimension or the size see there is a
[18:54] dot dot dot so what is the size of this
[18:56] well for BERT it is 768 for GPT it's
[19:02] 12,288 right so based on model the
[19:04] dimension of your embedding Vector can
[19:07] vary when you go through this training
[19:11] for every token in your vocabulary you
[19:14] will have this static embedding and this
[19:17] whole table is called Static embedding
[19:21] Matrix during the training it will also
[19:24] learn few other things such as WQ WK WV
[19:28] and you are like what the hell this is
[19:30] well we will talk about this later but
[19:33] for now just remember when you train
[19:35] these models they are having this static
[19:39] word embedding matrix which is the
[19:41] static embedding for every token in your
[19:44] entire vocabulary as well as they're
[19:46] having the special matrices WQ WK WV
[19:50] which we'll talk about later let's have
[19:52] a look inside the encoder and review
[19:55] these specific two steps so you give a
[19:59] sentence to your Transformer and it will
[20:02] first tokenize it tokens are kind of
[20:05] like words but for a word called there
[20:08] will be two tokens call and Ed so it
[20:11] will tokenize it and there are various
[20:13] ways to tokenize your sentence this is
[20:16] one of the ways it will also add special
[20:19] tokens at the end and at the beginning
[20:22] CLS and SEP SEP is for separators so if
[20:25] you have two sentences between two
[20:27] sentences there will be a separator and
[20:29] CLS will be added at the beginning and
[20:32] this I'm talking about BERT then it will
[20:35] also generate token IDs so for each of
[20:38] these words there will be an index into
[20:41] your vocabulary for example made is
[20:44] 2532 which means in your entire
[20:47] vocabulary which is just like a list
[20:49] made word is at position
[20:54] 2532 if you talk about BERT it has a total of
[20:58] 30,522 tokens and GPT has around
[21:03] 50,000 tokens from these token ID so
[21:07] step number one was generate token and
[21:09] token IDs then you uh get the static
[21:14] embedding for each of these tokens and
[21:16] from where do you get it well we just
[21:18] saw right during the training you are
[21:21] generating this static word embedding
[21:23] matrix so for each of the words or
[21:26] tokens you have
[21:29] the static word embedding so in the case
[21:32] of BERT the size of this will be 768 if
[21:36] it is GPT it will be 12,000 you know
[21:38] that long embedding vector so you
[21:41] produce that for each of the tokens and
[21:44] then you will also create something
[21:46] called a positional embedding now in the
[21:49] language the word order matters okay so
[21:52] if I put made before I it will change
[21:55] the meaning of that sentence so the
[21:56] order matters and the way Transformer
[21:59] works is it will process the entire
[22:02] input or sequence in parallel it is not
[22:05] like RNN where it will process these
[22:08] words one by one it will process this
[22:11] sequence All In Parallel now it needs to
[22:15] have knowledge on the order okay so for
[22:18] that it uses a special technique called
[22:21] positional embedding where it will add a
[22:26] small Vector in each of these embeddings
[22:29] okay so let's say this is the vector for
[22:32] position number one this is the vector
[22:34] for position number two and so on and
[22:37] when you get uh this resulting Vector
[22:41] this Vector will embed the knowledge of
[22:45] position so this Vector will have a
[22:48] knowledge that this is the first word
[22:50] this Vector will have a knowledge that
[22:52] this is the second word now how exactly
[22:55] that is done well there is a math behind
[22:57] it I'm not going to go into the math but
[22:59] I'm showing you the formula from the
[23:01] original Transformer paper so using this
[23:04] formula you are essentially uh deriving
[23:07] all these positional embeddings all
[23:10] right so that was step two the first
[23:13] step was to produce the static embedding
[23:16] for each of the tokens and then the
[23:19] second step is to add positional
[23:21] embedding like this is a plus sign so
[23:23] here at this point what you get is this
[23:27] kind of position
[23:29] embedding just like how my nephew needs
[23:31] my attention words also need attention
[23:34] of surrounding words in order to produce
[23:38] the contextual embedding in
[23:42] 2017 a groundbreaking research was done
[23:46] when this paper attention is all you
[23:48] need was published by bunch of Google
[23:52] researchers and that completely
[23:54] transformed the landscape of AI okay and
[23:58] and this is the architecture that we are
[24:00] talking about the architecture is taken
[24:02] from this attention is all you need
[24:05] paper so the way it works is when you
[24:07] have this sentence the word Indian needs
[24:11] attention from dosa dhokla etc you can say
[24:15] that all these words are attending to
[24:19] this word Indian even the word Bin instead
[24:22] of Bin let's say here if I had Dhaval this
[24:26] will become Italian instead of Indian so
[24:29] this word Bin is attending to this word
[24:32] Indian dosa dhokla etc is also attending
[24:34] to this word Indian similarly uh words
[24:38] Sweet Indian rice Etc are attending to
[24:43] this word dish now how much they're
[24:46] attending to this word well that
[24:48] attention weight or attention score
[24:50] might be different for example sweet
[24:53] might be attending to this word dish by
[24:55] 36 percent let's say Indian is attending
[24:58] to it by 14 percent rice is attending to it
[25:01] by 18 percent these are the adjectives
[25:04] which will enrich the meaning of word
[25:06] dish on the other hand the word I made
[25:11] Etc are not enriching the meaning of
[25:14] word dish that much because instead of I
[25:17] if I had Rahul or Mohan or David the
[25:22] meaning of this word will not change
[25:24] that much but instead of sweet if I have
[25:27] spicy all of a sudden the embedding or
[25:30] the meaning of this dish changes because
[25:32] as a next word I will immediately have
[25:34] Biryani instead of kheer the goal here is to
[25:38] build this kind of attention weight or
[25:41] attention score okay for each of these
[25:43] words it's a matrix because for Dish all
[25:47] the other words in that same sentence
[25:50] are enriching the meaning of
[25:53] that word okay so for word dish let's
[25:56] say sweet is attending to it by 36 percent
[26:00] Indian is attending to it by 11 percent and
[26:03] so on and by the way I have just made up
[26:04] these numbers just for explanation
[26:06] purpose the word dish also uh attends to
[26:11] that word itself right because dish
[26:13] itself has some meaning dish means dish
[26:15] right so that will also attend to itself
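One way to picture what these attention weights are for: treat them as mixing proportions and blend the word vectors accordingly. The sketch below does that for the word dish. All numbers are invented (the weights loosely follow the made-up percentages above and are chosen to add up to 1), and the two-dimensional embeddings are placeholders. In an actual Transformer the blend is taken over learned value vectors rather than the raw static embeddings.

```python
# Made-up attention weights for the word "dish"; they add up to 1.
ATTENTION_TO_DISH = {
    "i": 0.02,
    "made": 0.03,
    "a": 0.01,
    "sweet": 0.36,
    "indian": 0.14,
    "rice": 0.18,
    "dish": 0.26,  # the word also attends to itself
}

# Placeholder 2-dimensional embeddings, invented for illustration.
EMBEDDINGS = {
    "i": [1.0, 0.0],
    "made": [0.0, 1.0],
    "a": [0.0, 0.0],
    "sweet": [2.0, 1.0],
    "indian": [1.0, 3.0],
    "rice": [3.0, 1.0],
    "dish": [1.0, 1.0],
}


def weighted_sum(weights, vectors):
    """Blend the vectors, each scaled by its attention weight."""
    size = len(next(iter(vectors.values())))
    blended = [0.0] * size
    for word, weight in weights.items():
        for i, value in enumerate(vectors[word]):
            blended[i] += weight * value
    return blended


CONTEXTUAL_DISH = weighted_sum(ATTENTION_TO_DISH, EMBEDDINGS)
print(CONTEXTUAL_DISH)
```

Words with a large weight, such as sweet, pull the result strongly toward their own vector, while words like i and made barely move it.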
[26:19] so for every word see right now for Dish
[26:21] I have all this scores for Rice you will
[26:24] have scores for Indian for every word
[26:27] you will try to compute these attention
[26:30] scores and then you will use this
[26:33] concept of query key and value to uh
[26:38] come up with the contextual embedding
[26:40] now let me explain query key and
[26:42] value by going over an analogy let's say
[26:44] you're going to a library looking for a
[26:47] book on quantum physics especially
[26:49] Quantum computation you might have this
[26:52] query that hey I'm looking for this
[26:55] quantum physics book and this particular
[26:57] person who is a librarian will use the
[27:02] book index so he'll go to his computer
[27:04] try to search for that book or maybe he
[27:06] will go to this rack and locate a
[27:08] specific rack which has a label quantum
[27:11] mechanics okay so for him the key or the
[27:15] index to locate that book is the label
[27:19] on the rack you know in library you see
[27:21] like history drama science those kind of
[27:24] labels or you have book description okay
[27:28] so based on book description the rack
[27:31] label you will figure out the
[27:34] appropriate book so The Book Rack book
[27:37] description Etc is called key and then
[27:41] the actual book content is your value so
[27:44] let's say you pull this book okay and
[27:46] whatever content the actual content of
[27:48] that book is value let me give you
[27:51] another example let's say there is a
[27:53] college professor who wants to write an
[27:55] essay on Quantum Computing and he needs
[27:57] help help of bunch of students so when
[28:00] he talks to these students Mohan says
[28:03] that I know linear algebra Mera says
[28:06] that I know quantum mechanics Bob will
[28:09] say hey I know philosophy same way Kathy
[28:12] knows computer science so here whatever
[28:16] Mohan Mera Bob Kathy are claiming about
[28:20] their knowledge is called key and what
[28:24] happens after that is each of these
[28:26] students will start writing an essay so
[28:29] teacher will say okay just go and write
[28:33] um a bunch of paragraphs so Mera Mohan
[28:36] Kathy Bob wrote all these paragraphs
[28:39] which are called value and then teacher
[28:42] knows that Mera knows most about quantum
[28:45] mechanics okay so he will take 60% of
[28:49] Mera's content or Mera's value he will
[28:53] take 29% of Kathy's value because
[28:56] computer science and quantum Computing
[28:58] so that it's kind of related so he will
[29:01] use 60% of Mera's content 29% of Kathy's
[29:05] content to formulate that final essay on
[29:09] the other hand Bob's content he will use
[29:12] only one percent because the query and
[29:15] key are not matching that much see Bob
[29:17] has a knowledge on philosophy but our
[29:19] query requires Quantum Computing so
[29:22] query and key we can say they're not
[29:24] matching in terms of math you can think
[29:26] about dot product so let's say dot product
[29:29] between query and key vector is less
[29:32] let's say only one percent okay but in
[29:35] the other case Mera's query and key dot
[29:38] product is higher let's say 60% so you
[29:40] will take 60% of Mera's value which is
[29:44] the essay written by Mera on quantum
[29:48] Computing now same way for our sentence
[29:51] the query for Dish is I want to know
[29:54] about my modifiers okay I'm just giving
[29:56] you analogy by the way way the real
[29:59] working is little different but let's
[30:01] say you are generating contextual
[30:03] embedding for the word dish and the
[30:05] query may look something like I want to
[30:08] know about my modifiers right like my
[30:10] adjectives all these adjectives which
[30:12] modifies my meaning and the key will be
[30:16] uh the description that each of these
[30:18] words are giving about themselves for
[30:20] example I will say I'm the subject of
[30:22] the sentence made will say I indicate an
[30:25] action or a verb similarly sweet will
[30:27] say say I am an adjective describing
[30:30] taste and so on so these are called keys
[30:34] and based on the dot product between
[30:37] query and key yeah you're trying to find
[30:39] out you know which things are matching
[30:42] so if if dish wants to know about
[30:44] modifiers I think these are the
[30:46] adjectives which modifies the meaning of
[30:49] word dish so the score attention score
[30:52] for these will be higher whereas the
[30:55] attention score for these will be lower
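The matching between a query and the keys can be sketched with plain dot products. In the code below every vector is invented and only two-dimensional. Dividing by the square root of the vector size and applying a softmax follows the scaled dot-product attention of the original Transformer paper; the softmax is what turns raw scores into weights that add up to 1.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot product of one query with every key, then softmax."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    return softmax(scores)


WORDS = ["i", "made", "a", "sweet", "indian", "rice", "dish"]

# Invented query for "dish", loosely meaning "I am looking for my modifiers".
QUERY_DISH = [1.0, 0.0]

# Invented keys: modifiers point in a similar direction to the query.
KEYS = [
    [-1.0, 0.5],  # i
    [-0.5, 1.0],  # made
    [-1.0, 0.0],  # a
    [2.0, 0.1],   # sweet
    [1.5, 0.0],   # indian
    [1.6, 0.2],   # rice
    [1.0, 0.0],   # dish
]

WEIGHTS = dict(zip(WORDS, attention_weights(QUERY_DISH, KEYS)))
for word, weight in WEIGHTS.items():
    print(f"{word}: {weight:.2f}")
```

Keys that line up with the query (sweet, indian, rice) receive most of the weight, and keys pointing elsewhere (i, made, a) receive very little.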
[30:58] now once you get all these attention
[30:59] scores you need value so each of these
[31:02] words will now say the value value means
[31:05] uh the component that it is contributing
[31:09] to that query so I will say Indian will
[31:12] say the style or origin is Indian sweet
[31:14] will say The Taste is sweet similarly
[31:17] all these words will have specific value
[31:21] and then uh let's consider the values of
[31:24] only these four words I mean as such it
[31:25] will use values of all the words but for
[31:27] simpl let's say only these four words
[31:30] these values by the way will be some
[31:33] kind of vector we'll look into how
[31:35] exactly those vectors are derived but
[31:37] let's say these values are all these
[31:39] vectors and query also has like dish
[31:42] also has its own Vector right like this
[31:43] is the static embedding so this is its
[31:46] own vector and now what you do is in
[31:48] static embedding you add all these
[31:50] vectors and all these vectors you can
[31:52] think about as riceness indianness okay so
[31:56] see this is how you add all of them okay
[32:00] you add all of them actually the vector
[32:03] of all the other words and you get the
[32:06] final context-aware embedding in
[32:08] terms of the embedding space it is like
[32:11] going from dish to riceness indianness and
[32:14] so on so these vectors right riceness
[32:16] indianness sweetness are these vectors
[32:20] okay this is just a mathematical
[32:22] representation now let's look at how
[32:24] those vectors are built so here you have
[32:27] a query for Dish okay so let me just
[32:30] represent it as a horizontal right this
[32:32] was a vertical format this is horizontal
[32:34] format the same thing for each of these
[32:37] words or tokens you will first get their
[32:40] embedding from our static embedding
[32:42] Matrix okay so these are static
[32:44] embeddings for each of these words in
[32:46] the case of BERT the dimension is 768
[32:48] for GPT it is 12,000 something let's say
[32:51] for word dish this is my embedding let's
[32:54] call it E7 that E7 you will multiply
[32:58] with a special Matrix called WQ which
[33:03] will have a
[33:05] dimension uh of 64 by 768 so 768 is the
[33:10] columns in order to perform matrix
[33:12] multiplication The Columns in the first
[33:14] Matrix should be equal to rows in the
[33:16] second Matrix so this is 768 this is
[33:19] 768 the rows in the first
[33:22] Matrix is 64 for BERT for GPT it is
[33:26] different and when you do
[33:28] this kind of matrix multiplication
[33:31] you will get this query vector okay so
[33:35] you will multiply this row with this
[33:38] column okay so you multiply 50 with this
[33:42] 0.9
[33:44] minus5 with
[33:46] 1.07 65 with this and then you add them
[33:50] all up you put them here then you take
[33:53] the second row multiply 23 with this
[33:56] minus 71 with this 1.58 with this and
[33:59] you put that here and so on okay so this
[34:03] is how you build a query Vector now WQ
[34:06] here knows how to encode query of a
[34:10] token for attention computation when we
[34:14] train the model we already got the WQ
[34:17] and WQ after the training is done it it
[34:22] doesn't change okay after you do that
[34:24] training sometimes it is referred to as
[34:26] pre-training on huge amount of data you
[34:30] build this WQ Matrix which doesn't
[34:32] change okay so for a trained model uh this
[34:35] WQ will not change you multiply that
[34:37] with specific embedding E7 let's say
[34:41] this is a positional embedding you get
[34:43] Q7 which is the query Vector for the
[34:47] word dish and you repeat the same
[34:49] process for all the words okay so just as
[34:51] you have Q7 for dish for rice you will
[34:54] have q6 Indian you have uh Q5 and so on
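The row-by-column multiplication described above can be sketched in a few lines of plain Python. This is only an illustration: the matrix `W_Q`, the embeddings, and the toy sizes (embedding dimension 4, query dimension 2 instead of BERT's 768 and 64) are made-up values, not numbers from the slide.

```python
def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical learned weights (2 x 4). After training these stay fixed.
W_Q = [
    [0.5, -0.5, 1.0, 0.0],
    [0.25, 0.0, -1.0, 2.0],
]

# Hypothetical position-aware embeddings (length 4) for two tokens.
embeddings = {
    "rice": [1.0, 0.0, 2.0, 1.0],
    "dish": [2.0, 4.0, 1.0, 0.5],
}

# The same W_Q is reused for every token; only the embedding changes.
queries = {token: matvec(W_Q, e) for token, e in embeddings.items()}
print(queries["dish"])  # [0.0, 0.5]
```

Each row of `W_Q` produces one element of the query vector, so the query has as many elements as `W_Q` has rows.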
[34:58] to summarize WQ here knows how to encode
[35:03] query of a token for attention
[35:05] computation and remember in one of the
[35:07] previous slides I said that when the
[35:10] model is trained it will have static
[35:12] embedding matrix but it will also have
[35:14] this WQ WK WV and that is what I was
[35:18] referring to okay so we just talked
[35:20] about WQ here the question now is during
[35:23] the training how exactly we get WQ WK WV
[35:27] well we take this Transformer
[35:29] architecture and we train it on huge
[35:32] amount of data so we take all the
[35:33] Wikipedia text and we generate this kind
[35:36] of X and Y pairs okay so you don't have
[35:39] to manually label it this is called uh
[35:42] self-supervised data set uh you don't
[35:45] need a person to label it because you
[35:47] can just split a sentence you can have a
[35:48] sentence and the next word is your y
[35:51] okay so this is your X this is your y
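As a rough sketch of how such X and y pairs can be produced without manual labeling, the snippet below splits one sentence at every word boundary. The function name `next_word_pairs` and the example sentence are assumptions made for illustration; real training pipelines work on tokens rather than whitespace-separated words.

```python
def next_word_pairs(sentence):
    """Return (context, next_word) pairs for every prefix of the sentence."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

# Assumed example sentence, chosen to match the running "dish" example.
pairs = next_word_pairs("I made a sweet Indian rice dish")
for x, y in pairs:
    print(f"X = {x!r:<30} y = {y!r}")
```

One sentence of seven words yields six training pairs, which is why plain text such as Wikipedia is enough to build a very large dataset.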
[35:53] you feed X as an input and when the
[35:56] model is not trained it will not
[35:58] predict the right things it will make errors
[36:00] so let's say for this it produces Mexican
[36:03] which is your y-hat okay it's a
[36:05] predicted value your actual value is
[36:07] Indian so that is why you calculate
[36:09] error and then you back propagate that
[36:12] error through back propagation and chain
[36:15] rule partial derivative and so on folks
[36:17] you need to have understanding of how
[36:20] back propagation Works what is a chain
[36:22] rule you need to know all those deep
[36:24] learning fundamentals okay I have
[36:26] covered that in other modules if you're
[36:29] part of my courses or boot camp you
[36:32] would have seen those if you're watching
[36:33] it from YouTube Again YouTube has uh
[36:36] these kind of tutorials my channel has
[36:38] these tutorials so you need to know how
[36:40] the back propagation Works essentially
[36:44] you are feeding this data set you're
[36:45] Computing the error and you're back
[36:47] propagating it throughout this
[36:49] architecture and during that back
[36:51] propagation when let's say you train
[36:53] this on millions and millions of
[36:55] sentences that is the time when uh this
[36:59] WQ WK WV will be finalized inside this
[37:04] model architecture now going back for
[37:08] Dish query we computed this particular
[37:11] query Vector next step is to compute the
[37:16] key vectors okay so I gave this kind of
[37:18] analogy description to uh get you an
[37:21] intuitive understanding but in reality
[37:24] these will be the vectors so let's see
[37:27] how those vectors are formed so here I'm
[37:30] taking the first token I and the keys
[37:33] look something like this okay so here
[37:36] you will take the positional embedding
[37:38] the static embedding for the word I and
[37:41] you will multiply that with another
[37:44] magical Matrix WK once again WK after
[37:48] your pre-training after that model is
[37:50] trained it is fixed so you take that
[37:53] Matrix and you uh figure out your K1
[37:57] okay here WK knows how to encode key of
[38:01] a token for the attention computation
[38:05] then you go to the next word compute K2
[38:08] next word compute K3 you do that for all
[38:11] the words so now for all these words we
[38:15] have these key vectors okay so you have
[38:18] Q7 uh query Vector you have key vectors
[38:22] and you take the dot product between
[38:24] these two okay so K1 dot Q7 okay so
[38:29] if you take these dot product between
[38:31] these two vectors you'll get some number
[38:34] right like
[38:36] 3.33 57 101 whatever that number is it
[38:40] it's a single number you will get that
[38:43] for all the tokens okay and then you let
[38:48] it pass through a softmax function from
[38:51] deep learning fundamentals you should
[38:52] know about softmax softmax will convert a
[38:55] bunch of values into a probability
[38:57] distribution so that when you add all these
[38:59] values it will be one so softmax is
[39:02] converting all these raw scores
[39:05] into probability distribution so that
[39:07] you can express them as percentages and
[39:10] the sum of all these percentage will be
[39:12] one mathematically you can represent
[39:15] this operation as softmax of Q times
[39:18] KT now KT is K transpose okay so here Q
[39:24] was a vector but if you talk about let's
[39:26] say this K right so K is k1 K2 K3 so
[39:30] it's not just one vector actually it's
[39:32] like bunch of vectors so this can be
[39:34] thought of as a matrix and to multiply
[39:37] that you need to do a transpose see if
[39:39] you are multiplying Q7 with K1 like a
[39:42] single Vector you don't need to do
[39:44] transpose but when you have Matrix you
[39:46] need to do transpose okay so we'll use
[39:48] this formula later on in the final
[39:50] attention formula but for now just
[39:52] remember that there is this kind of
[39:54] formula as a Next Step once you have
[39:57] computed these attention scores or
[39:59] attention weights you need to find the
[40:03] value Vector right so this was a
[40:05] descriptive uh understanding of value
[40:07] Vector but the way value vectors are
[40:09] derived is similar for each of the
[40:12] tokens you get positional embedding
[40:15] static embedding then you multiply that
[40:18] with another matrix called WV you get V1
[40:22] and here WV knows how to encode value of
[40:26] a token for attention computation okay
[40:29] so you do that for all the words so V1
[40:32] V2 V3 V4 V7 and so on okay so for all
[40:37] these words you will uh get their values
[40:41] and you multiply that with the weight so
[40:45] you will have more component from this
[40:47] V4 Vector because it's like 36 percent
[40:50] but the component that you will use from
[40:52] V1 will be very small 7 percent okay so
[40:56] see the sweetness you're taking
[41:00] 36% uh here I don't have things in order
[41:03] but you essentially add all the vectors
[41:06] okay so you just add all of this
[41:09] everything okay so here I'm not showing
[41:11] everything but you kind of get an idea
[41:13] so from static embedding you go all the
[41:16] way to context aware embedding here's
[41:19] the mathematical formula for attention
[41:21] Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V where d_k is the dimension of a key
[41:26] vector
[41:27] in case of GPT this is 128 so what they
[41:31] do is they take the entire
[41:36] 12,288 dimension right for GPT the
[41:40] dimension of the contextual embeddings
[41:42] is 12,288 and you divide it by the
[41:46] number of attention heads I think for
[41:48] GPT it is 96 and that's how you get 128 I
[41:52] will explain this 96 a little later but
[41:54] there is a way to derive this number 128
[41:58] so you do division by square root of
[42:02] that just for numerical stability you
[42:04] don't want this dot products to become
[42:06] very high okay so to bring down that
[42:08] number we do kind of scaling here and
[42:11] you do soft Max and you multiply that
[42:14] with this value V so far what we talked
[42:17] about is a single attention block
[42:20] actually there are multiple attention
[42:23] blocks so that's what we'll cover next
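Putting the steps of a single attention block together (dot product of query and keys, scaling by the square root of the key dimension, softmax, weighted sum of values), a minimal sketch for one query vector might look like the following. The 2-dimensional vectors are invented for illustration, and a real implementation would process all queries at once as matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical key and value vectors for three tokens, plus one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 2.0]

weights, context = attend(query, keys, values)
print(weights)   # the third token gets the largest weight
print(context)   # weighted mix of the three value vectors
```

The third key lines up best with the query, so its value vector contributes the most to the output, which is the same idea as taking more of V4 than of V1 in the example above.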
[42:25] let's understand what is multi head
[42:28] attention so far we have seen this
[42:31] picture where you take positional
[42:34] embedding for each of the words in your
[42:35] input sequence you let it go through
[42:39] attention head which is basically taking
[42:42] this WQ WK
[42:45] WV and coming up with context-aware
[42:49] embedding so that whole portion is
[42:51] called one attention head in reality you
[42:55] have multiple attention heads okay so
[42:57] you have multiple attention heads each
[43:00] of these heads is producing its own
[43:02] context aware embedding which you will
[43:05] combine all together, in practice by concatenating them and applying a linear layer, to get the
[43:09] final context aware embedding now what
[43:12] is the purpose of this multiple
[43:13] attention heads one attention head will
[43:16] be working on adjectives okay so for the
[43:19] word dish sweet Indian rice Etc are
[43:22] adjectives the second attention head
[43:26] might be working on a verb okay so
[43:29] how this verb made affects the
[43:32] contextual embedding of the word dish
[43:34] the third attention block might be
[43:36] looking at pronoun so you can think of
[43:40] this as looking at different aspects of
[43:43] a language or different aspects of that
[43:47] context okay for the other sentence the
[43:49] first attention head might be looking at
[43:51] a cultural context such as dosa dhokla
[43:54] Millet bread are all Indian Delicacies
[43:57] whereas the second attention head might
[43:58] be looking at the pronoun where instead
[44:02] of the and B if I exchange the order of
[44:05] these two uh here you will have Italian
[44:07] similarly instead of you if I say I
[44:11] again here you will have a different
[44:13] word so there is a pronoun context the
[44:15] third attention uh head might be looking
[44:17] at action and timing you know you're
[44:20] driving 20 minutes Drive Etc so the
[44:23] purpose of multiple attention heads is to
[44:26] allow the model
[44:27] to focus on different aspects or
[44:32] different types of relationships between
[44:35] tokens in a language when you have
[44:37] multiple tokens there is a different
[44:39] type of relationship between these
[44:41] tokens such as semantic positional
[44:45] syntactic uh
[44:47] simultaneously uh enriching the
[44:49] contextual understanding of each uh
[44:53] token so I want you to read this
[44:55] sentence again uh I hope hope you get an
[44:57] idea it is basically looking at
[44:59] different aspects or different
[45:01] relationship between the tokens to
[45:04] enrich the
[45:05] contextual understanding of each token
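A small sketch of the multi-head idea, under some simplifying assumptions: the query, key and value vectors are taken as already projected, each head simply works on its own slice of those vectors, and the head outputs are concatenated. The final linear layer that the original Transformer applies after concatenation is left out, and all numbers are made up.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query; returns the context vector."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale
                       for key in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def split_heads(vector, num_heads):
    """Cut one vector into num_heads equal slices."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head(query, keys, values, num_heads):
    """Run attention separately on each slice, then concatenate the results."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        output.extend(attend(q_heads[h],
                             [k[h] for k in k_heads],
                             [v[h] for v in v_heads]))
    return output

# Hypothetical, already-projected 4-dimensional vectors for three tokens.
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
query = [1.0, 1.0, 0.5, 0.5]

context = multi_head(query, keys, values, num_heads=2)
print(len(context))  # 4: two heads of size 2, concatenated

# Per-head size for the GPT figures quoted earlier: 12,288 split over 96 heads.
print(12288 // 96)  # 128
```

Because each head sees a different slice, each one computes its own attention weights, which is what lets different heads pick up different kinds of relationships.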
[45:09] so here in this particular architecture
[45:12] diagram see first we produced this uh
[45:15] static embedding then we added this
[45:17] positional encoding right so you got
[45:19] positional encoding here uh you ignore
[45:22] this normalization part for now
[45:23] normalization is simple actually it's
[45:25] like uh normal izing it to Value which
[45:28] is zero mean and one standard deviation
[45:31] and then looking at V K Q kind of matrices
[45:36] to uh use multi-headed attention to
[45:40] derive ec1 ec2 these
[45:44] individual contextual embeddings and
[45:46] you combine all of them to produce your
[45:51] final context-aware embedding which
[45:53] will come here and by the way this is a
[45:56] residual connection uh if you know about
[45:58] deep learning you will have uh this um
[46:03] residual connection that helps you uh
[46:06] with a smooth gradient flow after this
[46:09] block the next block is feed forward
[46:12] Network so you'll ask me okay I already
[46:15] have context-aware embedding now why do I
[46:18] need this feed forward Network well the
[46:21] thing is you don't have your final
[46:23] context aware embedding yet so here at
[46:26] this point
[46:27] the embeddings are enriched but they are
[46:30] not still fully furnished yet you have
[46:33] to let it go through this feed forward
[46:36] Network so what happens is you passed
[46:40] your positional embedding through bunch
[46:41] of attention heads and you got this
[46:44] enriched contextual embedding that will
[46:47] go through a fully connected neural
[46:49] network layer okay so here the input
[46:53] neurons will be same number of uh
[46:56] elements as this embedding so in case of
[46:59] BERT let's say this will be
[47:01] 768 for GPT it will be
[47:05] 12,288 and then in the hidden layer you can
[47:08] have n number of neurons and in the
[47:11] output layer again you'll have same as
[47:13] this one because this input and this
[47:16] output will have a same size so if this
[47:18] is 768 this will also be 768 okay so you
[47:21] let it go through this feed forward
[47:24] Network and the resulting embedding that
[47:28] you get is even more enriched it's like a
[47:30] more furnished product now this neural
[47:34] network weights you know this will have
[47:36] a lot of weights and parameters those
[47:38] weights and parameters are set once
[47:40] again during that training process so
[47:43] when you're going through this XY pairs
[47:45] right your training pairs you might have
[47:48] hundreds and thousands of these
[47:50] sentences when you're training that
[47:51] Network during that training look at
[47:54] this feed forward Network you know
[47:55] during back propagation
[47:57] those weights are getting adjusted and
[48:00] it will help you refine your sentence
[48:03] further now once you get enriched
[48:06] embedding you will add that into your
[48:08] original embedding and you get the final
[48:11] now it is final now it's a final
[48:14] contextually Rich embedding so the
[48:16] purpose of feed forward network is it
[48:19] will enrich each token embedding by
[48:22] applying nonlinear transformation
[48:25] because in the attention head you are
[48:27] mostly applying linear transformations such as the Q K V projections here you
[48:30] get an opportunity to apply nonlinear
[48:32] transformation independently enabling
[48:35] the model to learn complex patterns and
[48:38] higher order features Beyond just the
[48:41] contextual relationship see multi-head
[48:43] attention is just capturing those
[48:44] contextual relationships how these words
[48:47] are related to each other but language
[48:49] is nonlinear it's not just the
[48:51] relationship right there are like some
[48:52] nuances nonlinearity complexity all of
[48:56] that can be captured by this fully
[48:58] connected layer or feed forward Network
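The feed-forward step and the add-back to the original embedding can be sketched as below. This is an illustration under assumptions: ReLU is used as the nonlinearity (GPT-style models typically use a smoother function such as GELU), the embedding size is 2 with a hidden size of 3, and the weights `W1`, `b1`, `W2`, `b2` are invented rather than learned.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """The nonlinear step: negative values are clipped to zero."""
    return [max(0.0, x) for x in vector]

def feed_forward(x, W1, b1, W2, b2):
    """Input layer -> hidden layer with ReLU -> output layer of the input size."""
    hidden = relu([h + b for h, b in zip(matvec(W1, x), b1)])
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

# Hypothetical weights: embedding size 2, hidden size 3.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
b2 = [0.0, 0.0]

tokens = [[2.0, 1.0], [1.0, 3.0]]  # one enriched embedding per token

# The same weights are applied to every token independently, and the
# result is added back to the input (the residual connection).
outputs = []
for x in tokens:
    y = feed_forward(x, W1, b1, W2, b2)
    outputs.append([xi + yi for xi, yi in zip(x, y)])

print(outputs)  # [[3.0, 4.0], [3.0, 7.0]]
```

Input and output have the same length, so the output can be added to the original embedding element by element.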
[49:01] so to better visualize each of the words
[49:04] in your sentence let's say you have I
[49:06] made dish every word will go through
[49:09] positional embedding and every
[49:11] positional embedding goes through
[49:13] multiple attention head so the embedding
[49:15] for I will go through all the heads okay
[49:18] so in GPT if you have 96 heads it will
[49:21] go through all those 96 heads similarly
[49:24] made will also go through 96 heads and
[49:26] this this is happening in parallel it's
[49:28] not like you process I first and made no
[49:31] all of these things are happening in
[49:33] parallel and each of these uh vectors
[49:36] will also go through the feed forward
[49:38] Network in parallel at the same time right so
[49:41] the same network is available for each
[49:42] of these words and you get all these
[49:45] contextually enriched embeddings okay so
[49:48] that comes here so after feed forward
[49:51] Network here at this point you get all
[49:55] these embeddings okay and then you have
[49:59] this uh plus sign and normalization so
[50:02] normalization layer by the way this Norm
[50:05] is uh just it's ensuring that you have
[50:08] stable learning improving the gradient
[50:10] flow if you have deep learning
[50:12] fundamentals you will understand what I
[50:14] mean uh in machine learning generally
[50:16] when you have all these wide range of
[50:19] values if you normalize them let's say
[50:21] you normalize them to zero and one you
[50:23] get better control over your training
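As a minimal sketch of what the normalization step does to one embedding, assuming the zero-mean, unit-standard-deviation form described earlier: the learnable scale and shift parameters that real layer normalization also has are left out, and the input numbers are arbitrary.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale one embedding to zero mean and (almost exactly) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    # eps guards against division by zero when all elements are equal.
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

normalized = layer_norm([50.0, -5.0, 65.0, 10.0])
print(normalized)
```

Whatever the range of the incoming values, the output always has the same scale, which keeps the numbers flowing through many stacked layers under control.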
[50:26] now you also notice this Nx layers so
[50:29] Nx layers is basically for BERT let's say
[50:33] if you have a BERT base model you have 12
[50:36] such layers okay so this is a
[50:37] Transformer block so you kind of repeat
[50:40] so you have one block then after that
[50:42] you have another block so in case of BERT
[50:44] base model you have 12 layers BERT large
[50:47] you have 24 layers in case of GPT again
[50:51] there will be a different number of layers
[50:52] so that's what this Nx layers means all
[50:56] right finally we are done with
[50:58] understanding encoder I just want to
[51:00] summarize we had an input sequence we
[51:03] generated a static embedding here then
[51:06] here we generated a positional embedding
[51:08] then we have one Transformer block or NX
[51:11] layer where we first normalize we use
[51:14] Q K V to compute attention score or
[51:17] attention weight we have multiple such
[51:20] heads and then you go through
[51:22] normalization you have feed forward
[51:24] Network you kind of ADD remember like
[51:27] you have original embedding and then you
[51:28] add that output and you get the final
[51:32] contextual embedding you normalize it
[51:34] and here at this point you are getting a
[51:37] final
[51:38] contextual uh very enriched embedding we
[51:41] have covered most of the Transformer
[51:43] architecture decoder is not going to
[51:45] take uh much time so let's spend few
[51:48] minutes understanding decoder so the
[51:51] output of encoder is a contextual
[51:54] embedding or context Rich embedding
[51:57] which you give it as an input to decoder
[52:00] and decoder will produce the next word
[52:03] if you're working on next word
[52:04] prediction if you're working on language
[52:07] translation it will uh start with this
[52:10] special token called start and then it
[52:12] will produce Maine then another word kheer
[52:15] banai and so on okay so that's a goal of
[52:18] a decoder now here you will notice one
[52:22] thing which is called
[52:24] multi-headed cross attention okay so
[52:27] let's understand what exactly is cross
[52:29] attention let's say you have this
[52:31] sentence I made kheer which you want to
[52:33] translate into Hindi here you will have
[52:36] key vectors and value vectors as we have
[52:39] discussed before but the query Vector
[52:41] will be little different so query Vector
[52:42] will be start and it will be like I'm
[52:45] starting to generate translation what
[52:47] part of the input should I focus on and
[52:49] then when you have next word which is
[52:51] Maine which is the first word in your
[52:54] translation it will be like I generated
[52:57] the subject what is the subject okay
[53:00] then you will have here I generated the
[53:02] subject and object help me complete the
[53:05] sentence with a verb form so the query
[53:08] part is little different see in the
[53:11] previous example you had only one
[53:13] sentence so I made kheer so you'll
[53:15] generate a query from made let's say and
[53:18] you will have key and value from same
[53:20] sentence in case of language translation
[53:23] it's little different uh so here we need
[53:25] to use cross cross attention why cross
[53:28] attention because query you are using it
[53:30] from the translated sentence in Hindi
[53:34] whereas key and values are being used
[53:37] from the original sentence in English so
[53:40] that is why it's called cross attention
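The difference from self-attention can be sketched as follows: the queries come from the decoder side (the Hindi words generated so far) while the keys and values come from the encoder side (the English sentence). All vectors here are invented for illustration, and the function name `cross_attention` is just a label for this sketch.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Each decoder position attends over every encoder position."""
    scale = math.sqrt(len(encoder_keys[0]))
    all_weights, contexts = [], []
    for query in decoder_queries:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in encoder_keys]
        weights = softmax(scores)
        context = [sum(w * v[i] for w, v in zip(weights, encoder_values))
                   for i in range(len(encoder_values[0]))]
        all_weights.append(weights)
        contexts.append(context)
    return all_weights, contexts

# Hypothetical vectors. Keys and values: English side ("I", "made", "kheer").
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Queries: Hindi side generated so far (start token, then the first word).
decoder_queries = [[0.0, 0.0], [3.0, 0.0]]

weights, contexts = cross_attention(decoder_queries, encoder_keys, encoder_values)
print(len(weights), len(weights[0]))  # 2 3
```

The weight table has one row per decoder position and one column per encoder position, so the two sequences are free to have different lengths.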
[53:43] in the diagram you can see here the V
[53:45] and K values are coming from your
[53:47] encoder encoder has processed this input
[53:50] here okay so V and K are coming from
[53:54] the encoder see this is the arrow whereas Q is
[53:58] coming from the decoder itself right so
[54:01] you know that Hindi sequence will be
[54:03] produced here so Maine kheer banai etc so that
[54:07] query part is coming from here that is
[54:10] why it is called cross attention in the
[54:13] case of BERT we all know that there is only
[54:15] encoder part decoder part is not there
[54:18] in case of GPT the Transformer
[54:20] architecture is little different okay so
[54:22] that's the decoder part remaining part
[54:24] we have already understood now let me uh
[54:27] show one nice tool which can visually
[54:30] show you this architecture someone has
[54:32] built this nice visualization tool you
[54:34] can go to poloclub.github.io/transformer-explainer
[54:36] and here you can
[54:38] look into different examples right so
[54:41] for example let's look at this sentence
[54:43] as the space ship was approaching the it
[54:47] will try to autocomplete that word and
[54:50] say station right now each of these
[54:53] words has the space ship first you had
[54:57] this Dropout layer so we are talking about
[55:00] Transformer here so it will have a
[55:02] Dropout layer the architecture is little
[55:05] customized compared to the base
[55:07] Transformer architecture then you have
[55:09] this residual connection okay so
[55:11] residual connection will take you from
[55:13] here to here if you talk about
[55:15] embeddings you have token embeddings for
[55:18] each of these right see 768 is the size
[55:21] then you have positional embedding okay
[55:24] you add all this position and you get
[55:25] final Vector this is a
[55:27] positional embedding of a sentence and
[55:31] you have residual connection then you
[55:33] have QKV computation so if you look at
[55:37] this particular block here qkv it will
[55:41] kind of visually show you how you
[55:43] compute q k v and get all those three
[55:47] vectors right like query Vector key
[55:50] vector and value vector and then uh you
[55:54] have this output which you feed to the
[55:57] MLP MLP is your feed forward neural network
[56:00] that we talked about okay so feed
[56:01] forward neural network then you again
[56:03] have a residual block and then you have
[56:05] layer normalization and this is one
[56:08] Transformer block you have multiple of
[56:10] them like 11 okay so that Nx layer
[56:13] that I was referring to is one block so
[56:16] you have repeated blocks and in the end
[56:18] you get this kind of soft Max
[56:20] probability see the highest probability here is for
[56:23] station okay so just play with this
[56:26] particular Tool uh to get a better
[56:29] understanding of this thing and I want
[56:31] to give credits uh to this amazing
[56:34] Channel called 3Blue1Brown so if
[56:37] you go to YouTube and type in
[56:40] Transformer explained 3Blue1Brown you
[56:43] will find all these videos so I want you
[56:45] to watch from video number 5678 onwards
[56:50] he will have more videos as well uh
[56:52] especially these three video dl5 dl6 dl7
[56:57] these three videos you must watch it
[56:59] will enhance your understanding further
[57:02] I myself learned a lot from this channel
[57:06] so due credits to 3Blue1Brown
[57:09] all right that's it folks so that's that
[57:11] was about Transformer I know it was a
[57:12] long discussion there were many topics
[57:15] that we covered but hopefully your
[57:18] understanding is clear if you have any
[57:20] question please feel free to ask