Transformers Explained | Simple Explanation of Transformers

Transcribed Jun 18, 2026 Watch on YouTube ↗
Beginner 57 min read For: AI enthusiasts and beginners wanting an intuitive understanding of the Transformer architecture without heavy math.
Views: 414.6K | Likes: 8.9K | Comments: 322 | Dislikes: 81 | 2.2% 📈 Moderate

AI Summary

This video provides an intuitive and simplified explanation of the Transformer architecture, the core model behind modern AI like ChatGPT. It covers foundational concepts like word embeddings, attention mechanisms, and the encoder-decoder structure, aiming to demystify the complex diagram commonly associated with Transformers.

[[0:00]]
GPT and Transformers

ChatGPT is powered by GPT, a large language model based on the Transformer architecture, which is the reason for the modern AI boom.

[[0:33]]
Language Model Goal

The fundamental goal of a language model (like GPT) is to predict the next word in a sentence, iteratively generating a complete answer.
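
The iterative loop described above can be sketched in a few lines. This is a toy illustration only: the `NEXT_WORD` table and the `predict_next` and `generate` helpers are hypothetical stand-ins for a trained model, which would instead score every token in its vocabulary at each step.

```python
# Hypothetical toy "model": maps the last token to the most likely next token.
NEXT_WORD = {
    "how": "are",
    "are": "you",
    "you": "<end>",
}

def predict_next(tokens):
    """Return the next token given the tokens so far (toy lookup on the last token)."""
    return NEXT_WORD.get(tokens[-1], "<end>")

def generate(prompt, max_new_tokens=10):
    """Repeatedly predict one token and feed the extended sequence back in."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the prediction becomes part of the next input
    return " ".join(tokens)

print(generate("how"))  # how are you
```

The key point is the feedback step: each predicted word is appended to the input before the next prediction is made.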

[[1:50]]
Word Embeddings Intro

Machine learning models require numerical input; word embeddings convert words into vectors that capture their meaning, enabling operations like King - Man + Woman = Queen.
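
The word arithmetic can be tried with small hand-made vectors. The feature values below are invented for illustration (authority, has tail, rich, gender), in the spirit of the video's made-up questions; real embeddings such as Word2Vec are learned and their dimensions are not interpretable like this. The `analogy` helper is a hypothetical name.

```python
import math

# Hypothetical hand-made features: [authority, has_tail, rich, gender]
# (gender: -1 = male, +1 = female). Real embeddings are learned, not hand-assigned.
VECTORS = {
    "king":  [1.0, 0.0, 1.0, -1.0],
    "queen": [1.0, 0.0, 1.0, 1.0],
    "man":   [0.25, 0.0, 0.25, -1.0],
    "woman": [0.25, 0.0, 0.25, 1.0],
    "horse": [0.0, 1.0, 0.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))  # queen
```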

[[4:58]]
Static vs Contextual Embeddings

Static embeddings (e.g., from Word2Vec) assign a fixed vector to each word, which fails to capture different meanings in different contexts (e.g., 'track' as a railway track versus tracking a package, or 'dish' in 'rice dish' versus 'cheese dish'). Contextual embeddings are dynamic and change based on surrounding words.

[[12:13]]
Transformer Architecture Overview

The Transformer has two main components: an encoder that generates contextual embeddings for input tokens, and a decoder that uses these embeddings to predict the next word or translate a sentence.

[[15:10]]
BERT and GPT Models

BERT uses only the encoder part of the Transformer, while GPT uses only the decoder. Both are implementations of the same underlying architecture.

[[19:52]]
Encoder Inside: Tokenization & Embeddings

The encoder first tokenizes the input sentence, converts tokens to IDs, retrieves static embeddings (e.g., 768 dimensions for BERT, 12,288 for GPT-3), and adds positional embeddings to encode word order.
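
A minimal sketch of these input steps, under stated assumptions: the five-token `VOCAB`, the 4-dimensional `EMBEDDINGS` table, and the whitespace tokenizer are all hypothetical stand-ins (BERT uses a vocabulary of 30,522 sub-word tokens and 768 dimensions). The positional part uses the sinusoidal formula from the original Transformer paper, which is the formula the video refers to.

```python
import math

# Hypothetical toy vocabulary: token -> token ID.
VOCAB = {"[CLS]": 0, "i": 1, "made": 2, "rice": 3, "[SEP]": 4}
D_MODEL = 4

# Hypothetical static embedding matrix: one row per token ID (learned in a real model).
EMBEDDINGS = [[0.1 * (tid + 1) * (k + 1) for k in range(D_MODEL)]
              for tid in range(len(VOCAB))]

def positional_encoding(pos, d_model=D_MODEL):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for k in range(d_model):
        # k - k % 2 equals 2i, where i is the index of the sin/cos pair.
        angle = pos / (10000 ** ((k - k % 2) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def encode_input(sentence):
    """Tokenize, look up static embeddings, and add the positional encoding."""
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    ids = [VOCAB[t] for t in tokens]
    return [
        [e + p for e, p in zip(EMBEDDINGS[tid], positional_encoding(pos))]
        for pos, tid in enumerate(ids)
    ]

for row in encode_input("I made rice"):
    print([round(x, 3) for x in row])
```

Because the positional vector differs at every position, the same token placed at two different positions ends up with two different input vectors.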

[[21:34]]
Attention Is All You Need

The core innovation is the attention mechanism, where each word 'attends' to other words in the sentence to enrich its contextual embedding. The attention weight determines how much each word influences another.

[[26:38]]
Query, Key, Value (QKV)

Attention uses Query, Key, and Value vectors. The Query (from target token) is matched with Keys (from all tokens) via dot product to compute attention scores. These scores are used to weight Values, producing a context-aware embedding.
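
The computation can be written out directly as softmax(Q K^T / sqrt(d_k)) V. The sketch below assumes the Query, Key, and Value vectors are already given as plain Python lists; in a real model they come from multiplying the token embeddings by learned matrices (the WQ, WK, WV mentioned in the video), which are omitted here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: one query, two identical keys, so each value gets weight 0.5.
out, weights = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
print(weights[0], out[0])
```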

[[42:20]]
Multi-Head Attention

Instead of one attention mechanism, Transformers use multiple heads (e.g., 96 in GPT), each focusing on different relationships (e.g., adjectives, verbs, pronouns) to enrich the contextual embedding.
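
A simplified sketch of the multi-head idea: each token vector is split into equal slices, attention runs independently on each slice, and the per-head results are concatenated. This is an assumption-laden simplification; real Transformers apply separate learned projections per head and a final output projection, both omitted here. The helper names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def split_heads(tokens, num_heads):
    """Split every token vector into num_heads equal slices, grouped by head."""
    d_head = len(tokens[0]) // num_heads
    return [[tok[h * d_head:(h + 1) * d_head] for tok in tokens]
            for h in range(num_heads)]

def multi_head_self_attention(tokens, num_heads):
    """Run self-attention per head, then concatenate the head outputs per token."""
    heads = split_heads(tokens, num_heads)
    head_outputs = [attention(h, h, h) for h in heads]
    return [sum((head_outputs[h][t] for h in range(num_heads)), [])
            for t in range(len(tokens))]

tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
result = multi_head_self_attention(tokens, num_heads=2)
print(len(result), len(result[0]))  # 3 4
```

Each head sees a different slice of the vector, so it can compute its own attention weights, which is what lets different heads specialize in different relationships.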

[[46:12]]
Feed-Forward Network (FFN)

After multi-head attention, a feed-forward network applies a non-linear transformation to each token embedding, enabling the model to learn complex patterns beyond just contextual relationships.
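
A minimal sketch of a position-wise feed-forward network: expand each token vector, apply a non-linearity, and project back. The weights below are hypothetical and hand-picked so the arithmetic is easy to check (with them, the network happens to reproduce its input, since relu(x) - relu(-x) = x); in a trained model the weights are learned. ReLU is the activation used in the original Transformer paper.

```python
def relu(x):
    return max(0.0, x)

def linear(x, weights, bias):
    """Compute weights @ x + bias, with weights given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

# Hypothetical toy weights: expand 2 -> 4 dimensions, then project back 4 -> 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B2 = [0.0, 0.0]

def feed_forward(token):
    """Two linear layers with a ReLU in between, applied to one token vector."""
    hidden = [relu(h) for h in linear(token, W1, B1)]
    return linear(hidden, W2, B2)

def apply_ffn(tokens):
    """The same network is applied to every token independently."""
    return [feed_forward(t) for t in tokens]

print(apply_ffn([[1.0, -2.0], [0.5, 3.0]]))
```

Unlike attention, this step never mixes information between tokens: each token vector passes through the same network on its own.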

[[50:58]]
Decoder: Cross-Attention

The decoder uses cross-attention, where the Query comes from the decoder (e.g., the translated sentence), but the Key and Value come from the encoder (the original sentence). This is crucial for tasks like translation.
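
The difference from self-attention is only where the three inputs come from, which a short sketch makes concrete. The vectors below are hypothetical stand-ins for encoder outputs and decoder states, and the learned projection matrices are omitted.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights

def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the decoder; keys and values come from the encoder."""
    return attention(decoder_states, encoder_outputs, encoder_outputs)

# Hypothetical vectors: 3 source-sentence tokens, 2 target tokens generated so far.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[2.0, 0.0], [0.0, 2.0]]

outputs, weights = cross_attention(decoder_states, encoder_outputs)
print(len(outputs), len(weights[0]))  # 2 3
```

There is one output per decoder token, and each output is a weighted mix over all encoder tokens.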

The Transformer architecture, with its encoder-decoder structure, attention mechanisms, and multi-head design, is the foundation of modern LLMs. Understanding its components—from tokenization and embeddings to QKV and feed-forward networks—demystifies how models like GPT and BERT work.

Clickbait Check

85% Legit

"The title promises a simplified explanation, and the video delivers on that promise with an intuitive breakdown of complex concepts, though it is quite long and doesn't fully cover every detail."

Study Flashcards (10)

Q (easy): What is the fundamental goal of a language model like GPT?
A: To predict the next word in a sentence. [01:36]

Q (medium): How are word embeddings created to capture meaning?
A: Models are trained on massive text data (e.g., Wikipedia, books) to learn relationships between words, resulting in numerical vectors that represent meaning. [04:30]

Q (medium): What is the difference between static and contextual embeddings?
A: Static embeddings have a fixed vector for each word, while contextual embeddings change based on the surrounding words in a sentence. [09:10]

Q (easy): What are the two main components of a Transformer?
A: Encoder and decoder. [12:14]

Q (medium): Which Transformer component does BERT use, and which does GPT use?
A: BERT uses only the encoder, while GPT uses only the decoder. [15:27]

Q (medium): What is the purpose of positional embeddings?
A: To encode the order of words in a sequence, as the Transformer processes all words in parallel. [20:16]

Q (hard): What does the formula `softmax(Q * K^T / sqrt(d_k)) * V` represent?
A: The attention mechanism, where Q is Query, K is Key, V is Value, and d_k is the dimension of the Key. [41:19]

Q (hard): What is multi-head attention, and why is it used?
A: It uses multiple attention heads in parallel, each focusing on different aspects (e.g., adjectives, verbs) to enrich the contextual understanding of each token. [42:20]

Q (hard): What is the role of the feed-forward network after multi-head attention?
A: It applies a nonlinear transformation to each token embedding, enabling the model to learn complex patterns and higher-order features. [46:12]

Q (hard): How does cross-attention differ in the decoder for translation?
A: The Query comes from the decoder (the translated sentence), but the Key and Value come from the encoder (the original sentence). [51:23]

💡 Key Takeaways

📊 Transformer Foundation [0:02]

Establishes that GPT is based on the Transformer architecture, the foundation of modern AI.

📊 Word Embedding Magic [4:06]

Demonstrates that word embeddings enable mathematical operations like 'King - Man + Woman = Queen', showing semantic relationships.

💡 Architecture Overview [12:13]

Clearly explains the encoder-decoder structure of the Transformer, which is essential for understanding its workflow.

🔧 Attention Mechanism [21:34]

Introduces the core idea of attention, where words 'attend' to each other to form contextual embeddings.

💡 Multi-Head Attention Purpose [42:20]

Explains why multiple attention heads are needed: to capture different types of relationships (e.g., syntactic, semantic).

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Math with Words: King - Man + Woman = Queen! (32s)

Shows the surprising ability to do arithmetic with word vectors, a mind-blowing insight into AI.

Why One Word Can Mean Many Things (Static vs Contextual) (57s)

Demonstrates a fundamental limitation of AI models and how they overcome it with context.

Attention Explained Like a College Professor (48s)

Relatable analogy makes complex attention mechanism easy to understand.

AI Training: No Labels Needed! (33s)

Explains self-supervised learning, a core concept behind modern AI advancements.

AI's Secret: Multiple Attention Heads for Different Jobs (53s)

Reveals how AI models focus on different linguistic aspects simultaneously.

[00:00] ChatGPT is powered by a model called

[00:02] GPT which is based on a deep learning

[00:05] architecture called Transformers

[00:07] Transformers is the reason behind modern

[00:09] day AI boom as an AI Enthusiast when you

[00:12] start learning Transformers you will

[00:14] come across this complex diagram which

[00:16] will start giving you a headache

[00:18] immediately my goal for today's video is

[00:21] to explain you Transformers in a most

[00:23] simplified and intuitive manner we need

[00:26] to cover many different topics so this

[00:28] is going to be a long video attention

[00:31] and patience is all I need from you

[00:33] today when you type in a sentence in

[00:35] Gmail it tries to predict next word or

[00:37] next set of words this is possible

[00:39] because of a machine learning model

[00:41] called language model Google for example

[00:44] has this popular language model called

[00:46] BERT which is powering hundreds of AI

[00:49] applications throughout the world GPT

[00:52] which is a model behind ChatGPT is a

[00:55] large language model the reason it is

[00:57] called large language model is because

[00:59] it has billions of parameters it is much

[01:02] more capable and Powerful compared to

[01:05] BERT and it is trained on humongous

[01:07] amount of data fundamentally though it

[01:10] is also doing the same thing which is

[01:12] when you type in a question in ChatGPT

[01:15] it will predict the next word in that

[01:18] sentence and then it will take the

[01:20] original question and the next predicted

[01:23] word as an input and then predict the

[01:25] next word and then the next word and so

[01:28] on in the end it produces a complete

[01:31] answer which almost sounds like a magic

[01:33] to summarize the goal of a language

[01:36] model is to predict a next word in a

[01:39] sentence now that we have understood

[01:41] this fundamental let's look into some of

[01:44] the topics which needs to be clarified

[01:46] before we dig into the actual

[01:48] architecture the first concept we need

[01:50] to understand is word embedding machine

[01:53] learning models do not understand text

[01:56] they understand numbers so we need to

[01:58] represent text as numbers let's say you

[02:01] have this word King you want to

[02:03] numerically represent it how would you

[02:06] do that well you can assign a fixed

[02:08] number you can have a vocabulary and you

[02:10] can assign just fixed static number but

[02:13] that will not capture the meaning of it

[02:15] when you're building a language model

[02:18] you have to represent words in such a

[02:21] way that they capture the meaning of

[02:24] that word one way to capture the meaning

[02:27] of this word King and represent it

[02:30] numerically is to ask bunch of questions

[02:33] for example does this person has

[02:35] Authority yes one do they have a tail no

[02:38] horse has a tail King doesn't have a

[02:40] tail are they Rich yes gender minus one

[02:42] is male one is female and so on what we

[02:46] just did is we created this Vector list

[02:50] of numbers which is a vector to

[02:53] represent this word King similarly we

[02:55] can represent the word queen as well not

[02:58] only that we can represent bunch of

[03:01] words such as battle horse King and so

[03:04] on by asking set of questions okay so

[03:07] for battle they don't have authority are

[03:09] they an event yes do battle has tail no

[03:13] battle is an event it it doesn't have

[03:15] tail and so on similarly horse do they

[03:17] have authority well we'll just say 01 if

[03:21] it is King's horse maybe they have some

[03:23] Authority or maybe they have authority

[03:25] over their horse kids and so on

[03:29] similarly we can represent all these

[03:31] words in a numeric format and then we

[03:34] can take the vector of this word King

[03:39] and maybe we can do some mathematics

[03:41] with it we can say King minus man so

[03:45] here I'm taking the vector of man right

[03:49] which is uh this particular Vector plus

[03:51] woman which is this particular vector

[03:54] and when you do the math which is like 1

[03:56] - 2 + 2 will be

[04:00] Z and so on you get a vector which looks

[04:04] similar to Queen now this sounds like a

[04:06] magic we can do math with Words which is

[04:09] King minus man plus woman is equal to

[04:13] Queen here King was represented in five

[04:16] Dimensions when you look at the real

[04:19] life for example Google's Word2Vec

[04:22] model it has 300 dimensions and what are

[04:26] all these questions by the way well we

[04:28] actually don't know this has been

[04:30] trained through a neural network and we

[04:34] have processed huge amount of text such

[04:36] as all the Wikipedia articles all the

[04:39] books and text on internet to understand

[04:42] the relationship between these words and

[04:45] through that neural network training

[04:47] back propagation we came up with this

[04:49] Vector the example that I gave for King

[04:53] where we asked these questions Authority

[04:55] and so on that was just a madeup example

[04:58] for building intuition for word

[05:01] embeddings in real life we do not know

[05:05] what all these number means all we can

[05:08] say is these numbers are the features

[05:11] for this word and they capture the

[05:14] meaning of this word King let's say king

[05:16] is a three-dimensional embedding if you

[05:20] have to represent that in this 3D

[05:23] embedding space you can represent it

[05:25] like this where x axis has three like

[05:28] this three number y-axis has this eight

[05:30] number z-axis has this two number and so

[05:33] on I use three dimensions because as

[05:35] humans we can view only three dimensions

[05:38] we can't possibly view this 300

[05:40] Dimension okay but mathematically those

[05:43] 300 dimensions are possible models like

[05:46] GPT uses an embedding Vector which is

[05:49] even 12,000 Dimensions okay so it's a

[05:53] very rich High dimensional space that we

[05:55] are working with in threedimensional

[05:57] space you can have vectors for King king

[05:59] and queen that looks something like this

[06:02] and if you look at this Vector which is

[06:05] joining king and queen you can think of

[06:08] that as a gender Direction the benefit

[06:11] of this gender Direction Vector is that

[06:14] when you have another embedding for

[06:16] Uncle you can add that gender Direction

[06:19] and get the embedding for word Aunt

[06:23] similarly if you have father you can get

[06:25] mother if you have man you can uh get

[06:28] woman and so on and that allows you to

[06:30] do this amazing math such as king minus

[06:33] Queen plus uncle is equal to Aunt

[06:35] another example is you can have country

[06:38] to Capital City Direction Vector which

[06:41] you can use to add it in this embedding

[06:45] of Russia to get the embedding of Moscow

[06:48] you can do Russia minus Moscow plus

[06:50] Delhi equal to India now the embedding

[06:53] that we're talking about are static

[06:55] embeddings Word2Vec and GloVe are two popular

[06:59] models

[07:00] which helps you get the static embedding

[07:02] static means fixed embedding for all

[07:05] these words you may ask how these

[07:07] embeddings are generated well I already

[07:09] answered the question which is you train

[07:11] a neural network model on humongous

[07:14] amount of text Wikipedia books and so on

[07:16] to understand the relationship between

[07:19] the words I'm not going to go into the

[07:21] mathematical details of Word2Vec you can

[07:24] refer to some other material on internet

[07:27] I have YouTube videos for that I'm not

[07:30] going to go into the math of that but

[07:33] just think that uh the neural network

[07:35] tries to understand the relationship

[07:37] between these words and creates these

[07:39] static embeddings now the problem with

[07:41] static embedding is that you can have a

[07:44] static embedding for this word track but

[07:48] based on a sentence that track can

[07:50] mean different things right like here

[07:52] I'm saying the train will run on the

[07:54] track and my package is late help me

[07:56] track it so the meaning of track is

[07:59] little different and when you have

[08:00] static embedding you get into this

[08:03] problem where you're not able to

[08:04] represent this word properly based on

[08:07] the context of this sentence you will

[08:10] see same issue here for Dish you can

[08:13] have a fixed embedding but in this

[08:16] sentence I'm talking about rice dish if

[08:19] I had a cheese dish then the embedding

[08:21] of dish should be a little different

[08:23] because the meaning of that dish word is

[08:26] little different when I say rice dish

[08:27] versus cheese dish when you are working

[08:30] on predicting the next word for this

[08:33] sentence you can have words such as

[08:35] risotto itly Mexican rice but when I say

[08:39] I made an Indian rice dish called all of a

[08:42] sudden the probabilities of my next

[08:44] words will change I will have words such

[08:47] as idli biryani kheer if I add one more

[08:51] adjective and say I made a sweet Indian

[08:53] rice dish in that case again it will

[08:56] change I will not have Biryani as a next

[08:59] word prediction I will probably have kheer

[09:01] or Pongal to summarize to build an

[09:04] application like ChatGPT just the static

[09:08] word embeddings are not enough what you

[09:10] need is contextual embedding let's

[09:13] understand contextual embedding a bit

[09:15] more in detail when you represent this

[09:18] word dish in your embedding space it is

[09:21] a static embedding when you say rice

[09:25] dish maybe there is another Vector in

[09:28] the same space which can accurately

[09:31] describe rice dish or which can correctly

[09:36] capture the meaning of word rice dish

[09:38] which can be risotto biryani and so on the

[09:41] direction from dish to rice dish we can

[09:43] call it riceness and when you add that riceness

[09:46] Vector to Dish what you get is the

[09:49] embedding for a rice dish there is

[09:52] another vector or embedding for Indian

[09:55] rice dish and to go from rice dish to

[09:57] Indian rice dish you need to probably add

[09:59] this vector or a direction called

[10:01] indianness and same thing for sweet

[10:04] Indian rice dish in order to generate

[10:07] contextual embedding what we need to do

[10:10] is take the original static embedding

[10:13] for the word dish and have all these

[10:16] other words influence that static

[10:19] embedding or change that static

[10:22] embedding so that it can capture the

[10:25] meaning of all these adjectives once you

[10:27] have done that and once you have a a

[10:29] contextual embedding for dish it won't

[10:31] be hard to predict the next word which

[10:33] is kheer look at this another sentence where

[10:37] I'm saying D loves Dosa Dola and Millet

[10:40] and so on B loves pasta and so on they

[10:43] both went out for a dinner and here BN

[10:47] said bro we'll go to a restaurant that

[10:49] you like and after some time they were

[10:52] in Indian restaurant now the way you

[10:55] predicted this word Indian was based on

[10:58] this cont

[11:00] such as D LS all these items which are

[11:03] part of the Indian Cuisine also bin said

[11:08] to D bro will go to a restaurant you

[11:11] like if bin said here instead of you if

[11:14] he had said I this word will become

[11:18] Italian instead of Indian right also

[11:22] instead of B if there was double here

[11:25] then also this thing will become Italian

[11:29] so you can understand that this uh

[11:31] prediction Indian uh is influenced by

[11:35] not just the few words which are prior

[11:38] to that word but it can be influenced by

[11:41] some words which are far out in that

[11:43] paragraph okay to summarize the

[11:45] objective for this intelligent teacher

[11:48] cat is to generate a contextual

[11:52] embedding and if you think about this

[11:53] embedding space mathematically speaking

[11:56] what you're doing is taking the static

[11:59] embedding for the word dish and then

[12:01] adding the embeddings for all these

[12:05] vectors riceness indianness all these

[12:07] adjectives and getting your final

[12:10] contextual embedding let's now dig into

[12:13] the Transformer architecture the

[12:14] architecture has two components encoder

[12:18] and decoder the purpose of encoder is to

[12:21] take the input sentence and generate the

[12:24] contextual embedding for each of the

[12:26] words or each of the tokens in that

[12:29] sentence once the contextual embedding

[12:32] is generated we feed that to a decoder

[12:36] here and we try to predict the next word

[12:39] so if you're working on the next word

[12:42] prediction you will predict the next

[12:44] word here for example it will be here

[12:46] when you talk about natural language

[12:48] processing there are multiple tasks so

[12:50] one task is to predict the next word the

[12:53] other task would be to translate the

[12:55] sentence here I'm translating the

[12:58] sentence from English English to Hindi

[13:00] in that case you will still produce the

[13:03] contextual embedding from encoder you

[13:06] feed that to decoder and decoder will

[13:10] start predicting the next word so here

[13:12] it will start with this fixed start

[13:15] token and then it will uh produce the

[13:19] probability of the next word so here the

[13:22] probability of this word man is highest

[13:25] and here you can have the entire

[13:27] vocabulary for example in the case of

[13:31] BERT you have some 30,000 words so

[13:34] you'll have all the words in your

[13:36] language and you will say okay what is

[13:38] the highest probability of my next word

[13:41] then you put that word man into this

[13:44] input okay so there are two inputs

[13:47] actually one is the contextual embedding

[13:49] which is coming uh from here from the

[13:52] encoder and the other one is whatever

[13:55] output you have produced so far from the

[13:57] previous step you feed that okay as an

[14:00] input here and then it will produce the

[14:03] next word which is K once again you

[14:06] provide key here and it produces banai

[14:10] So eventually it produces the entire

[14:13] translated sentence all right so that's

[14:15] the objective of your encoder and

[14:18] decoder part whatever I talked about so

[14:21] far I was referring to an inference

[14:25] stage of uh neural networks whenever you

[14:28] have these uh deep learning model you

[14:31] have two stages one stage is the model

[14:33] is not trained it's like a baby baby is

[14:35] not trained yet and you train them right

[14:38] you send them to school you train them

[14:41] uh in your home at some point they

[14:43] become adult and they can figure things

[14:45] out on their own similarly a machine

[14:48] learning model goes through a training

[14:49] phase and when it is ready it starts

[14:53] working on the real world problem and

[14:55] that is called inference so whatever I

[14:57] talked about for for predicting the next

[15:00] word or translating sentence I was

[15:03] referring to inference stage throughout

[15:05] the discussion we'll be referring to two

[15:07] specific models called BERT and GPT if

[15:11] you look at this architecture that's a

[15:13] generic architecture for a Transformer

[15:15] model Transformer model is a general

[15:18] concept whereas BERT and GPT are

[15:21] specific model or specific

[15:23] implementations based on Transformer

[15:26] architecture if you look at BERT

[15:27] architecture it has only the encoder

[15:31] part okay so only this part decoder part

[15:34] is missing so BERT will take the input

[15:36] sentence it will produce the contextual

[15:40] embedding and that's it whereas GPT has

[15:42] only decoder so it still takes the input

[15:45] it will produce the contextual embedding

[15:47] and so on and then here it will predict

[15:51] the next word I mean it sounds like it

[15:53] has encoder decoder both but

[15:56] fundamentally the architecture looks

[15:58] little different for GPT but it is still

[16:00] based on the Transformer architecture

[16:04] the way they're trained is you take all

[16:06] the text from Wikipedia crawl text from

[16:08] internet book Corpus and you train these

[16:12] models when you have this article for

[16:14] example and you are having this sentence

[16:16] developing an advanced crewed see if I

[16:19] give you this sentence most likely you

[16:20] will say spacecraft or vehicle as a next

[16:24] word you'll not say banana right like so

[16:26] probability of having banana as a next

[16:28] word in this sentence is very less

[16:30] whereas these two words have a high

[16:32] probability so we as humans have read so

[16:35] much text so now we have learned this

[16:38] art of predicting next word and same

[16:40] thing goes on for BERT and GPT where they

[16:45] understand the relationship between the

[16:47] words the context in which they appear

[16:49] so let's say if BERT during the training

[16:52] has encountered so much text and every

[16:55] time after this word crude if it has

[16:59] seen this word spacecraft of or vehicle

[17:03] uh it would not have seen words like

[17:05] crude chair or crude banana right that

[17:08] kind of words usually when it is going

[17:10] through training it will not see it so

[17:12] it will learn to predict the high

[17:14] probability words right so for word like

[17:16] banana probability is going to be lower

[17:19] same thing for this article when you go

[17:21] through this kind of sentences right SI

[17:24] engaging both alliances and hostilities

[17:27] and there will be many more artic on

[17:29] battles and Warriors and everywhere

[17:32] after alliances there will be either

[17:34] hostilities or negotiations there will

[17:37] not be a word like chair so probability

[17:39] of that will be very very low now when

[17:43] you go through the training you are

[17:45] going through all these words right so

[17:47] all these words will form something

[17:49] called a vocabulary so for a Model A

[17:53] vocabulary will look something like this

[17:55] now there is a difference between a word

[17:57] and a token so for example here playing

[18:02] is a word and one of the way to tokenize

[18:05] is to have two token so one token is

[18:08] play second token is ing okay so token

[18:12] wise there are two tokens but word is

[18:14] just once but just for uh understanding

[18:19] purpose just for easy explanation you

[18:21] can think of word as tokens technically

[18:23] they're different but you can think of

[18:25] words as token only okay so let's say

[18:28] you have a vocabulary of all these

[18:30] tokens let's say

[18:32] 30,000 words what happens here is for

[18:36] each of these words during the training

[18:39] it will create those static embeddings

[18:42] so for word made or let's say for word

[18:45] and seven is the index and let's say

[18:49] this is the static embedding vector and

[18:52] the dimension or the size see there is a

[18:54] dot dot dot so what is the size of this

[18:56] well for BERT it is 768 for GPT it's

[19:02] 12,228 right so based on model the

[19:04] dimension of your embedding Vector can

[19:07] vary when you go through this training

[19:11] for every token in your vocabulary you

[19:14] will have this static embedding and this

[19:17] whole table is called Static embedding

[19:21] Matrix during the training it will also

[19:24] learn few other things such as WQ WK WV

[19:28] and you are like what the hell this is

[19:30] well we will talk about this later but

[19:33] for now just remember when you train

[19:35] these models they are having this static

[19:39] word embedding matrix which is the

[19:41] static embedding for every token in your

[19:44] entire vocabulary as well as they're

[19:46] having the special matrices WQ WK WV

[19:50] which we'll talk about later let's have

[19:52] a look inside the encoder and review

[19:55] this specific two steps so you give a

[19:59] sentence to your Transformer and it will

[20:02] first tokenize it tokens are kind of

[20:05] like words but for a word "called" there

[20:08] will be two tokens call and Ed so it

[20:11] will tokenize it and there are various

[20:13] ways to tokenize your sentence this is

[20:16] one of the ways it will also add special

[20:19] tokens at the end and at the beginning

[20:22] CLS and sep sep is for separators so if

[20:25] you have two sentences between two

[20:27] sentences there will be a separator and

[20:29] CLS will be added at the beginning and

[20:32] this I'm talking about BERT then it will

[20:35] also generate token IDs so for each of

[20:38] these words there will be an index into

[20:41] your vocabulary for example made is

[20:44] 2532 which means in your entire

[20:47] vocabulary which is just like a list

[20:49] made word is at position

[20:54] 2532 if you talk about BERT it has total

[20:58] 30,522 tokens and GPT has around

[21:03] 50,000 tokens from these token ID so

[21:07] step number one was generate token and

[21:09] token IDs then you uh get the static

[21:14] embedding for each of these tokens and

[21:16] from where do you get it well we just

[21:18] saw right during the training you are

[21:21] generating this static word embedding

[21:23] matrix so for each of the words or

[21:26] tokens you have

[21:29] the static word embedding so in the case

[21:32] of BERT the size of this will be 768 if

[21:36] it is GPT it will be 12,000 you know

[21:38] that long embedding metrics so you

[21:41] produce that for each of the tokens and

[21:44] then you will also create something

[21:46] called a positional embedding now in the

[21:49] language the word order matters okay so

[21:52] if I put made before I it will change

[21:55] the meaning of that sentence so the

[21:56] order matters and the way Transformer

[21:59] works is it will process the entire

[22:02] input or sequence in parallel it is not

[22:05] like RNN where it will process these

[22:08] words one by one it will process this

[22:11] sequence All In Parallel now it needs to

[22:15] have knowledge on the order okay so for

[22:18] that it uses a special technique called

[22:21] positional embedding where it will add a

[22:26] small Vector in each of these embeddings

[22:29] okay so let's say this is the vector for

[22:32] position number one this is the vector

[22:34] for position number two and so on and

[22:37] when you get uh this resulting Vector

[22:41] this Vector will embed the knowledge of

[22:45] position so this Vector will have a

[22:48] knowledge that this is the first word

[22:50] this Vector will have a knowledge that

[22:52] this is the second word now how exactly

[22:55] that is done well there is a math behind

[22:57] it I'm not going to go into the math but

[22:59] I'm showing you the formula from the

[23:01] original Transformer paper so using this

[23:04] formula you are essentially uh deriving

[23:07] all these positional embeddings all

[23:10] right so that was step two the first

[23:13] step was to produce the static embedding

[23:16] for each of the tokens and then the

[23:19] second step is to add positional

[23:21] embedding like this is a plus sign so

[23:23] here at this point what you get is this

[23:27] kind of position

[23:29] embedding just like how my nephew needs

[23:31] my attention words also need attention

[23:34] of surrounding words in order to produce

[23:38] the contextual embedding in

[23:42] 2017 a groundbreaking research was done

[23:46] when this paper attention is all you

[23:48] need was published by bunch of Google

[23:52] researchers and that completely

[23:54] transformed the landscape of AI okay

[23:58] and this is the architecture that we are

[24:00] talking about the architecture is taken

[24:02] from this Attention Is All You Need

[24:05] paper so the way it works is when you

[24:07] have this sentence the word Indian needs

[24:11] attention from dosa, dhokla etc. you can say

[24:15] that all these words are attending to

[24:19] this word Indian even the word b instead

[24:22] of B in let's say here if I had Dil this

[24:26] will become Italian instead of Indian so

[24:29] this word bin is attending to this word

[24:32] Indian, dosa, dhokla etc. are also attending

[24:34] to this word Indian similarly uh words

[24:38] Sweet Indian rice Etc are attending to

[24:43] this word dish now how much they're

[24:46] attending to this word well that

[24:48] attention weight or attention score

[24:50] might be different for example sweet

[24:53] might be attending to this word dish by

[24:55] 36 percent let's say Indian is attending

[24:58] to it by 14 percent rice is attending to it

[25:01] by 18 percent these are the adjectives

[25:04] which will enrich the meaning of word

[25:06] dish on the other hand the word I made

[25:11] Etc are not enriching the meaning of

[25:14] word dish that much because instead of I

[25:17] if I had Rahul or Mohan or David the

[25:22] meaning of this word will not change

[25:24] that much but instead of sweet if I have

[25:27] spicy all of a sudden the embedding or

[25:30] the meaning of this dish changes because

[25:32] as a next word I will immediately have

[25:34] biryani instead of kheer the goal here is to

[25:38] build this kind of attention weight or

[25:41] attention score okay for each of these

[25:43] words it's a matrix because for dish all

[25:47] the other words in that same sentence

[25:50] are enriching the meaning of

[25:53] that word okay so for word dish let's

[25:56] say sweet is attending it by 36 percent

[26:00] Indian is attending it by 11 percent and

[26:03] so on and by the way I have just made up

[26:04] these numbers just for explanation

[26:06] purpose the word dish also uh attends to

[26:11] that word itself right because dish

[26:13] itself has some meaning dish means dish

[26:15] right so that will also attend to itself

[26:19] so for every word see right now for Dish

[26:21] I have all this scores for Rice you will

[26:24] have scores for Indian for every word

[26:27] you will try to compute these attention

[26:30] scores and then you will use this

[26:33] concept of query key and value to uh

[26:38] come up with the contextual embedding

[26:40] now let me explain you query key and

[26:42] value by going over analogy let's say

[26:44] you're going to a library looking for a

[26:47] book on quantum physics especially

[26:49] Quantum computation you might have this

[26:52] query that hey I'm looking for this

[26:55] quantum physics book and this particular

[26:57] person who is a librarian will use the

[27:02] book index so he'll go to his computer

[27:04] try to search for that book or maybe he

[27:06] will go to this rack and locate a

[27:08] specific rack which has a label quantum

[27:11] mechanics okay so for him the key or the

[27:15] index to locate that book is the label

[27:19] on the rack you know in library you see

[27:21] like history drama science those kind of

[27:24] labels or you have book description okay

[27:28] so based on book description the rack

[27:31] label you will figure out the

[27:34] appropriate book so The Book Rack book

[27:37] description Etc is called key and then

[27:41] the actual book content is your value so

[27:44] let's say you pull this book okay and

[27:46] whatever content the actual content of

[27:48] that book is value let me give you

[27:51] another example let's say there is a

[27:53] college professor who wants to write an

[27:55] essay on Quantum Computing and he needs

[27:57] help of a bunch of students so when

[28:00] he talks to these students Mohan says

[28:03] that I know linear algebra Meera says

[28:06] that I know quantum mechanics Bob will

[28:09] say hey I know philosophy same way Kathy

[28:12] knows computer science so here whatever

[28:16] Mohan Meera Bob Kathy are claiming about

[28:20] their knowledge is called key and what

[28:24] happens after that is each of these

[28:26] students will start writing an essay so

[28:29] teacher will say okay just go and write

[28:33] um a bunch of paragraphs so Meera Mohan

[28:36] Kathy Bob wrote all these paragraphs

[28:39] which are called value and then teacher

[28:42] knows that Meera knows most about quantum

[28:45] mechanics okay so he will take 60% of

[28:49] Meera's content or Meera's value he will

[28:53] take 29% of Kathy's value because

[28:56] computer science and quantum Computing

[28:58] so that's kind of related so he will

[29:01] use 60% of Meera's content 29% of Kathy's

[29:05] content to formulate that final essay on

[29:09] the other hand Bob's content he will use

[29:12] only one percent because the query and

[29:15] key are not matching that much see Bob

[29:17] has a knowledge on philosophy but our

[29:19] query requires Quantum Computing so

[29:22] query and key we can say they're not

[29:24] matching in terms of math you can think

[29:26] about dot product so let's say dot product

[29:29] between query and key vector is less

[29:32] let's say only one percent okay but in

[29:35] the other case Meera's query and key dot

[29:38] product is higher let's say 60% so you

[29:40] will take 60% of Meera's value which is

[29:44] the essay written by Meera on Quantum

[29:48] Computing now same way for our sentence
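
The professor analogy can be sketched as a weighted mix of value vectors. The 60%, 29% and 1% weights come from the talk; the 10% for the linear algebra student is an assumption added so the weights sum to 1, and the three-number "essays" are hypothetical stand-ins for real value vectors. Student names are spelled here as Meera, Kathy, Mohan and Bob.

```python
# Attention weights from the analogy (Mohan's 10% is an assumption).
weights = {"Meera": 0.60, "Kathy": 0.29, "Mohan": 0.10, "Bob": 0.01}

# Hypothetical value vectors: what each student's essay contributes.
values = {
    "Meera": [1.0, 0.0, 0.0],
    "Kathy": [0.0, 1.0, 0.0],
    "Mohan": [0.0, 0.0, 1.0],
    "Bob": [1.0, 1.0, 1.0],
}

def blend(weights, values):
    """Weighted sum of value vectors: more weight means more of that essay."""
    size = len(next(iter(values.values())))
    out = [0.0] * size
    for name, w in weights.items():
        for i, x in enumerate(values[name]):
            out[i] += w * x
    return out

final_essay = blend(weights, values)
print(final_essay)
```

Bob's essay barely moves the result because his key (philosophy) hardly matches the query (quantum computing).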

[29:51] the query for Dish is I want to know

[29:54] about my modifiers okay I'm just giving

[29:56] you analogy by the way the real

[29:59] working is little different but let's

[30:01] say you are generating contextual

[30:03] embedding for the word dish and the

[30:05] query may look something like I want to

[30:08] know about my modifiers right like my

[30:10] adjectives all these adjectives which

[30:12] modifies my meaning and the key will be

[30:16] uh the description that each of these

[30:18] words are giving about themselves for

[30:20] example I will say I'm the subject of

[30:22] the sentence made will say I indicate an

[30:25] action or a verb similarly sweet will

[30:27] say say I am an adjective describing

[30:30] taste and so on so these are called keys

[30:34] and based on the dot product between

[30:37] query and key yeah you're trying to find

[30:39] out you know which things are matching

[30:42] so if if dish wants to know about

[30:44] modifiers I think these are the

[30:46] adjectives which modifies the meaning of

[30:49] word dish so the score attention score

[30:52] for these will be higher whereas the

[30:55] attention score for these will be lower

[30:58] now once you get all these attention

[30:59] scores you need value so each of these

[31:02] words will now say the value value means

[31:05] uh the component that it is contributing

[31:09] to that query so I will say Indian will

[31:12] say the style or origin is Indian sweet

[31:14] will say The Taste is sweet similarly

[31:17] all these words will have specific value

[31:21] and then uh let's consider the values of

[31:24] only these four words I mean as such it

[31:25] will use values of all the words but for

[31:27] simplicity let's say only these four words

[31:30] these values by the way will be some

[31:33] kind of vector we'll look into how

[31:35] exactly those vectors are derived but

[31:37] let's say these values are all these

[31:39] vectors and query also has like dish

[31:42] also has its own Vector right like this

[31:43] is the static embedding so this is its

[31:46] own vector and now what you do is to the

[31:48] static embedding you add all these

[31:50] vectors and all these vectors you can

[31:52] think about as riceness Indianness okay so

[31:56] see this is how you add all of them okay

[32:00] you add all of them actually the vector

[32:03] of all the other words and you get the

[32:06] final context-aware embedding in

[32:08] terms of the embedding space it is like

[32:11] going from dish to riceness Indianness and

[32:14] so on so these vectors right riceness

[32:16] Indianness sweetness are these vectors

[32:20] okay this is just a mathematical

[32:22] representation now let's look at how

[32:24] those vectors are built so here you have

[32:27] a query for Dish okay so let me just

[32:30] represent it as a horizontal right this

[32:32] was a vertical format this is horizontal

[32:34] format the same thing for each of these

[32:37] words or tokens you will first get their

[32:40] embedding from our static embedding

[32:42] Matrix okay so these are static

[32:44] embeddings for each of these words in

[32:46] the case of BERT the dimension is 768

[32:48] for GPT it is 12,000 something let's say

[32:51] for word dish this is my embedding let's

[32:54] call it E7 that E7 you will multiply

[32:58] with a special Matrix called WQ which

[33:03] will have a

[33:05] dimension uh of 64 by 768 so 768 is the

[33:10] columns in order to perform matrix

[33:12] multiplication The Columns in the first

[33:14] Matrix should be equal to rows in the

[33:16] second matrix so this is 768 this is

[33:19] 768 the rows in the first

[33:22] matrix is 64 for BERT for GPT it is

[33:26] different and when you do

[33:28] uh this kind of matrix multiplication

[33:31] you will get uh this query vector okay so

[33:35] you will multiply this row with this

[33:38] column okay so you multiply 50 with this

[33:42] 0.9

[33:44] minus5 with

[33:46] 1.07 65 with this and then you add them

[33:50] all up you put them here then you take

[33:53] the second row multiply 23 with this

[33:56] minus 71 with this 1.58 with this and

[33:59] you put that here and so on okay so this

[34:03] is how you build a query Vector now WQ

[34:06] here knows how to encode query of a

[34:10] token for attention computation when we

[34:14] train the model we already got the WQ

[34:17] and WQ after the training is done it it

[34:22] doesn't change okay after you do that

[34:24] training sometime it is referred as

[34:26] pre-training on huge amount of data you

[34:30] build this WQ Matrix which doesn't

[34:32] change okay so for a trained model uh this

[34:35] WQ will not change you multiply that

[34:37] with specific embedding E7 let's say

[34:41] this is a positional embedding you get

[34:43] Q7 which is the query Vector for the

[34:47] word dish and you repeat the same

[34:49] process for all the words okay so how

[34:51] you have Q7 for dish for Rice you will

[34:54] have q6 Indian you have uh Q5 and so on
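
A minimal sketch of the row-by-column multiplication that turns an embedding into a query vector. The sizes are toy ones (a 2 by 4 matrix instead of the 64 by 768 quoted for BERT), and the numbers in `W_Q` and `e7` are made up; in a trained model `W_Q` would be learned.

```python
def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    # The number of columns must equal the length of the vector.
    assert all(len(row) == len(vector) for row in matrix)
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Made-up embedding for "dish" (token 7), 4 numbers instead of 768.
e7 = [0.5, -0.5, 1.0, 2.0]

# Made-up query matrix, 2 rows instead of 64.
W_Q = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]

# Each output entry is one row of W_Q multiplied with e7 and summed up.
q7 = matvec(W_Q, e7)
print(q7)
```

The key and value vectors are produced the same way, with `W_K` and `W_V` in place of `W_Q`.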

[34:58] to summarize WQ here knows how to encode

[35:03] query of a token for attention

[35:05] computation and remember in one of the

[35:07] previous slides I said that when the

[35:10] model is trained it will have static

[35:12] embedding matrix but it will also have

[35:14] this WQ WK WV and that is what I was

[35:18] referring to okay so we just talked

[35:20] about WQ here the question now is during

[35:23] the training how exactly we get WQ WK WV

[35:27] well we take this Transformer

[35:29] architecture and we train it on huge

[35:32] amount of data so we take all the

[35:33] Wikipedia text and we generate this kind

[35:36] of X and Y pairs okay so you don't have

[35:39] to manually label it this is called uh

[35:42] self-supervised data set uh you don't

[35:45] need a person to label it because you

[35:47] can just split a sentence you can have a

[35:48] sentence and the next word is your y

[35:51] okay so this is your X this is your y
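
The self-supervised split can be sketched in a few lines: every prefix of a sentence becomes an X and the word that follows becomes its Y, so no human labelling is needed. The example sentence is an assumption reconstructed from the words used in the talk.

```python
def next_word_pairs(sentence: str) -> list[tuple[str, str]]:
    """Turn one sentence into (X, Y) pairs: X is a prefix, Y is the next word."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

pairs = next_word_pairs("I made a sweet Indian rice dish")
for x, y in pairs:
    print(f"X = {x!r:35} Y = {y!r}")
```

Real training pipelines work on token IDs rather than whole words, but the idea of deriving labels from the text itself is the same.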

[35:53] you feed X as an input and when the

[35:56] model is not trained it will not

[35:58] predict right things it will make error

[36:00] so let's say for this it produces Mexican

[36:03] which is your y hat okay it's a

[36:05] predicted value your actual value is

[36:07] Indian so that is why you calculate

[36:09] error and then you back propagate that

[36:12] error through back propagation and chain

[36:15] rule partial derivative and so on folks

[36:17] you need to have understanding of how

[36:20] back propagation Works what is a chain

[36:22] rule you need to know all those deep

[36:24] learning fundamentals okay I have

[36:26] covered that in other modules if you're

[36:29] part of my courses or boot camp you

[36:32] would have seen those if you're watching

[36:33] it from YouTube Again YouTube has uh

[36:36] these kind of tutorials my channel has

[36:38] these tutorials so you need to know how

[36:40] the back propagation Works essentially

[36:44] you are feeding this data set you're

[36:45] Computing the error and you're back

[36:47] propagating it throughout this

[36:49] architecture and during that back

[36:51] propagation when let's say you train

[36:53] this on millions and millions of

[36:55] sentences that is the time when uh this

[36:59] WQ WK WV will be finalized inside this

[37:04] model architecture now going back for

[37:08] Dish query we computed this particular

[37:11] query Vector next step is to compute the

[37:16] key vectors okay so I gave this kind of

[37:18] analogy description to uh get you an

[37:21] intuitive understanding but in reality

[37:24] these will be the vectors so let's see

[37:27] how those vectors are formed so here I'm

[37:30] taking the first token I and the keys

[37:33] look something like this okay so here

[37:36] you will take the positional embedding

[37:38] the static embedding for the word I and

[37:41] you will multiply that with another

[37:44] magical Matrix WK once again WK after

[37:48] your pre-training after that model is

[37:50] trained it is fixed so you take that

[37:53] Matrix and you uh figure out your K1

[37:57] okay here WK knows how to encode key of

[38:01] a token for the attention computation

[38:05] then you go to the next word compute K2

[38:08] next word compute K3 you do that for all

[38:11] the words so now for all these words we

[38:15] have these key vectors okay so you have

[38:18] Q7 uh query Vector you have key vectors

[38:22] and you take the dot product between

[38:24] these two okay so K1 dot Q7 okay so

[38:29] if you take these dot product between

[38:31] these two vectors you'll get some number

[38:34] right like

[38:36] 3.33 57 101 whatever that number is it

[38:40] it's a single number you will get that

[38:43] for all the tokens okay and then you let

[38:48] it pass through a soft Max function from

[38:51] deep learning fundamentals you should

[38:52] know about softmax softmax will convert

[38:55] a bunch of values into a probability distribution

[38:57] so that when you add all these

[38:59] values it will be one so soft Max is

[39:02] converting all these discrete values

[39:05] into probability distribution so that

[39:07] you can express them as percentages and
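
A small sketch of the softmax step, using only the standard library. The seven raw scores are made-up stand-ins for the dot products between Q7 and K1 to K7; subtracting the maximum before exponentiating is a common numerical-stability trick, not something stated in the talk.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up dot products K_i . Q7 for a seven-token sentence.
raw_scores = [1.0, 0.5, 0.2, 3.0, 2.0, 2.3, 1.5]
weights = softmax(raw_scores)

for i, w in enumerate(weights, start=1):
    print(f"token {i}: {w:.1%}")
```

The token with the largest raw score receives the largest share, and the shares can be read directly as percentages.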

[39:10] the sum of all these percentage will be

[39:12] one mathematically you can represent

[39:15] this operation as softmax of Q times

[39:18] K^T now K^T is K transpose okay so here Q

[39:24] was a vector but if you talk about let's

[39:26] say this K right so K is k1 K2 K3 so

[39:30] it's not just one vector actually it's

[39:32] like bunch of vectors so this can be

[39:34] thought of as a matrix and to multiply

[39:37] that you need to do a transpose see if

[39:39] you are multiplying Q7 with K1 like a

[39:42] single Vector you don't need to do

[39:44] transpose but when you have Matrix you

[39:46] need to do transpose okay so we'll use

[39:48] this formula later on in the final

[39:50] attention formula but for now just

[39:52] remember that there is this kind of

[39:54] formula as a next step once you have

[39:57] computed these attention scores or

[39:59] attention weights you need to find the

[40:03] value Vector right so this was a

[40:05] descriptive uh understanding of value

[40:07] Vector but the way value vectors are

[40:09] derived is similar for each of the

[40:12] tokens you get positional embedding

[40:15] static embedding then you multiply that

[40:18] with another matrix called WV you get V1

[40:22] and here WV knows how to encode value of

[40:26] a token for attention computation okay

[40:29] so you do that for all the words so V1

[40:32] V2 V3 V4 V7 and so on okay so for all

[40:37] these words you will uh get their values

[40:41] and you multiply that with the weight so

[40:45] you will have more component from this

[40:47] V4 vector because it's like 36 percent

[40:50] but the component that you will use from

[40:52] V1 will be very less 7 percent okay so

[40:56] see the sweetness you're taking

[41:00] 36% uh here I don't have things in order

[41:03] but you essentially add all the vectors

[41:06] okay so you just add all of this

[41:09] everything okay so here I'm not showing

[41:11] everything but you kind of get an idea

[41:13] so from static embedding you go all the

[41:16] way to context aware embedding here's

[41:19] the mathematical formula for attention

[41:21] Q K V where d_k is the dimension of a key

[41:26] vector

[41:27] in case of GPT this is 128 so what they

[41:31] do is they take um the entire 12,288

[41:36] dimension right for GPT the

[41:40] dimension of the contextual embeddings

[41:42] is 12,288 and you divide it by the

[41:46] number of attention heads I think for

[41:48] GPT is 96 and that's how you get 128 I

[41:52] will explain this 96 a little later but

[41:54] there is a way to derive this number 128

[41:58] so you do division by square root of

[42:02] that just for numerical stability you

[42:04] don't want this dot products to become

[42:06] very high okay so to bring down that

[42:08] number we do kind of scaling here and

[42:11] you do soft Max and you multiply that

[42:14] with this value V so far what we talked
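
Putting the pieces together, here is a minimal sketch of the formula softmax(Q K^T / sqrt(d_k)) V for a single query, in plain Python. The two-token example and its vectors are made up; the head-size arithmetic uses the GPT-3 figures of a 12,288-wide model with 96 heads.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(keys[0])
    # Dot product of the query with every key, scaled down by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights

# Head size: model width divided by the number of attention heads.
d_k_gpt = 12288 // 96

# Toy example: two tokens with made-up 2-dimensional keys and values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
context, weights = attention(query, keys, values)
print(d_k_gpt, weights, context)
```

The query matches the first key better, so the first value vector dominates the resulting context vector.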

[42:17] about is a single attention block

[42:20] actually there are multiple attention

[42:23] blocks so that's what we'll cover next

[42:25] let's understand what is multi head

[42:28] attention so far we have seen this

[42:31] picture where you take positional

[42:34] embedding for each of the words in your

[42:35] input sequence you let it go through

[42:39] attention head which is basically taking

[42:42] this WQ WK

[42:45] WV and coming up with context-aware

[42:49] embedding so that whole portion is

[42:51] called one attention head in reality you

[42:55] have multiple attention heads okay so

[42:57] you have multiple attention heads each

[43:00] of these heads are producing their own

[43:02] context aware embedding which you will

[43:05] add them up all together to get the

[43:09] final context aware embedding now what

[43:12] is the purpose of this multiple

[43:13] attention heads one attention head will

[43:16] be working on adjectives okay so for the

[43:19] word dish sweet Indian rice Etc are

[43:22] adjectives the second attention head

[43:26] might be working on on a verb okay so

[43:29] how this verb made uh affects the

[43:32] contextual embedding of the word dish

[43:34] the third attention block might be

[43:36] looking at pronoun so you can think of

[43:40] this as looking at different aspects of

[43:43] a language or different aspects of that

[43:47] context okay for the other sentence the

[43:49] first attention head might be looking at

[43:51] a cultural context such as dosa, dhokla,

[43:54] Millet bread are all Indian Delicacies

[43:57] whereas the second attention head might

[43:58] be looking at the pronoun where instead

[44:02] of the and B if I exchange the order of

[44:05] these two uh here you will have Italian

[44:07] similarly instead of you if I say I

[44:11] again here you will have a different

[44:13] word so there is a pronoun context the

[44:15] third attention uh head might be looking

[44:17] at action and timing you know you're

[44:20] driving 20 minutes Drive Etc so the

[44:23] purpose of multiple attention heads is to

[44:26] allow the model

[44:27] to focus on different aspects or

[44:32] different types of relationships between

[44:35] tokens in a language when you have

[44:37] multiple tokens there is a different

[44:39] type of relationship between these

[44:41] tokens such as semantic positional

[44:45] syntactic uh

[44:47] simultaneously uh enriching the

[44:49] contextual understanding of each uh

[44:53] token so I want you to read this

[44:55] sentence again uh I hope hope you get an

[44:57] idea it is basically looking at

[44:59] different aspects or different

[45:01] relationship between the tokens to

[45:04] enrich the

[45:05] contextual understanding of each token
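
A simplified sketch of multi-head attention. Assumptions: each head works on its own slice of the embedding, and the slicing stands in for the learned W_Q, W_K and W_V matrices a real head would apply. The original paper concatenates the head outputs and then applies an output projection, which amounts to adding up one projected contribution per head; this sketch only does the concatenation.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(queries, keys, values):
    """Scaled dot-product attention for every token within one head."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))])
    return out

def multi_head(embeddings, num_heads):
    """Run each head on its slice of the embeddings and concatenate the results."""
    d_model = len(embeddings[0])
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    merged = [[] for _ in embeddings]
    for h in range(num_heads):
        # Simplification: slicing replaces the learned per-head projections.
        chunk = [e[h * d_head:(h + 1) * d_head] for e in embeddings]
        for row, out in zip(merged, head_output(chunk, chunk, chunk)):
            row.extend(out)
    return merged

# Three tokens with made-up 4-dimensional embeddings, split across two heads.
tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
result = multi_head(tokens, 2)
print(result)
```

Each head computes its own attention weights, so two heads can focus on different tokens for the same word.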

[45:09] so here in this particular architecture

[45:12] diagram see first we produced this uh

[45:15] static embedding then we added this

[45:17] positional encoding right so you got

[45:19] positional encoding here uh you ignore

[45:22] this normalization part for now

[45:23] normalization is simple actually it's

[45:25] like uh normalizing it to a value which

[45:28] is zero mean and one standard deviation
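
A minimal sketch of that normalization for a single embedding vector. It leaves out the learned scale and shift parameters that layer normalization normally carries, and the small `eps` constant is an assumption added to avoid division by zero.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale one vector to roughly zero mean and unit standard deviation."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]

# Made-up embedding values with a wide range.
normalized = layer_norm([2.0, 4.0, 6.0, 8.0])
print(normalized)
```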

[45:31] and then looking at V K Q kind of matrices

[45:36] to uh use multi-headed attention to

[45:40] derive ec1 ec2 these

[45:44] individual uh contextual embedding and

[45:46] you add all of them up to produce your

[45:51] final context-aware embedding which

[45:53] will come here and by the way this is a

[45:56] residual connection uh if you know about

[45:58] deep learning you will have uh this um

[46:03] residual connection that helps you uh

[46:06] with a smooth gradient flow after this

[46:09] block the next block is feed forward

[46:12] Network so you'll ask me okay I already

[46:15] have context-aware embedding now why do I

[46:18] need this feed forward Network well the

[46:21] thing is you don't have your final

[46:23] context aware embedding yet so here at

[46:26] this point

[46:27] the embeddings are enriched but they are

[46:30] not still fully furnished yet you have

[46:33] to let it go through this feed forward

[46:36] Network so what happens is you passed

[46:40] your positional embedding through bunch

[46:41] of attention heads and you got this

[46:44] enriched contextual embedding that will

[46:47] go through a fully connected neural

[46:49] network layer okay so here the input

[46:53] neurons will be same number of uh

[46:56] elements as this embedding so in case of

[46:59] BERT let's say this will be

[47:01] 768 for GPT it will be 12,288

[47:05] and then in the hidden layer you can

[47:08] have uh n number of neurons and in the

[47:11] output layer again you'll have same as

[47:13] this one because this input and this

[47:16] output will have a same size so if this

[47:18] is 768 this will also be 768 okay so you

[47:21] let it go through this feed forward

[47:24] Network and the resulting embedding that

[47:28] you get is even more enriched it's like a

[47:30] more furnished product now this neural

[47:34] network weights you know this will have

[47:36] a lot of weights and parameters those

[47:38] weights and parameters are set once

[47:40] again during that training process so

[47:43] when you're going through this XY pairs

[47:45] right your training pairs you might have

[47:48] hundreds and thousands of these

[47:50] sentences when you're training that

[47:51] Network during that training look at

[47:54] this feed forward Network you know

[47:55] during back propagation

[47:57] those weights are getting adjusted and

[48:00] it will help you refine your sentence

[48:03] further now once you get enriched

[48:06] embedding you will add that into your

[48:08] original embedding and you get the final

[48:11] now it is final now it's a final

[48:14] contextually Rich embedding so the

[48:16] purpose of feed forward network is it

[48:19] will enrich each token embedding by

[48:22] applying nonlinear transformation

[48:25] because in the attention head you are

[48:27] applying linear transformation here you

[48:30] get an opportunity to apply nonlinear

[48:32] transformation independently enabling

[48:35] the model to learn complex patterns and

[48:38] higher order features Beyond just the

[48:41] contextual relationship see multi-head

[48:43] attention is just capturing those

[48:44] contextual relationships how these words

[48:47] are related to each other but language

[48:49] is nonlinear it's not just the

[48:51] relationship right there are like some

[48:52] nuances nonlinearity complexity all of

[48:56] that can be captured by this fully

[48:58] connected layer or feed forward Network
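
A toy sketch of the feed-forward block with its residual addition. Assumptions: ReLU as the nonlinearity (as in the original Transformer paper), a 2-dimensional embedding with a 3-neuron hidden layer, and made-up weights; in a real model the weights are learned during training.

```python
def relu(x):
    return max(0.0, x)

def linear(matrix, bias, vector):
    """One fully connected layer: matrix times vector plus bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(matrix, bias)]

def feed_forward(x, W1, b1, W2, b2):
    """Expand to the hidden layer, apply the nonlinearity, project back."""
    hidden = [relu(h) for h in linear(W1, b1, x)]
    return linear(W2, b2, hidden)

def ffn_block(x, W1, b1, W2, b2):
    """Residual connection: add the network's output to its own input."""
    return [xi + yi for xi, yi in zip(x, feed_forward(x, W1, b1, W2, b2))]

# Made-up weights: input size 2, hidden size 3, output size 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b2 = [0.0, 0.0]

enriched = ffn_block([1.0, -2.0], W1, b1, W2, b2)
print(enriched)
```

The output has the same size as the input, which is what allows the residual addition and the stacking of identical blocks.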

[49:01] so to better visualize each of the words

[49:04] in your sentence let's say you have I

[49:06] made dish every word will go through

[49:09] positional embedding and every

[49:11] positional embedding goes through

[49:13] multiple attention head so the embedding

[49:15] for I will go through all the heads okay

[49:18] so in GPT if you have 96 heads it will

[49:21] go through all those 96 heads similarly

[49:24] made will also go through 96 heads and

[49:26] this this is happening in parallel it's

[49:28] not like you process I first and made no

[49:31] all of these things are happening in

[49:33] parallel and each of these uh vectors

[49:36] will also go through the feed forward

[49:38] Network in parallel at the same time right so

[49:41] the same network is available for each

[49:42] of these words and you get all these

[49:45] contextually enriched embeddings okay so

[49:48] that comes here so after feed forward

[49:51] Network here at this point you get all

[49:55] these embeddings okay and then you have

[49:59] this uh plus sign and normalization so

[50:02] normalization layer by the way this Norm

[50:05] is uh just it's ensuring that you have

[50:08] stable learning improving the gradient

[50:10] flow if you have deep learning

[50:12] fundamentals you will understand what I

[50:14] mean uh in machine learning generally

[50:16] when you have all these wide range of

[50:19] values if you normalize them let's say

[50:21] you normalize them to zero and one you

[50:23] get better control over your training

[50:26] now you also notice this Nx layers so

[50:29] Nx layers is basically for BERT let's say

[50:33] if you have a BERT base model you have 12

[50:36] such layers okay so this is a

[50:37] Transformer block so you kind of repeat

[50:40] so you have one block then after that

[50:42] you have another block so in case of BERT

[50:44] base model you have 12 layers BERT large

[50:47] you have 24 layers in case of GPT again

[50:51] there will be different number of layers

[50:52] so that's what this NX layers means all

[50:56] right finally we are done with

[50:58] understanding encoder I just want to

[51:00] summarize we had an input sequence we

[51:03] generated a static embedding here then

[51:06] here we generated a positional embedding

[51:08] then we have one Transformer block or NX

[51:11] layer where we first normalize we use

[51:14] Q K V uh to compute attention score or

[51:17] attention weight we have multiple such

[51:20] heads and then you go through

[51:22] normalization you have feed forward

[51:24] Network you kind of ADD remember like

[51:27] you have original embedding and then you

[51:28] add that output and you get the final

[51:32] contextual embedding you normalize it

[51:34] and here at this point you are getting a

[51:37] final

[51:38] contextual uh very enriched embedding we

[51:41] have covered most of the Transformer

[51:43] architecture decoder is not going to

[51:45] take uh much time so let's spend few

[51:48] minutes understanding decoder so the

[51:51] output of encoder is a contextual

[51:54] embedding or context Rich embedding

[51:57] which you give it as an input to decoder

[52:00] and decoder will produce the next word

[52:03] if you're working on next word

[52:04] prediction if you're working on language

[52:07] translation it will uh start with this

[52:10] special token called start and then it

[52:10] will produce maine then another word kheer

[52:12] banayi and so on okay so that's a goal of

[52:18] a decoder now here you will notice one

[52:22] thing which is called

[52:24] multi-headed cross attention okay so

[52:27] let's understand what exactly is cross

[52:29] attention let's say you have this

[52:31] sentence I made kheer which you want to

[52:33] translate into Hindi here you will have

[52:36] key vectors and value vectors as we have

[52:39] discussed before but the query Vector

[52:41] will be little different so query Vector

[52:42] will be start and it will be like I'm

[52:45] starting to generate translation what

[52:47] part of the input should I focus on and

[52:49] then when you have next word which is

[52:51] maine which is the first word in your

[52:54] translation it will be like I generated

[52:57] the subject what is the subject okay

[53:00] then you will have here I generated the

[53:02] subject and object help me complete the

[53:05] sentence with a verb form so the query

[53:08] part is little different see in the

[53:11] previous example you had only one

[53:13] sentence so I made kheer so you'll

[53:15] generate a query from made let's say and

[53:18] you will have key and value from same

[53:20] sentence in case of language translation

[53:23] it's little different uh so here we need

[53:25] to use cross cross attention why cross

[53:28] attention because query you are using it

[53:30] from the translated sentence in Hindi

[53:34] whereas key and values are being used

[53:37] from the original sentence in English so

[53:40] that is why it's called cross attention
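
A minimal sketch of cross attention: the queries come from the decoder side (the Hindi tokens generated so far) while the keys and values come from the encoder side (the English sentence). All vectors below are made up, and the learned projection matrices are left out for brevity.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Each decoder query attends over all encoder keys and mixes the encoder values."""
    d_k = len(encoder_keys[0])
    outputs = []
    for q in decoder_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in encoder_keys]
        w = softmax(scores)
        dim = len(encoder_values[0])
        outputs.append([sum(wi * v[j] for wi, v in zip(w, encoder_values)) for j in range(dim)])
    return outputs

# Encoder side: three English tokens ("I", "made", "kheer"), made-up vectors.
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Decoder side: two queries so far (the start token and the first Hindi word).
decoder_queries = [[0.0, 0.0], [2.0, 0.0]]

outputs = cross_attention(decoder_queries, encoder_keys, encoder_values)
print(outputs)
```

There is one output per decoder token, and each output is built only from encoder values, which is the "cross" in cross attention.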

[53:43] in the diagram you can see here the V

[53:45] and K values are coming from your

[53:47] encoder encoder has processed this I made kheer

[53:50] here okay so V and K are coming from

[53:54] encoder see this is the arrow whereas Q is

[53:58] coming from the decoder itself right so

[54:01] you know that Hindi sequence will be

[54:03] produced here so maine kheer banayi etc. so that

[54:07] query part is coming from here that is

[54:10] why it is called cross attention in the

[54:13] case of BERT we all know that there is only

[54:15] encoder part decoder part is not there

[54:18] in case of GPT the Transformer

[54:20] architecture is little different okay so

[54:22] that's the encoder part remaining part

[54:24] we have already understood now let me uh

[54:27] show one nice tool which can visually

[54:30] show you this architecture someone has

[54:32] built this nice visualization tool you

[54:34] can go to poloclub.github.io/transformer-explainer

[54:36] and here you can

[54:38] look into different examples right so

[54:41] for example let's look at this sentence

[54:43] as the space ship was approaching the it

[54:47] will try to autocomplete that word and

[54:50] say station right now each of these

[54:53] words has the space ship first you had

[54:57] this Dropout layer so we talking about

[55:00] Transformer here so it will have a

[55:02] Dropout layer the architecture is little

[55:05] customized compared to the base

[55:07] Transformer architecture then you have

[55:09] this residual connection okay so

[55:11] residual connection will take you from

[55:13] here to here if you talk about

[55:15] embeddings you have token embeddings for

[55:18] each of these right see 768 is the size

[55:21] then you have positional embedding okay

[55:24] you add all this position and you get

[55:25] final Vector this is a

[55:27] positional embedding of a sentence and

[55:31] you have residual connection then you

[55:33] have q KV computation so if you look at

[55:37] this particular block here qkv it will

[55:41] kind of visually show you how you

[55:43] compute q k v and get all those three

[55:47] vectors right like query Vector key

[55:50] vector and value vector and then uh you

[55:54] have this output which you feed it to

[55:57] MLP is your feed forward neural network

[56:00] that we talked about okay so feed

[56:01] forward neural network then you again

[56:03] have a residual block and then you have

[56:05] layer normalization and this is one

[56:08] Transformer block you have multiple of

[56:10] them like 11 okay so that Nx layer

[56:13] that I was referring to is one block so

[56:16] you have repeated blocks and in the end

[56:18] you get this kind of soft Max

[56:20] probability see the probability here is

[56:23] a station okay so just play with this

[56:26] particular Tool uh to get a better

[56:29] understanding of this thing and I want

[56:31] to give credits uh to this amazing

[56:34] channel called 3Blue1Brown so if

[56:37] you go to YouTube and type in

[56:40] Transformer explain 3Blue1Brown you

[56:43] will find all these videos so I want you

[56:45] to watch from video number 5, 6, 7, 8 onwards

[56:50] he will have more videos as well uh

[56:52] especially these three videos DL5 DL6 DL7

[56:57] these three videos you must watch it

[56:59] will enhance your understanding further

[57:02] I myself learned a lot from this channel

[57:06] so due credits to 3Blue1Brown

[57:09] all right that's it folks so that's that

[57:11] was about Transformer I know it was a

[57:12] long discussion there were many topics

[57:15] that we covered but hopefully your

[57:18] understanding is clear if you have any

[57:20] question please feel free to ask
