---
title: 'How Stable Diffusion Works (AI Image Generation)'
source: 'https://youtube.com/watch?v=sFztPP9qPRc'
video_id: 'sFztPP9qPRc'
date: 2026-06-28
duration_sec: 0
---

# How Stable Diffusion Works (AI Image Generation)

> Source: [How Stable Diffusion Works (AI Image Generation)](https://youtube.com/watch?v=sFztPP9qPRc)

## Summary

The video explains how Stable Diffusion works, a leading AI image generation model. It covers key concepts like convolutional neural networks, the UNet architecture, diffusion models, and how text prompts are encoded and used to guide image generation. The goal is to provide a conceptual understanding without heavy math.

### Key Points

- **Convolutional Layers vs Fully Connected** [03:23] — Convolutional layers are more efficient than fully connected layers for images because they use a small kernel (e.g., 3x3) to process local pixel groups, drastically reducing parameters (e.g., 25 vs 100 million for a 100x100 image).
- **UNet's Upsample-Downsample Architecture** [07:24] — The UNet architecture first downsamples an image to a low resolution to extract features, then upsamples it back to the original size. This is efficient for semantic segmentation.
- **Feature Extraction and Field of View** [10:34] — The UNet increases the number of channels (e.g., from 3 to 1024) to extract increasingly complex features, and downsamples to increase the field of view without increasing kernel size.
- **Denoising with a UNet** [12:40] — A UNet can be used for denoising by training it to predict the noise added to an image. Subtracting the predicted noise recovers a cleaner image. The process is done in many small steps for high quality.
- **Positional Encoding for Noise Level** [14:37] — Positional encoding using sine and cosine functions converts a discrete noise level (position) into a continuous vector that can be injected into the network.
- **Latent Diffusion Model for Speed** [20:35] — To speed up training, stable diffusion uses a latent diffusion model. An autoencoder compresses the image (e.g., 512x512 pixels to 4x64x64 latent space), reducing data by 50x. Denoising happens in this latent space.
- **Word Embeddings Capture Semantics** [23:11] — Word embeddings (e.g., from Word2Vec or CLIP) are vectors that capture semantic meaning. For example, 'king' - 'man' + 'woman' results in a vector close to 'queen'.
- **Self-Attention for Text** [25:56] — Self-attention layers process text by comparing query, key, and value vectors (derived from learned matrices) to understand relationships between words in a phrase.
- **Cross-Attention: Image Meets Text** [29:49] — In Stable Diffusion, cross-attention layers are used at multiple points in the UNet. The image data provides the query, and the text embeddings provide the key and value, allowing the text to guide image generation.
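
The cross-attention key point can be illustrated with a small sketch in plain Python. The random matrices stand in for learned weights, and the sizes, the softmax normalization, and the square-root scaling are assumptions taken from common Transformer practice, not details confirmed by the video.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    # Hypothetical stand-in for a learned matrix; real weights come from training.
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def cross_attention(image_features, text_embeddings, w_query, w_key, w_value):
    queries = [matvec(w_query, f) for f in image_features]  # queries come from the image
    keys = [matvec(w_key, t) for t in text_embeddings]      # keys come from the text
    values = [matvec(w_value, t) for t in text_embeddings]  # values come from the text
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for query in queries:
        weights = softmax([dot(query, key) / scale for key in keys])
        mixed = [sum(w * value[d] for w, value in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

rng = random.Random(0)
image_features = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]   # 6 positions, 4 channels
text_embeddings = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 tokens, 3 dimensions
w_query = random_matrix(5, 4, rng)
w_key = random_matrix(5, 3, rng)
w_value = random_matrix(5, 3, rng)
attended = cross_attention(image_features, text_embeddings, w_query, w_key, w_value)
```

The output has one vector per image position, each a weighted mix of text-derived values, which is how the text can influence every location in the image.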

## Transcript

We live in a world where artists are
losing their jobs because you can
generate whatever piece of art you want
with a simple text prompt within a few
seconds that looks incredibly good. More
than that, you can generate an image of
anything, even things that don't exist
in real life, just by using the right
descriptions. What happened? Just last
week, I spent 2 hours trying to connect
to a wireless printer. How did the
computers get here? This video will be
highly technical and try to explain how
Stable Diffusion works, which is
currently the best method of image
generation that we have, beating out
older technology like generative
adversarial networks or GANs. Now
you've seen the length of the video.
It's a long video, but if you go search
up other machine learning videos online
this will probably still be the least
technical out of all of them because
I've tried to cut out all the math to
make everything conceptually easier to
understand while keeping the information
mostly accurate. Look, I know you're
passionate and curious about this
technology. So, I want you to try to pay
attention. Even though, if I'm going to
be honest, you probably won't understand
a lot of it the first time you watch it
through. If you can grasp the intuition
and concepts well, and then you decide
to look more at the math and derivations
or to pursue a career in this field
everything will be a lot easier to
understand. And that's part of why I
made this video. AI is the future. And
you know what they say, when there's a
gold rush, sell shovels. Now, a lot of
people are worried about AI safety
thinking AI is going to take over the
world. Me personally, I'm not too
pressed because I can't even get
ChatGPT to solve a simple math problem
correctly. But what I can say you should
be worried about is cyber security
which our video partner today, NordVPN
wants you to learn about. The process of
making this video involved a lot of
research and developing neural networks
on Google Colab, all of which uses the
internet, and sometimes I'd be away from
home at a public library or cafe using
the free Wi-Fi. Now, you'd be surprised
how easy it is to compromise these
public networks or make a fake network
just to steal your data. And this is
called a man-in-the-middle attack. So
to make sure my bank account information
and passwords don't get stolen by a
hacker, I use NordVPN to make sure my
internet connection was securely
encrypted at all times. And other than
that, NordVPN also has a bunch of other
features like their threat protection
and dark web monitor features to protect
you against phishing attacks, password
leaks, malware, ransomware, and the
like. I mainly use NordVPN for security
but sometimes I can also use it to watch
shows that are only available in certain
other countries or get plane tickets for
cheaper prices in other areas of the
world. For example, if I want to connect
to Japan's internet, I just click on
Tokyo and I'm there. So, NordVPN is
offering an exclusive deal if you go to
nordvpn.com/gonkey
where you can get a 2-year plan with
extra months for free with a 30-day
money back guarantee. I'll leave that
link in the description and pinned
comment below. Again, that's
nordvpn.com/gonkey.
Now, let's get started. Deep learning is
all about neural networks. So, of
course, there's many different ways that
neurons can be connected to each other.
The most basic type of neural network
consists of what are known as fully
connected layers, where every neuron in
each layer is connected to every neuron
in the next layer. But in this video
we'll find that the process of image
generation with Stable Diffusion is
largely dependent on two special types
of network layers, each serving a very
important role. Here, I'm going to
introduce the first one called the
convolutional layer. Pay attention
because the second type of layer will
come along much later in the video. And
the way that it relates to convolutional
layers is kind of amazing. You see
basic fully connected layers work well
for many different types of data, but
not images because images have way too
many pixels. Imagine you wanted to do an
operation on a 100x100 image, which
outputs a new 100x100 image. Even if the
image was black and white and only had
one channel, that's 100 x 100 =
10,000 pixels. So, it's 10,000 inputs
and 10,000 outputs, which means there are
10,000 x 10,000 = 100 million
neuron connections just for a 100x100
image. And in addition, in a fully
connected layer, each input contributes
equally to each output. Which means the
relative spatial position of each pixel
is irrelevant, which kind of doesn't
make sense because obviously in an
image, pixels that are closer to each
other are more important in making up
features such as an edge compared to two
random pixels that are really far away
from each other. So for images, a better
type of layer is the convolutional layer
where each output pixel is determined by
a grid of all the surrounding input
pixels. And this is done with a 2D grid
of numbers called a kernel. Usually with
a size of like 3x3 or 5x5 where the
output pixel is determined by
multiplying the surrounding input pixels
by the corresponding number in the
kernel and then adding everything up.
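
The multiply-and-sum operation just described can be sketched in plain Python. The kernel values below are an assumed Sobel-style vertical edge detector (the exact kernel shown in the video is not in the transcript), and the tiny test image is made up for illustration.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1.

    As in most deep learning code, the kernel is not flipped.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            total = 0.0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    # Multiply each surrounding pixel by the matching kernel number.
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output

# Assumed Sobel-style kernel that responds to vertical edges.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]

# Made-up 5x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
edges = conv2d(image, VERTICAL_EDGE)  # large values only where dark meets bright

# Weight counts for a 100x100 single-channel image.
fully_connected_weights = (100 * 100) * (100 * 100)  # every input to every output
conv_weights = 5 * 5                                  # one reusable 5x5 kernel
```

The output is zero in the flat regions and large along the boundary, and the same handful of kernel numbers is reused at every position.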
For example, here's a vertical edge
detection kernel and here's a horizontal
edge detection kernel. You should start
to see why convolutions work so well for
images. If we have a 5x5 kernel instead
of a fully connected layer for a 100 by
100 image, that's only 25 parameters
that we can reuse over and over again
instead of 100 million. So now that you
understand how convolutions work, it's
time to talk about its significance to
computer vision, which is basically the
field of identifying what's in an image.
Level one of computer vision is simply
image classification, where the network
just labels what is in an image. We have
to assume there's only one object in the
image and we don't know where exactly it
is but we know what it is. Now level two
is classification with localization
where we can also only have one object
but the network also gives us a bounding
box which tells us where it is in the
image. Level three is object detection.
So now the image can have multiple
objects and we get multiple bounding
boxes and labels around each of them.
But it's still a very rough estimate of
which pixels in the image make up that object
because all the bounding boxes are just
rectangles. So it's at level four which
is semantic segmentation that each pixel
in the image gets labeled for what it
is. Now we can have the exact shape of
whatever it is in the image that we want
to identify and this is good for things
like background removal. Level five is
instance segmentation where not only
does the program classify what thing
each pixel is, it can also identify
multiple instances of that thing, like
if there are multiple people in a picture.
The invention of Stable Diffusion starts
with level four, semantic segmentation,
and specifically for biomedical images. So
we're talking about images of cells,
neurons, blood vessels, organs, and
whatnot. This is helpful for
diagnosing diseases, researching anatomy
stuff. Okay, I'm going to be honest.
It's not that important why biomedical
image segmentation is important. All
that you need to know is that people
were trying to segment images of cells.
And if you're thinking, what on earth
does this have to do with image
generation? Well, I promise you it's
going to make sense in a bit. And that's
when the genius comes. For a while,
image segmentation was inefficient and
required thousands and thousands of
training samples. For biomedical image
tasks, a lot of times there weren't
enough images. Or at least that was the
case until 2015 when a group of computer
scientists submitted a paper proposing a
new network architecture which would
then go on to be cited by over 60,000
people. This is definitely one of the
more influential breakthroughs in
machine learning. Let's talk about the
UNet.
A UNet is full of convolutional layers
in order to do semantic segmentation on
cell images. But it's kind of weird in
how it does it because it first scales
down the image to a really low
resolution and then it scales it back up
to its original resolution. That sounds
counterintuitive at first, but it's kind
of genius. And I'm going to demonstrate
with a UNet that I wrote myself. I wrote
this UNet for this fish dataset on
Kaggle from which I acquired 500 images
of different fish at a fish market along
with their corresponding black and white
masks of what pixels in the image make
up the fish. If you don't know, Kaggle
is a website dedicated to data science
and machine learning. And I'll leave a
link to the data set in the description.
So, the black and white masks are what's
known as the ground truth because that's
what we are comparing the network's
outputs against and training the network
to try to achieve. At first, the UNet's
output when you give it this fish is not
so meaningful, but after a bit of
training,
it's able to identify the shape a lot
better. The colors are like this because
the values are outside the range of 0 to
1. So the image is rendered in what's
known as pseudo color. But if we clamp
it to between 0 and one, the final
output is essentially the same as the
provided masks. Now that we have a
trained network, it's time to open it up
to see what's inside and figure out how
a UNet segments images so
efficiently. Remember, prior methods
required thousands of sample images, but
I've only given this one 500 images and
it's doing pretty well. So when this RGB
image of a fish gets input into the
UNet, it's represented in computer
memory as a 3D grid of numbers because
it has a width, height, and three
channels. So this is a three-dimensional
tensor in machine learning language. Now
at the start, the image only has three
channels to represent redness,
greenness, and blueness. But what if it
could have more channels to represent
more information like what part of the
image corresponds to the body of the
fish, what part is the cutting board,
what part's the shadow, what part's the
highlights, and so on. So that's
essentially the whole point of
convolutions. It's to extract features
from an image from how the pixels relate
to each other. And what makes
convolutions even more powerful is when
there's more channels in the image than
just one channel, because then the
kernel is a 3D grid instead of just a 2D
one.
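
To make the "3D kernel" idea concrete, here is a minimal sketch of one output value computed from a multi-channel patch, plus a weight count for a whole layer. The 3-to-64 channel layer, the 3x3 kernel size, and the averaging kernel are hypothetical choices for illustration, and biases are ignored.

```python
def apply_kernel_at(image, kernel, row, col):
    """Compute one output value.

    image:  [channel][row][col]
    kernel: [channel][i][j], a 3D grid with one 2D slice per input channel
    """
    total = 0.0
    for channel, kernel_slice in enumerate(kernel):
        for i, kernel_row in enumerate(kernel_slice):
            for j, weight in enumerate(kernel_row):
                total += image[channel][row + i][col + j] * weight
    return total

def conv_weight_count(in_channels, out_channels, kernel_size=3):
    # One 3D kernel per output channel; each kernel spans every input channel.
    return out_channels * in_channels * kernel_size * kernel_size

# Made-up 3-channel 3x3 patch of ones and a kernel that averages all 27 values.
rgb_patch = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
averaging_kernel = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]
value = apply_kernel_at(rgb_patch, averaging_kernel, 0, 0)

weights_rgb_to_64 = conv_weight_count(3, 64)
```

Each output channel gets its own 3D kernel, so the weight count grows with the product of input and output channels while staying independent of the image size.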
The first half of the UNet has all these
convolutional blocks that make the
number of channels in the image go from
3 to 64 to 128 to 256 to 512 and finally
to 1,024.
In the convolution from 64 to 128
channels, for example, each kernel is 64
layers deep and there are 128 of those
kernels. That's how the network can
extract more and more complex features
from the image. Slight issue though.
Even though the kernels get deeper and
deeper, they still have a fixed field of
view on the image. In this case, a 3x3
field of view. In order to better
extract features from the image,
obviously the kernels are going to have
to see more of the image. So, how can we
make the field of view bigger? Well,
just making the kernels bigger rapidly
increases the number of parameters that
we have, which makes it inefficient. So
the UNet uses a really smart and
efficient alternative. If we can't make
the kernels bigger, then just make the
image smaller. So after every two
convolutional blocks, the image gets
scaled down before it goes into the next
two convolutional blocks. This increased
field of view is how the network can
capture more context within the image to
better understand it. So let's see what
our fish has turned into in the middle
of the UNet, where there's the most
number of channels, but the resolution
is the smallest. Out of the 1,024
channels, we can see that some of them
highlight the body of the fish, some of
them the background, some of them
highlight the brighter area above the
fish, and some of them the darker area
below. Just as we said before. So, at
this point, the network has learned all
the information on what is in the image
but the downscaling has made it lose
information on where it is in the image.
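
The field-of-view argument from the downsampling half can be checked with a little arithmetic. This sketch assumes 3x3 convolutions and a 2x2 downsampling step with stride 2 after every two convolutions; the downsampling size is an assumption, since the transcript only says the image "gets scaled down".

```python
def receptive_field(layers):
    """Width, in input pixels, seen by one value at the end of the stack.

    layers: list of (kernel_size, stride) pairs, applied in order.
    """
    field = 1  # one output value starts out covering one input pixel
    jump = 1   # distance in input pixels between neighbouring values
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field

CONV = (3, 1)        # 3x3 convolution, stride 1
DOWNSAMPLE = (2, 2)  # assumed 2x2 downsampling, stride 2

# Ten convolutions in a row at full resolution.
without_downsampling = [CONV] * 10

# Same ten convolutions, but scaled down after every two of them (four times).
with_downsampling = [CONV, CONV, DOWNSAMPLE] * 4 + [CONV, CONV]

flat_view = receptive_field(without_downsampling)
unet_view = receptive_field(with_downsampling)
```

Under these assumptions the same ten 3x3 convolutions see a 21-pixel-wide window without downsampling and a 140-pixel-wide window with it, without adding any kernel parameters.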
So in the second half of the UNet, we
start scaling it back up again and
decrease the number of channels using
these convolutional blocks to kind of
consolidate and summarize up all that
information that we gathered in the
first half. But how do we get back all
the lost detail from the downsampling in
the first half? The answer is what's
known as residual connections where
every time the resolution is increased
the information from the previous time
the image was that resolution is
literally just slapped onto the back and
combined with it. And then the
convolutional layers mix the information
back in. If we compare the fish image at
its highest resolution in the beginning
to where it's at its highest resolution
in the end, we can see that the
different parts of the image are much
better segmented. And that's how through
one final convolution, we get this very
clean mask. Yeah. So, UNets are really
good at segmenting images. There was
this international image segmenting
competition where the people who
invented the UNet just went in there and
demolished everyone. Here's them getting
the award for it. What a bunch of nerds
to be honest. No, I'm just kidding. I
mean that in an endearing way. But
anyways, okay. When are we actually
going to get to the image generation?
We're getting there. Listen up. The UNet
is so good at identifying things within
an image that people started using it
for other stuff other than semantic
segmentation. Specifically, it could be
used to denoise an image. If a noisy
image is just the sum of the original
image plus some noise, then if you
identify the noise in the image, then
you can just minus it away to get the
original. In fact, that's exactly what
we're going to try to do. So, allow me
to demonstrate with another image of a
fish, this time with a resolution of
64x64. This time, there is no black and
white ground truth mask to go with it.
Instead, we generate a bunch of noise to
be our ground truth because that's what
we're trying to train the network to
identify. It's important that during
training, we train on many copies of the
image with different amounts of noise
added in so that it's able to denoise
really noisy images as well as
not so noisy ones. And here's where an
interesting challenge arises. How do we
provide the network with the knowledge
of how noisy each image sample is?
Because that's obviously going to affect
the outcome. So if you imagine all the
possible noise levels placed in a
sequence, the information here of how
noisy any sample is is basically a
number of that sample's position in that
sequence. So this is called positional
encoding. So let's say for this
particular image, its noise level
corresponds to the 10th position in the
sequence. Now we've got a 64x64 image
with three channels, meaning there's
12,288
numbers in total. Do we just slap a 10
on the end making it 12,289 numbers? Is
that going to work? No. So, here's how
positional encoding works. And I get it.
You might be thinking, okay, this seems
like not such a significant detail. Why
do we need to go through it? This might
be like the fifth time I've said this
but it's going to come up later again.
It's going to be important. Positional
encoding is a type of embedding which is
when you take discrete variables like
words (hint: later on) or, in this case,
positions in a sequence, and turn them into
a vector of continuous numbers to feed
to the network as a more digestible form
of information that it can then use. The
way that our 10 gets converted into a
vector of continuous numbers is using
these sine and cosine equations here, so
that the vector of numbers always stays
within a fixed range. But each position
is encoded by a unique combination of
numbers in the vector since the
different elements of the vector are
given by sine and cosine functions of
different frequencies. And then it gets
added onto the image data repeatedly at
every point in the UNet where it changes
resolution to really drill in the
information of how much noise is in the
image just to help the network get it
right. Okay, that was an information
overload, but we can finally start the
training process. And you can see that
at first the noise that the network
predicts is obviously way off. It looks
nothing like the actual noise that we
gave it. But after a while, it looks
pretty similar to the ground truth
noise. And we can't really notice any
improvements anymore. So let's instead
show the denoised version where we minus
this prediction to get an image and see
how that improves.
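
The sine and cosine encoding of the noise level described earlier can be sketched as follows. The transcript does not include the exact equations, so this uses the widely used Transformer-style formula; the base of 10000 and the vector length of 8 are assumptions.

```python
import math

def positional_encoding(position, dim=8, base=10000.0):
    """Turn an integer position into `dim` continuous numbers in [-1, 1].

    `dim` must be even: each frequency contributes one sine and one cosine.
    """
    vector = []
    for i in range(0, dim, 2):
        frequency = 1.0 / (base ** (i / dim))  # later pairs oscillate more slowly
        vector.append(math.sin(position * frequency))
        vector.append(math.cos(position * frequency))
    return vector

level_0 = positional_encoding(0)
level_10 = positional_encoding(10)  # the "10th position" example from the text
level_11 = positional_encoding(11)
```

Every value stays between -1 and 1 no matter how large the position is, while neighbouring positions still get different vectors because the frequencies differ.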
So yes, as you can see, we do end up
with the original fish image, but it is
kind of low quality and blurry. This is
because trying to go from pure noise to
the original image all in one step is
too hard. So instead, what we're going
to do is we don't get rid of all the
noise at once. We only get rid of some
of it. And then we feed it back into the
network again and get rid of a little
bit more. And then again, and then
again. And as you can see, removing the
noise in small baby steps like this
eventually gives us the clear, high
quality original image. And this is why
we've had to feed the network images of
varying degrees of noise to train on
because that's how it can work for the
whole process of denoising.
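
The structure of this loop can be sketched without a real network. In the sketch below, `oracle_predict_noise` is a hypothetical stand-in for the trained UNet that cheats by looking at the clean image, and `steps` and `fraction` are made-up values; real samplers use a carefully designed noise schedule rather than a fixed fraction.

```python
import random

def add_noise(clean, noise_scale, rng):
    """Return (noisy image, added noise). The noise is the training target."""
    noise = [rng.gauss(0.0, noise_scale) for _ in clean]
    noisy = [c + n for c, n in zip(clean, noise)]
    return noisy, noise

def oracle_predict_noise(current, clean):
    # Hypothetical stand-in for the trained UNet: returns the remaining noise exactly.
    return [x - c for x, c in zip(current, clean)]

def denoise_in_steps(noisy, clean, steps=20, fraction=0.3):
    """Remove only part of the predicted noise each step, then feed the result back in."""
    current = list(noisy)
    for _ in range(steps):
        predicted = oracle_predict_noise(current, clean)
        current = [x - fraction * p for x, p in zip(current, predicted)]
    return current

rng = random.Random(0)
clean = [0.2, 0.8, 0.5, 0.1]               # a made-up 4-pixel "image"
noisy, noise = add_noise(clean, 1.0, rng)  # noisy = clean + noise
restored = denoise_in_steps(noisy, clean)
```

Each pass shrinks the remaining noise by a constant factor, so many small steps converge on the clean image instead of relying on a single large correction.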
Now, it's obvious that the network's
going to give us the same fish all the
time because we only trained it on one
image. But allow me to demonstrate what
happens when I take 5,000 32x32 images
of ships from the famous CIFAR-10 dataset.
The CIFAR-10 dataset has 10
classes corresponding to 10 different
objects, each with thousands of 32x32
images. Now, I'm not going to lie, at
first I accidentally put all 10 classes
of images into the network, so it was
going 10 times slower than it needed to
be, and I stopped it early. But then I
looked at some of the results and while
most of it was nonsense, here is what I
believe to be a red panda wearing
sunglasses and using a green tent as a
turtle shell. And this I think is an
orange boat wearing ice skating shoes
with a mohawk.
I don't know how this happens. Maybe
that's the magic of AI. But anyways, I
started training it again on the ship
images. And at first it was giving us
rubbish. But eventually we can see that
it comes up with completely new images
of ships using the knowledge that it's
gathered. And that my friend is called a
diffusion model. Now maybe it's come to
your attention that what we have so far
is not very efficient. I mean it took
forever to train on these 32x32 images.
Imagine how long it's going to take for
an HD or a 4K image. And the reason is
we're doing the noise prediction and
denoising directly on the pixels right
now. And there's a lot of pixels meaning
a lot of data. So let's think about
whether there's a way we can reduce the
amount of data that we have to work with
to speed up this process. Imagine this.
I show you this image right here and you
have to tell your friend what's in the
image. Are you going to read out all the
values of the individual pixels to your
friend in order to transfer the
information over? No. You'll tell them a
description of blue sofa in a white room
with a cactus to its right and a coffee
table in front. And then they can use
their life experience and knowledge of
different objects to kind of imagine and
reconstruct what it's roughly supposed
to look like. It's not going to look
exactly the same, but it's going to be
good enough. Let's use another example
more relevant to computers. Usually, we
don't store images as just their raw
uncompressed pixel values, and instead
we use a file format like JPEG, which
can reduce the amount of data by many
many times. And then when the file gets
decoded to display on your screen, it's
a bit lower quality than the original
but again, it's good enough. Now, notice
how in both of these examples, there's a
process of encoding, which is you coming
up with a phrase to describe the image
or the JPEG compression. And then
there's a process of decoding, which is
your friend imagining what the image is
supposed to look like, or the JPEG
decompression. So what people invented
is a neural network equivalent of this
known as autoencoders, which are
basically trained to encode data into
what's known as a latent space and then
decode it the best that they can back to the
original data. Here's a demo of a latent
space that's been trained for the MNIST
digit dataset. In this case, the 28x28 =
784 pixels got encoded into just
two numbers, which means we can
visualize it as a two-dimensional space
and drag around this point to see what
the different areas correspond to when
it's decoded. Now, of course, if it's
five or 10 numbers in the latent space
instead of two, you can get higher
fidelity reconstructions. In Stable
Diffusion, RGB images which are 512x512,
corresponding to about 786,000
numbers, are encoded into a latent space
that's 4x64x64, or about 16,000 numbers.
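
The sizes quoted here are rounded; a quick calculation gives the exact counts. The variable names are only for illustration.

```python
# RGB image: 3 channels, 512 x 512 pixels.
pixel_values = 3 * 512 * 512

# Latent representation: 4 channels, 64 x 64.
latent_values = 4 * 64 * 64

# How many times smaller the latent representation is.
compression_ratio = pixel_values / latent_values

print(pixel_values, latent_values, compression_ratio)
```

The exact figures are 786,432 and 16,384 numbers, a 48-fold reduction.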
Now that's about 1/50th of the original amount
of data. So instead of directly adding
noise to images in their pixel space and
then denoising those images, the images
are first encoded into this latent space,
and then we noise and denoise that,
and then when it's decoded we roughly
get the original image again. This is
called a latent diffusion model and it's
one of the key improvements to the basic
diffusion model because it's so many
times faster than running denoising on
the raw uncompressed data. Now, up until
this point, we still haven't addressed
something very important. How do we make
it generate images based on a text
prompt? Not going to lie, I think this
might be where it gets really hard to
understand. So, if you don't get
anything from here on out, don't stress
about it. Because if you made it this
far in the video, that's already pretty
impressive. But anyways, here's where we
use that embedding concept from earlier.
All these words, which are discrete
variables, have to be encoded into
vectors, just like the sequence position
numbers representing how noisy the
images are. The way that people found
good embeddings for words is this method
called Word2Vec. I won't go into too
much detail, but basically they had a
list of a bunch of vectors, one for each
word in the English language. And
actually, they had two of these lists.
Then they use data of all the text
that's ever been written by humans from
books and the internet and whatnot to
try to adjust these two lists of word
vectors so that the vector of a word in
one list would be similar to vectors of
words that it often appears next to in
the other list. Similar meaning it has a
larger dot product. So for example, the
words tall building are more likely to
appear together than the words tall
electricity. So if you look at the
vector for tall in one list, it'll be
similar to the vector for building in
the other list, but not similar to the
vector for electricity in the other
list. So once you've trained it enough
the relationship between word vectors in
opposite lists is that the more likely
they are to appear next to each other
the more similar they are. But what does
this mean for the relationships between
words in the same list? In the same
list, the more likely they appear in
similar contexts, the more similar they
are. So, if we just take one of the
lists as the embedding vectors for all
the words and graph it out, we'll find
that words that are used in similar
contexts are grouped closer together.
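
The "larger dot product means more similar" idea can be shown with toy numbers. The three-dimensional vectors below are invented for illustration; real Word2Vec vectors are learned from text and have hundreds of dimensions.

```python
def dot(a, b):
    """Dot product: larger means the two vectors are more similar."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors from the first list (the word itself).
word_list = {
    "tall": [0.9, 0.1, 0.0],
}

# Hypothetical vectors from the second list (words appearing next to it).
context_list = {
    "building": [0.8, 0.2, 0.1],
    "electricity": [0.0, 0.1, 0.9],
}

tall_building = dot(word_list["tall"], context_list["building"])
tall_electricity = dot(word_list["tall"], context_list["electricity"])
```

With these made-up values, "tall" scores much higher against "building" than against "electricity", which is the pattern training is meant to produce for words that often appear together.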
Here's a cool visualization on
projector.tensorflow.org
where if you click a dot representing a
word, it shows you the closest words
around it. Of course, it seems as though
they're kind of spread out and they're
not really the closest words, but that's
because this only visualizes three
dimensions, while the word vectors
actually have 300 dimensions, meaning
each word is represented by 300 numbers.
With all those dimensions, it turns out
these word embeddings can actually
capture some of the more nuanced
relationships between words. And the
most famous example is if you take the
vector for king and you subtract the
vector for man but then you add the
vector for woman you end up with the
vector for queen. Another example is you
can take London subtract England add
Japan and then you end up with Tokyo. So
I hope you can see how genius this word
embedding vector space is. Now remember
how earlier I said there's two types of
network layers that are really important
to Stable Diffusion. And the first one
is the convolutional layer. Well, it's
time to introduce the second one which
is called the self attention layer.
Let's think back to convolutions for a
second. Convolutional layers extract
features from an image using
relationships between pixels where the
amount that each pixel influences
another pixel is dependent on their
relative spatial position. So, a self-
attention layer extracts features from a
phrase using the relationships between
the words where the amount of influence
words have on each other is determined
by their embedding vectors. To build up
the simplest possible self attention
layer, it's kind of like a fully
connected layer, but each input and
output is a vector instead of a single
number. And the weights of connections
are not parameters that it tries to
learn. Rather the weight of the
connection between A and B is determined
by the dotproduct between A and B. So in
the simplest model the output is
entirely dependent on the input since
there's no parameters that we can
control. But we do want to control it
obviously like if we have a convolution
where the kernel just has all the same
numbers then yeah sure it helps us
understand how a convolution works but
all it does is just blur the image and
it's not that useful. So how can we
control this self attention layer so
that just like we make a convolution
detect edges for example, we make it
detect, I don't know, words that negate
or emphasize certain adjectives.
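
The parameter-free layer described above can be sketched like this. The softmax step that turns raw dot products into weights summing to 1 is an assumption borrowed from standard attention; the transcript only says the weights are determined by the dot product. The input vectors are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(vectors):
    """Each output is a weighted mix of all inputs; no learned parameters."""
    outputs = []
    for target in vectors:
        # Connection weight between two vectors comes from their dot product.
        weights = softmax([dot(target, source) for source in vectors])
        mixed = [sum(w * source[d] for w, source in zip(weights, vectors))
                 for d in range(len(target))]
        outputs.append(mixed)
    return outputs

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up input vectors
outputs = simple_self_attention(vectors)
```

Because nothing here is learned, the output depends only on the input, and the raw score between two vectors is the same in both directions.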
Let's break this attention process down
into its components. In our simple
attention layer, the amount that A's
input influences B's output is
determined by the dot product. And the
amount that B's input influences A's
output is also determined by the dot
product. So it's the same. Now let's
focus on the part where A influences B.
I'm going to characterize this process
as a conversation between A and B, which
sounds hella goofy, but I promise it'll
make sense. So B goes up to A and says,
"Hello, I'm B. Here is my ID. Show me
your ID so that we can compare it and
decide how much you influence my output."
And then A says, "Yeah, I'm A. Here's my
ID and here's my data which I'll pass
over to your output after the
comparison." So B's ID is called the
query vector. A's ID is called the key
vector. The process of comparison is the
dot product between them, and A's data is
called the value vector. There, I just
explained query, key, and value, and self
attention. In our simple self attention,
of course, the query is just vector B
itself and both the key and the value
are just vector A. In other words, each
vector has to serve a total of three
purposes over the whole process even
though it's the same vector the whole
time. So here's how we introduce in
parameters to control this whole self
attention process. How do we manipulate
vectors using matrices?
So, we have three matrices called, can
you guess what they're called?
That's right, the query matrix, the key
matrix, and the value matrix. And these
are applied to each vector before they
go to serve their purpose as a query,
key, or value. Now, the amount that A
influences B is no longer the same as
the amount that B influences A. This way
it's not just similar vectors that
influence each other anymore due to the
simple dot product. Now we can use any
feature in each word's massive 300
dimension vector which is what makes
self attention layers so powerful in
extracting features in the relationships
between words. One more thing though
right now the output is still not
affected by the relative position of
each word in a phrase and that's really
important in determining the meaning of
the phrase. So in order to encode the
position of each word, we once again use
the positional encoding method covered
earlier and just add on those positional
embedding vectors to the word embedding
vectors. Wow, that was another
information overload. You need a break?
Okay, let's take a break.
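
Here is a minimal sketch of self-attention with query, key, and value matrices. The random matrices stand in for learned ones, the two word vectors are made up, and the softmax and square-root scaling are assumptions from standard Transformer practice rather than details given in the video.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(vectors, w_query, w_key):
    """scores[b][a]: how strongly input a influences output b."""
    queries = [matvec(w_query, v) for v in vectors]
    keys = [matvec(w_key, v) for v in vectors]
    scale = math.sqrt(len(keys[0]))
    return [[dot(q, k) / scale for k in keys] for q in queries]

def self_attention(vectors, w_query, w_key, w_value):
    values = [matvec(w_value, v) for v in vectors]
    outputs = []
    for row in attention_scores(vectors, w_query, w_key):
        weights = softmax(row)
        outputs.append([sum(w * value[d] for w, value in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def random_matrix(size, rng):
    # Hypothetical stand-in for a learned matrix.
    return [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]

identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
rng = random.Random(0)
w_query, w_key, w_value = (random_matrix(3, rng) for _ in range(3))

words = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]           # made-up embeddings for A and B
plain = attention_scores(words, identity, identity)  # same as the simple layer
shaped = attention_scores(words, w_query, w_key)     # with query and key matrices
outputs = self_attention(words, w_query, w_key, w_value)
```

With identity matrices the score between A and B is the same in both directions; once separate query and key matrices are applied, the two directions generally differ, which is what gives the layer something to learn.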
All right, let's resume. Think about
this. Using convolutional layers, we can
encode an image into a small embedding
vector. And using attention layers, we
can also encode a text phrase into a
small embedding vector. But look at the
image and the text that I've selected to
display here. The caption perfectly
describes the image. So imagine if the
two encoders could come up with the same
embedding vector even though they're
dealing with two different types of
data. So that's exactly what OpenAI
tried to do with their CLIP text model,
where CLIP stands for Contrastive
Language-Image Pre-training. This CLIP
text model has both an image encoder and
a text encoder and they trained it on
400 million images so that images and
captions that match are supposed to come
out with very similar embeddings and the
ones that don't match are supposed to
come out with really different
embeddings. So it kind of makes sense
that the text embeddings that come out
of this CLIP text model, which are
already matched to encoded images, are
perfect to stick into our denoising UNet,
which also encodes and decodes images as
a part of how it works with the
convolutions and scaling and all that.
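
The matching idea behind CLIP can be illustrated with toy embeddings. The vectors and labels below are invented, and cosine similarity is used as the comparison; this only shows what "matching pairs come out similar" looks like after training, not how CLIP itself is trained or implemented.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical outputs of an image encoder.
image_embeddings = {
    "photo of a cat": [0.9, 0.1, 0.2],
    "photo of a ship": [0.1, 0.8, 0.3],
}

# Hypothetical outputs of a text encoder, in the same embedding space.
text_embeddings = {
    "a cat": [0.8, 0.2, 0.1],
    "a ship": [0.2, 0.9, 0.2],
}

def best_caption(image_vector, captions):
    """Pick the caption whose embedding is most similar to the image embedding."""
    return max(captions, key=lambda c: cosine_similarity(image_vector, captions[c]))

cat_match = best_caption(image_embeddings["photo of a cat"], text_embeddings)
ship_match = best_caption(image_embeddings["photo of a ship"], text_embeddings)
```

Because both encoders target the same embedding space, a text embedding can be compared directly against image embeddings, which is why these text embeddings are a natural fit for guiding an image network.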
So yes, in Stable Diffusion we just take
text embeddings generated by CLIP and
inject them into the UNet multiple times
using attention layers. Well, this time
a slightly different type of attention.
This time it's not self attention which
just operates on one set of input
vectors. We're adding the text
information into the image. So obviously
it's two sets of input data. So instead
it's a process called cross attention
which is literally like self attention
except the image is going to be the
query and the text is going to be the
key and value. That's it. These cross
attention layers in the middle of the
UNet are going to extract relationships
between the image and the text. So that
whatever features in the image can get
influenced by the most important and
relevant features in the text. And this
is how we eventually train the network
to generate images based on the text
captions we give it. So there we have
it. Convolutional layers learn images.
Self attention layers learn text. And
then when you combine the two, you can
generate images based on text. Pretty
interesting, is it