How Stable Diffusion Works (AI Image Generation)

Transcribed Jun 28, 2026 Watch on YouTube ↗
Advanced · 10 min read · For: Individuals with a basic understanding of machine learning concepts who want a conceptual (but not mathematical) explanation of Stable Diffusion's architecture.
192.8K views · 7.3K likes · 254 comments · 124 dislikes · 3.9% 📈 Moderate

AI Summary

The video explains how Stable Diffusion, a leading AI image generation model, works. It covers key concepts like convolutional neural networks, the UNet architecture, diffusion models, and how text prompts are encoded and used to guide image generation. The goal is to provide a conceptual understanding without heavy math.

[03:23]
Convolutional Layers vs Fully Connected

Convolutional layers are more efficient than fully connected layers for images because they use a small kernel (e.g., 3x3 or 5x5) to process local pixel groups, drastically reducing parameters (e.g., 25 weights for a 5x5 kernel vs 100 million connections for a fully connected layer on a 100x100 image).

[07:24]
UNet's Downsample-Upsample Architecture

The UNet architecture first downsamples an image to a low resolution to extract features, then upsamples it back to the original size. This is efficient for semantic segmentation.

[10:34]
Feature Extraction and Field of View

The UNet increases the number of channels (e.g., from 3 to 1024) to extract increasingly complex features, and downsamples to increase the field of view without increasing kernel size.

[12:40]
Denoising with a UNet

A UNet can be used for denoising by training it to predict the noise added to an image. Subtracting the predicted noise recovers a cleaner image. The process is done in many small steps for high quality.

[14:37]
Positional Encoding for Noise Level

Positional encoding using sine and cosine functions converts a discrete noise level (position) into a continuous vector that can be injected into the network.

[20:35]
Latent Diffusion Model for Speed

To speed up training, Stable Diffusion uses a latent diffusion model. An autoencoder compresses the image (e.g., a 512x512 RGB image into a 4x64x64 latent), reducing the data by roughly 50x. Denoising happens in this latent space.

[23:11]
Word Embeddings Capture Semantics

Word embeddings (e.g., from Word2Vec or CLIP) are vectors that capture semantic meaning. For example, 'king' - 'man' + 'woman' results in a vector close to 'queen'.

[25:56]
Self-Attention for Text

Self-attention layers process text by comparing query vectors against key vectors and using the result to weight value vectors (all three derived from learned matrices), which lets the network model relationships between words in a phrase.

[29:49]
Cross-Attention: Image Meets Text

In Stable Diffusion, cross-attention layers are used at multiple points in the UNet. The image data provides the query, and the text embeddings provide the key and value, allowing the text to guide image generation.

Clickbait Check

95% Legit

"The title is completely accurate; the video is a highly technical, deep dive into the architecture of Stable Diffusion."

Tutorial Checklist

1 20:35 Use the autoencoder to encode the image into a latent space (e.g., 4x64x64).
2 12:40 Use a UNet to denoise the latent image in many small steps, guided by positional encoding of the noise level.
3 29:49 Encode the text prompt into embeddings using CLIP. Inject these into the UNet using cross-attention layers.
4 07:24 In the UNet, use convolutional layers to extract features from the latent image, downsampling and upsampling as needed.
5 25:56 Use self-attention layers to process the text embeddings, and cross-attention to combine them with image features.
6 20:35 Decode the denoised latent representation back into the final pixel image.

Study Flashcards (8)

Why are convolutional layers better than fully connected layers for processing images?

easy

Convolutional layers use small grids (kernels) that slide over the image, focusing on local spatial relationships, which is far more efficient for images than fully connected layers.

03:23

What is the key architectural trick of the UNet for efficient segmentation?

medium

It scales the image down to a very low resolution (contracting path) and then scales it back up (expanding path), using skip connections (called residual connections in the video) to preserve fine detail.

07:24

Why does the UNet scale down the image resolution in its first half?

medium

To increase the field of view of the kernels without increasing their size, allowing the network to capture more context.

10:34

How does a UNet denoise an image (as opposed to segmenting it)?

hard

It identifies the noise in the image, so it can be subtracted away to recover the original image.

12:40

How is positional encoding typically implemented to tell a network how noisy an image is?

hard

Positional encoding uses sine and cosine functions of different frequencies to create a unique vector for each position, which is then added to the data.

14:37

What is the primary advantage of a latent diffusion model over a standard diffusion model?

medium

It speeds up training and inference by working in a compressed latent space (e.g., about 1/50th of the original data) rather than directly on pixels.

20:35

In self-attention, what are the three learnable matrices used to transform input vectors?

hard

The query matrix, key matrix, and value matrix.

26:54

What is the key difference between self-attention and cross-attention as used in Stable Diffusion?

hard

Cross-attention is like self-attention, but the query comes from one set of data (the image) and the key/value come from another (the text).

29:39

💡 Key Takeaways

🔧

UNet's Downsample-Upsample Architecture

Explains the counterintuitive but genius approach of compressing an image to extract features and then expanding it for segmentation, making training very data-efficient.

07:24
💡

Latent Diffusion Speed-Up

Shows the key innovation that made diffusion models practical: working in a compressed latent space that holds about 1/50th of the data of the raw pixels.

20:35
🔧

Cross-Attention for Text Guidance

Explains how image features and text embeddings are combined using cross-attention, which is the core mechanism that allows text prompts to guide image generation.

29:49
📊

Field of View via Image Scaling

Illustrates a practical deep learning trick: instead of making convolution kernels larger (which adds parameters), you make the image smaller to increase the network's receptive field.

10:34

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Artists Losing Jobs to AI

44s

Opens with a controversial and relatable statement about AI replacing artists, grabbing viewer attention.

AI Denoising in Baby Steps

43s

Visually demonstrates the core iterative denoising process of diffusion models, which is both educational and mesmerizing.

The Genius of Latent Space

57s

Uses an intuitive analogy (describing an image to a friend) to explain a complex concept, making it highly shareable and easy to understand.

King − Man + Woman = Queen

35s

Features the famous mind-blowing word embedding example that perfectly showcases AI's ability to understand relationships, guaranteed to spark curiosity.

[00:00] We live in a world where artists are

[00:02] losing their jobs because you can

[00:03] generate whatever piece of art you want

[00:05] with a simple text prompt within a few

[00:08] seconds that looks incredibly good. More

[00:11] than that, you can generate an image of

[00:13] anything, even things that don't exist

[00:16] in real life, just by using the right

[00:18] descriptions. What happened? Just last

[00:20] week, I spent 2 hours trying to connect

[00:22] to a wireless printer. How did the

[00:24] computers get here? This video will be

[00:26] highly technical and try to explain how

[00:28] Stable Diffusion works, which is

[00:29] currently the best method of image

[00:31] generation that we have, beating out

[00:32] older technology like generative

[00:35] adversarial networks or GANs. Now

[00:37] you've seen the length of the video.

[00:39] It's a long video, but if you go search

[00:40] up other machine learning videos online

[00:42] this will probably still be the least

[00:44] technical out of all of them because

[00:46] I've tried to cut out all the math to

[00:48] make everything conceptually easier to

[00:50] understand while keeping the information

[00:52] mostly accurate. Look, I know you're

[00:54] passionate and curious about this

[00:55] technology. So, I want you to try to pay

[00:57] attention. Even though, if I'm going to

[00:58] be honest, you probably won't understand

[01:00] a lot of it the first time you watch it

[01:02] through. If you can grasp the intuition

[01:03] and concepts well, and then you decide

[01:05] to look more at the math and derivations

[01:07] or to pursue a career in this field

[01:10] everything will be a lot easier to

[01:11] understand. And that's part of why I

[01:13] made this video. AI is the future. And

[01:15] you know what they say, when there's a

[01:17] gold rush, sell shovels. Now, a lot of

[01:19] people are worried about AI safety

[01:20] thinking AI is going to take over the

[01:22] world. Me personally, I'm not too

[01:24] pressed because I can't even get

[01:26] ChatGPT to solve a simple math problem

[01:28] correctly. But what I can say you should

[01:29] be worried about is cyber security

[01:31] which our video partner today, NordVPN

[01:34] wants you to learn about. The process of

[01:35] making this video involved a lot of

[01:36] research and developing neural networks

[01:38] on Google Colab, all of which uses the

[01:41] internet, and sometimes I'd be away from

[01:42] home at a public library or cafe using

[01:45] the free Wi-Fi. Now, you'd be surprised

[01:47] how easy it is to compromise these

[01:49] public networks or make a fake network

[01:51] just to steal your data. And this is

[01:53] called a man-in-the-middle attack. So

[01:55] to make sure my bank account information

[01:56] and passwords don't get stolen by a

[01:58] hacker, I use NordVPN to make sure my

[02:00] internet connection was securely

[02:02] encrypted at all times. And other than

[02:04] that, NordVPN also has a bunch of other

[02:06] features like their threat protection

[02:07] and dark web monitor features to protect

[02:09] you against phishing attacks, password

[02:12] leaks, malware, ransomware, and the

[02:14] like. I mainly use NordVPN for security

[02:16] but sometimes I can also use it to watch

[02:17] shows that are only available in certain

[02:19] other countries or get plane tickets for

[02:21] cheaper prices in other areas of the

[02:23] world. For example, if I want to connect

[02:25] to Japan's internet, I just click on

[02:27] Tokyo and I'm there. So, NordVPN is

[02:30] offering an exclusive deal if you go to

[02:32] nordvpn.com/gonkey

[02:34] where you can get a 2-year plan with

[02:35] extra months for free with a 30-day

[02:37] money back guarantee. I'll leave that

[02:39] link in the description and pinned

[02:40] comment below. Again, that's

[02:42] nordvpn.com/gonkey.

[02:44] Now, let's get started. Deep learning is

[02:46] all about neural networks. So, of

[02:48] course, there's many different ways that

[02:49] neurons can be connected to each other.

[02:51] The most basic type of neural network

[02:53] consists of what are known as fully

[02:55] connected layers, where every neuron in

[02:57] each layer is connected to every neuron

[03:00] in the next layer. But in this video

[03:02] we'll find that the process of image

[03:04] generation with Stable Diffusion is

[03:06] largely dependent on two special types

[03:08] of network layers, each serving a very

[03:11] important role. Here, I'm going to

[03:13] introduce the first one called the

[03:14] convolutional layer. Pay attention

[03:16] because the second type of layer will

[03:18] come along much later in the video. And

[03:19] the way that it relates to convolutional

[03:22] layers is kind of amazing. You see

[03:23] basic fully connected layers work well

[03:26] for many different types of data, but

[03:27] not images because images have way too

[03:30] many pixels. Imagine you wanted to do an

[03:32] operation on a 100x100 image, which

[03:35] outputs a new 100x100 image. Even if the

[03:39] image was black and white and only had

[03:40] one channel, that's 100 * 100 equals

[03:43] 10,000 pixels. So, it's 10,000 inputs

[03:46] and 10,000 outputs, which means there's

[03:49] 10,000 * 10,000 equals 100 million

[03:52] neuron connections just for a 100x100

[03:55] image. And in addition, in a fully

[03:56] connected layer, each input contributes

[03:59] equally to each output. Which means the

[04:01] relative spatial position of each pixel

[04:03] is irrelevant, which kind of doesn't

[04:06] make sense because obviously in an

[04:07] image, pixels that are closer to each

[04:09] other are more important in making up

[04:11] features such as an edge compared to two

[04:13] random pixels that are really far away

[04:14] from each other. So for images, a better

[04:16] type of layer is the convolutional layer

[04:19] where each output pixel is determined by

[04:21] a grid of all the surrounding input

[04:23] pixels. And this is done with a 2D grid

[04:26] of numbers called a kernel. Usually with

[04:28] a size of like 3x3 or 5x5 where the

[04:32] output pixel is determined by

[04:34] multiplying the surrounding input pixels

[04:36] by the corresponding number in the

[04:37] kernel and then adding everything up.
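The multiply-and-add operation just described can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the video: the function name `convolve2d`, the test image, and the Sobel-style edge kernels are assumptions chosen for the example, since the exact kernels shown on screen are not in the transcript.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (both lists of lists), multiply and sum.

    No padding is used, so the output shrinks by (kernel size - 1) in each
    dimension. As in most deep learning libraries, the kernel is not
    flipped, so strictly speaking this is a cross-correlation.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Illustrative Sobel-style kernels (an assumption, not taken from the video).
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]
HORIZONTAL_EDGE = [
    [-1, -2, -1],
    [0, 0, 0],
    [1, 2, 1],
]

# A 4x6 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

edges = convolve2d(image, VERTICAL_EDGE)
flat = convolve2d(image, HORIZONTAL_EDGE)
print(edges)  # non-zero only where the window straddles the dark/bright boundary
print(flat)   # all zeros: this image has no horizontal edges
```

The same nine kernel weights are reused at every position, which is where the parameter savings over a fully connected layer come from.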

[04:40] For example, here's a vertical edge

[04:42] detection kernel and here's a horizontal

[04:45] edge detection kernel. You should start

[04:47] to see why convolutions work so well for

[04:49] images. If we have a 5x5 kernel instead

[04:51] of a fully connected layer for a 100 by

[04:53] 100 image, that's only 25 parameters

[04:56] that we can reuse over and over again

[04:58] instead of 100 million. So now that you

[05:00] understand how convolutions work, it's

[05:02] time to talk about its significance to

[05:04] computer vision, which is basically the

[05:06] field of identifying what's in an image.
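Before moving on, the parameter comparison above can be checked with simple arithmetic. This is a rough sketch that ignores bias terms; the function names are made up for illustration.

```python
def fully_connected_connections(num_inputs, num_outputs):
    """Every input connects to every output, one weight per connection."""
    return num_inputs * num_outputs


def conv_kernel_weights(kernel_size):
    """A single-channel square kernel, reused at every image position."""
    return kernel_size * kernel_size


pixels = 100 * 100  # one-channel 100x100 image
fc_connections = fully_connected_connections(pixels, pixels)
conv_weights = conv_kernel_weights(5)

print(fc_connections)                  # 100000000
print(conv_weights)                    # 25
print(fc_connections // conv_weights)  # 4000000
```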

[05:10] Level one of computer vision is simply

[05:12] image classification, where the network

[05:14] just labels what is in an image. We have

[05:17] to assume there's only one object in the

[05:19] image and we don't know where exactly it

[05:21] is but we know what it is. Now level two

[05:24] is classification with localization

[05:27] where we can also only have one object

[05:29] but the network also gives us a bounding

[05:31] box which tells us where it is in the

[05:33] image. Level three is object detection.

[05:36] So now the image can have multiple

[05:38] objects and we get multiple bounding

[05:40] boxes and labels around each of them.

[05:42] But it's still a very rough estimate of

[05:44] what pixels in the image is that object

[05:46] because all the bounding boxes are just

[05:48] rectangles. So it's at level four which

[05:51] is semantic segmentation that each pixel

[05:54] in the image gets labeled for what it

[05:56] is. Now we can have the exact shape of

[05:58] whatever it is in the image that we want

[06:00] to identify and this is good for things

[06:02] like background removal. Level five is

[06:04] instance segmentation where not only

[06:07] does the program classify what thing

[06:08] each pixel is, it can also identify

[06:11] multiple instances of that thing. Like

[06:13] if there's multiple people in a picture.

[06:16] The invention of Stable Diffusion starts

[06:18] with level four semantic segmentation

[06:21] and specifically for biomedical images. So

[06:24] we're talking about images of cells

[06:26] neurons, blood vessels, organs and

[06:28] whatnot. This is helpful for

[06:30] diagnosing diseases, researching anatomy

[06:33] stuff. Okay, I'm going to be honest.

[06:35] It's not that important why biomedical

[06:37] image segmentation is important. All

[06:38] that you need to know is that people

[06:40] were trying to segment images of cells.

[06:42] And if you're thinking, what on earth

[06:43] does this have to do with image

[06:45] generation? Well, I promise you it's

[06:47] going to make sense in a bit. And that's

[06:48] when the genius comes. For a while

[06:50] image segmentation was inefficient and

[06:52] required thousands and thousands of

[06:54] training samples. For biomedical image

[06:57] tasks, a lot of times there weren't

[06:58] enough images. Or at least that was the

[07:00] case until 2015 when a group of computer

[07:03] scientists submitted a paper proposing a

[07:05] new network architecture which would

[07:07] then go on to be cited by over 60,000

[07:10] people. This is definitely one of the

[07:12] more influential breakthroughs in

[07:14] machine learning. Let's talk about the

[07:16] UNet.

[07:18] A UNet is full of convolutional layers

[07:20] in order to do semantic segmentation on

[07:22] cell images. But it's kind of weird in

[07:24] how it does it because it first scales

[07:26] down the image to a really low

[07:28] resolution and then it scales it back up

[07:31] to its original resolution. That sounds

[07:33] counterintuitive at first, but it's kind

[07:35] of genius. And I'm going to demonstrate

[07:37] with a UNet that I wrote myself. I wrote

[07:39] this UNet for this fish data set on

[07:41] Kaggle from which I acquired 500 images

[07:44] of different fish at a fish market along

[07:47] with their corresponding black and white

[07:48] masks of what pixels in the image make

[07:51] up the fish. If you don't know, Kaggle

[07:53] is a website dedicated to data science

[07:55] and machine learning. And I'll leave a

[07:56] link to the data set in the description.

[07:58] So, the black and white masks are what's

[08:00] known as the ground truth because that's

[08:02] what we are comparing the network's

[08:04] outputs against and training the network

[08:06] to try to achieve. At first, the UNet's

[08:09] output when you give it this fish is not

[08:11] so meaningful, but after a bit of

[08:13] training,

[08:18] it's able to identify the shape a lot

[08:20] better. The colors are like this because

[08:22] the values are outside the range of 0 to

[08:25] 1. So the image is rendered in what's

[08:26] known as pseudo color. But if we clamp

[08:29] it to between 0 and one, the final

[08:31] output is essentially the same as the

[08:32] provided masks. Now that we have a

[08:34] trained network, it's time to open it up

[08:36] to see what's inside and figure out how

[08:38] does a UNet segment images so

[08:40] efficiently. Remember, prior methods

[08:42] required thousands of sample images, but

[08:44] I've only given this one 500 images and

[08:47] it's doing pretty well. So when this RGB

[08:49] image of a fish gets inputted into the

[08:51] UNet, it's represented in computer

[08:53] memory as a 3D grid of numbers because

[08:56] it has a width, height, and three

[08:58] channels. So this is a three-dimensional

[09:00] tensor in machine learning language. Now

[09:03] at the start, the image only has three

[09:05] channels to represent redness,

[09:07] greenness, and blueness. But what if it

[09:10] could have more channels to represent

[09:12] more information like what part of the

[09:15] image corresponds to the body of the

[09:17] fish, what part is the cutting board,

[09:19] what part's the shadow, what part's the

[09:20] highlights, and so on. So that's

[09:22] essentially the whole point of

[09:23] convolutions. It's to extract features

[09:26] from an image from how the pixels relate

[09:28] to each other. And what makes

[09:30] convolutions even more powerful is when

[09:32] there's more channels in the image than

[09:34] just one channel, because then the

[09:36] kernel is a 3D grid instead of just a 2D

[09:38] one.
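To make the 3D kernel concrete, here is a small sketch of a multi-channel convolution at a single output position, plus a weight count for a whole layer. The names `conv_at_position` and `conv_layer_weights`, the patch values, and the channel counts passed in are illustrative assumptions; bias terms are ignored.

```python
def conv_at_position(patch, kernel):
    """Compute one output value from a multi-channel patch.

    `patch` and `kernel` are both indexed as [channel][row][col] and have
    the same shape, so the kernel is a 3D grid that spans every input
    channel at once.
    """
    return sum(
        patch[c][i][j] * kernel[c][i][j]
        for c in range(len(kernel))
        for i in range(len(kernel[0]))
        for j in range(len(kernel[0][0]))
    )


def conv_layer_weights(in_channels, out_channels, kernel_size=3):
    """One 3D kernel per output channel, each `in_channels` layers deep."""
    return in_channels * kernel_size * kernel_size * out_channels


# A 3x3 RGB patch: red = 1 everywhere, green = 2, blue = 3.
patch = [[[value] * 3 for _ in range(3)] for value in (1, 2, 3)]
# A kernel that adds up red, ignores green, and subtracts blue.
kernel = [[[weight] * 3 for _ in range(3)] for weight in (1, 0, -1)]

print(conv_at_position(patch, kernel))  # 9*1 + 0 - 9*3 = -18
print(conv_layer_weights(3, 64))        # 1728
print(conv_layer_weights(64, 128))      # 73728
```

Each output channel has its own 3D kernel, so the number of kernels sets the number of channels coming out of the layer.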

[09:40] The first half of the UNet has all these

[09:41] convolutional blocks that make the

[09:43] number of channels in the image go from

[09:45] 3 to 64 to 128 to 256 to 512 and finally

[09:51] to 1,024.

[09:53] In the convolution from 64 to 128

[09:56] channels, for example, each kernel is 64

[09:59] layers deep and there's 128 of those

[10:01] kernels. That's how the network can

[10:04] extract more and more complex features

[10:06] from the image. Slight issue though.

[10:08] Even though the kernels get deeper and

[10:09] deeper, they still have a fixed field of

[10:12] view on the image. In this case, a 3x3

[10:15] field of view. In order to better

[10:17] extract features from the image

[10:18] obviously the kernels are going to have

[10:20] to see more of the image. So, how can we

[10:22] make the field of view bigger? Well

[10:25] just making the kernels bigger rapidly

[10:27] increases the number of parameters that

[10:29] we have, which makes it inefficient. So

[10:30] the UNet uses a really smart and

[10:32] efficient alternative. If we can't make

[10:35] the kernels bigger, then just make the

[10:38] image smaller. So after every two

[10:40] convolutional blocks, the image gets

[10:42] scaled down before it goes into the next

[10:44] two convolutional blocks. This increased

[10:46] field of view is how the network can

[10:48] capture more context within the image to

[10:51] better understand it. So let's see what

[10:53] our fish has turned into in the middle

[10:55] of the UNet, where there's the most

[10:56] number of channels, but the resolution

[10:58] is the smallest. Out of the 1,024

[11:01] channels, we can see that some of them

[11:03] highlight the body of the fish, some of

[11:05] them the background, some of them

[11:07] highlight the brighter area above the

[11:08] fish, and some of them the darker area

[11:11] below. Just as we said before. So, at

[11:14] this point, the network has learned all

[11:16] the information on what is in the image

[11:18] but the downscaling has made it lose

[11:20] information on where it is in the image.
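The trade-off described here can be sketched in plain Python. The transcript only says the image is "scaled down"; the sketch assumes 2x2 max pooling, which is what the original U-Net paper uses. The helper names and the footprint estimate are illustrative, and the footprint ignores the extra growth contributed by earlier convolutions, so treat it as a rough lower bound.

```python
def max_pool_2x2(image):
    """Halve height and width by keeping the max of each 2x2 block."""
    return [
        [
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, len(image[0]) - 1, 2)
        ]
        for r in range(0, len(image) - 1, 2)
    ]


def footprint_in_original_pixels(kernel_size, num_downsamples):
    """Rough side length, in original pixels, that one kernel spans
    after the image has been halved `num_downsamples` times."""
    return kernel_size * 2 ** num_downsamples


# Two images with a bright pixel in different corners of the same 2x2 block.
a = [[0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
b = [[0, 0, 0, 0], [9, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

print(max_pool_2x2(a))  # [[9, 0], [0, 0]]
print(max_pool_2x2(b))  # [[9, 0], [0, 0]]  (same result: exact position is lost)

# A fixed 3x3 kernel sees more of the original image after each halving.
for halvings in range(5):
    print(halvings, footprint_in_original_pixels(3, halvings))
```

Both effects show up at once: the kernel covers a wider area of the original image without gaining parameters, while the pooled output can no longer tell where inside each block the feature was.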

[11:23] So in the second half of the UNet, we

[11:25] start scaling it back up again and

[11:27] decrease the number of channels using

[11:28] these convolutional blocks to kind of

[11:31] consolidate and summarize up all that

[11:33] information that we gathered in the

[11:34] first half. But how do we get back all

[11:36] the lost detail from the downsampling in

[11:38] the first half? The answer is what's

[11:40] known as residual connections where

[11:42] every time the resolution is increased

[11:45] the information from the previous time

[11:47] the image was that resolution is

[11:49] literally just slapped onto the back and

[11:50] combined with it. And then the

[11:52] convolutional layers mix the information

[11:54] back in. If we compare the fish image at

[11:56] its highest resolution in the beginning

[11:59] to where it's at its highest resolution

[12:01] in the end, we can see that the

[12:02] different parts of the image are much

[12:04] better segmented. And that's how through

[12:07] one final convolution, we get this very

[12:09] clean mask. Yeah. So, UNets are really

[12:13] good at segmenting images. There was

[12:14] this international image segmenting

[12:16] competition where the people who

[12:18] invented the UNet just went in there and

[12:20] demolished everyone. Here's them getting

[12:21] the award for it. What a bunch of nerds

[12:23] to be honest. No, I'm just kidding. I

[12:25] mean that in an endearing way. But

[12:27] anyways, okay. When are we actually

[12:29] going to get to the image generation?

[12:31] We're getting there. Listen up. The UNet

[12:33] is so good at identifying things within

[12:36] an image that people started using it

[12:38] for other stuff other than semantic

[12:40] segmentation. Specifically, it could be

[12:42] used to denoise an image. If a noisy

[12:44] image is just the sum of the original

[12:46] image plus some noise, then if you

[12:49] identify the noise in the image, then

[12:52] you can just minus it away to get the

[12:54] original. In fact, that's exactly what

[12:56] we're going to try to do. So, allow me

[12:58] to demonstrate with another image of a

[13:00] fish. This time with a resolution of 64x

[13:03] 64. This time, there is no black and

[13:06] white ground truth mask to go with it.

[13:09] Instead, we generate a bunch of noise to

[13:10] be our ground truth because that's what

[13:13] we're trying to train the network to

[13:14] identify. It's important that during

[13:16] training, we train on many copies of the

[13:18] image with different amounts of noise

[13:20] added in so that it's able to

[13:22] denoise really noisy images as well as

[13:25] not so noisy ones. And here's where an

[13:28] interesting challenge arises. How do we

[13:30] provide the network with the knowledge

[13:32] of how noisy each image sample is?

[13:35] Because that's obviously going to affect

[13:36] the outcome. So if you imagine all the

[13:39] possible noise levels placed in a

[13:41] sequence, the information here of how

[13:44] noisy any sample is is basically a

[13:47] number of that sample's position in that

[13:50] sequence. So this is called positional

[13:52] encoding. So let's say for this

[13:54] particular image, its noise level

[13:56] corresponds to the 10th position in the

[13:58] sequence. Now we've got a 64x 64 image

[14:01] with three channels, meaning there's

[14:03] 12,288

[14:05] numbers in total. Do we just slap a 10

[14:08] on the end making it 12,289 numbers? Is

[14:11] that going to work? No. So, here's how

[14:13] positional encoding works. And I get it.

[14:15] You might be thinking, okay, this seems

[14:17] like not such a significant detail. Why

[14:20] do we need to go through it? This might

[14:21] be like the fifth time I've said this

[14:23] but it's going to come up later again.

[14:25] It's going to be important. Positional

[14:26] encoding is a type of embedding which is

[14:29] when you take discrete variables like

[14:32] words, hint later on, or in this case

[14:35] positions in a sequence and turn it into

[14:38] a vector of continuous numbers to feed

[14:41] to the network as a more digestible form

[14:44] of information that it can then use. The

[14:47] way that our 10 gets converted into a

[14:49] vector of continuous numbers is using

[14:51] these sine and cosine equations here. So

[14:53] that the vector of numbers always stays

[14:55] within a fixed range. But each position

[14:57] is encoded by a unique combination of

[15:00] numbers in the vector since the

[15:02] different elements of the vector are

[15:03] given by sine and cosine functions of

[15:05] different frequencies. And then it gets

[15:07] added onto the image data repeatedly at

[15:10] every point in the UNet where it changes

[15:12] resolution to really drill in the

[15:14] information of how much noise is in the

[15:17] image just to help the network get it

[15:19] right. Okay, that was an information

[15:22] overload, but we can finally start the

[15:24] training process. And you can see that

[15:25] at first the noise that the network

[15:27] predicts is obviously way off. It looks

[15:29] nothing like the actual noise that we

[15:30] gave it. But after a while, it looks

[15:32] pretty similar to the ground truth

[15:33] noise. And we can't really notice any

[15:36] improvements anymore. So let's instead

[15:38] show the denoised version where we minus

[15:41] this prediction to get an image and see

[15:44] how that improves.
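For reference, the sine and cosine encoding of the noise level described a moment ago can be sketched as follows. The exact equations shown on screen are not in the transcript, so this assumes the standard sinusoidal formulation from the Transformer literature; the function name, the vector size of 8, and the base of 10000 are illustrative choices.

```python
import math


def positional_encoding(position, dim=8, base=10000.0):
    """Turn an integer position (here, a noise level) into a vector.

    Elements come in sine/cosine pairs. Each pair uses a different
    frequency, so every value stays within [-1, 1] while the combination
    of values differs from one position to the next.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    vector = []
    for i in range(dim // 2):
        frequency = 1.0 / (base ** (2 * i / dim))
        vector.append(math.sin(position * frequency))
        vector.append(math.cos(position * frequency))
    return vector


# Noise level at the 10th position in the sequence, as in the example above.
encoding = positional_encoding(10)
print([round(value, 3) for value in encoding])
```

In a real network this vector is typically passed through a small learned layer before being added to the image features; that step is left out here.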

[15:50] So yes, as you can see, we do end up

[15:53] with the original fish image, but it is

[15:55] kind of low quality and blurry. This is

[15:57] because trying to go from pure noise to

[16:00] the original image all in one step is

[16:02] too hard. So instead, what we're going

[16:04] to do is we don't get rid of all the

[16:07] noise at once. We only get rid of some

[16:09] of it. And then we feed it back into the

[16:11] network again and get rid of a little

[16:13] bit more. And then again, and then

[16:15] again. And as you can see, removing the

[16:17] noise in small baby steps like this

[16:19] eventually gives us the clear, high

[16:22] quality original image. And this is why

[16:24] we've had to feed the network images of

[16:26] varying degrees of noise to train on

[16:28] because that's how it can work for the

[16:30] whole process of denoising.
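The "baby steps" idea can be sketched with a toy loop. This is a simplified illustration under loud assumptions: `make_toy_noise_predictor` is a hypothetical stand-in for the trained UNet that cheats by already knowing the clean image, and the fixed step fraction is made up. Real diffusion samplers follow a noise schedule and differ in the details.

```python
import random


def add_noise(image, noise):
    """A noisy image is the original plus noise, value by value."""
    return [pixel + n for pixel, n in zip(image, noise)]


def make_toy_noise_predictor(clean_image):
    """Hypothetical stand-in for the trained UNet (not a real model)."""
    def predict_noise(noisy_image, noise_level):
        # A trained UNet would use noise_level (via its encoding);
        # this toy version ignores it.
        return [x - c for x, c in zip(noisy_image, clean_image)]
    return predict_noise


def denoise_in_steps(noisy_image, predict_noise, num_steps=50, fraction=0.1):
    """Remove only a fraction of the predicted noise at each step."""
    image = list(noisy_image)
    for noise_level in range(num_steps, 0, -1):
        predicted = predict_noise(image, noise_level)
        image = [x - fraction * n for x, n in zip(image, predicted)]
    return image


random.seed(0)
clean = [0.2, 0.8, 0.5, 0.1]                # a tiny 4-pixel "image"
noise = [random.gauss(0.0, 1.0) for _ in clean]
noisy = add_noise(clean, noise)

restored = denoise_in_steps(noisy, make_toy_noise_predictor(clean))
print([round(value, 3) for value in restored])
```

Each pass shrinks the remaining noise a little, so the loop converges toward the clean values instead of trying to jump there in one step.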

[16:33] Now, it's obvious that the network's

[16:34] going to give us the same fish all the

[16:36] time because we only trained it on one

[16:38] image. But allow me to demonstrate what

[16:40] happens when I take 5,000 32x32 images

[16:44] of ships from the famous CIFAR-10 data

[16:47] set. The CIFAR-10 data set has 10

[16:50] classes corresponding to 10 different

[16:52] objects, each with thousands of 32x32

[16:55] images. Now, I'm not going to lie, at

[16:57] first I accidentally put all 10 classes

[16:59] of images into the network, so it was

[17:01] going 10 times slower than it needed to

[17:02] be, and I stopped it early. But then I

[17:04] looked at some of the results and while

[17:06] most of it was nonsense, here is what I

[17:09] believe to be a red panda wearing

[17:11] sunglasses and using a green tent as a

[17:14] turtle shell. And this I think is an

[17:17] orange boat wearing ice skating shoes

[17:20] with a mohawk.

[17:22] I don't know how this happens. Maybe

[17:23] that's the magic of AI. But anyways, I

[17:25] started training it again on the ship

[17:27] images. And at first it was giving us

[17:29] rubbish. But eventually we can see that

[17:30] it comes up with completely new images

[17:32] of ships using the knowledge that it's

[17:34] gathered. And that my friend is called a

[17:38] diffusion model. Now maybe it's come to

[17:40] your attention that what we have so far

[17:42] is not very efficient. I mean it took

[17:44] forever to train on these 32x32 images.

[17:47] Imagine how long it's going to take for

[17:49] an HD or a 4K image. And the reason is

[17:51] we're doing the noise prediction and

[17:53] denoising directly on the pixels right

[17:55] now. And there's a lot of pixels meaning

[17:57] a lot of data. So let's think about

[17:59] whether there's a way we can reduce the

[18:01] amount of data that we have to work with

[18:02] to speed up this process. Imagine this.

[18:05] I show you this image right here and you

[18:07] have to tell your friend what's in the

[18:09] image. Are you going to read out all the

[18:11] values of the individual pixels to your

[18:13] friend in order to transfer the

[18:15] information over? No. You'll tell them a

[18:17] description of blue sofa in a white room

[18:20] with a cactus to its right and a coffee

[18:22] table in front. And then they can use

[18:24] their life experience and knowledge of

[18:26] different objects to kind of imagine and

[18:28] reconstruct what it's roughly supposed

[18:30] to look like. It's not going to look

[18:32] exactly the same, but it's going to be

[18:34] good enough. Let's use another example

[18:37] more relevant to computers. Usually, we

[18:39] don't store images as just their raw

[18:41] uncompressed pixel values, and instead

[18:43] we use a file format like JPEG, which

[18:45] can reduce the amount of data by many

[18:47] many times. And then when the file gets

[18:49] decoded to display on your screen, it's

[18:52] a bit lower quality than the original

[18:53] but again, it's good enough. Now, notice

[18:56] how in both of these examples, there's a

[18:58] process of encoding, which is you coming

[19:01] up with a phrase to describe the image

[19:03] or the JPEG compression. And then

[19:05] there's a process of decoding, which is

[19:07] your friend imagining what the image is

[19:09] supposed to look like, or the JPEG

[19:12] decompression. So what people invented

[19:14] is a neural network equivalent of this

[19:17] known as autoencoders, which are

[19:19] basically trained to encode data into

[19:22] what's known as a latent space and then

[19:25] decode it as best they can back to the

[19:28] original data. Here's a demo of a latent

[19:31] space that's been trained for the MNIST

[19:33] digit dataset. In this case, the 28x28

[19:37] equals 784 pixels got encoded into just

[19:42] two numbers, which means we can

[19:43] visualize it as a two-dimensional space

[19:46] and drag around this point to see what

[19:47] the different areas correspond to when

[19:49] it's decoded. Now, of course, if it's

[19:52] five or 10 numbers in the latent space

[19:53] instead of two, you can get higher

[19:55] fidelity reconstructions. In stable

[19:58] diffusion, RGB images which are 512x512,

[20:00] corresponding to about 786,000

[20:04] numbers, are encoded into a latent space

[20:07] that's 4x64x64, which equals about 16,000 numbers.
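
To make those sizes concrete, here is a quick sketch of the arithmetic. It only uses the dimensions quoted in the video (512x512 RGB pixels, a 4x64x64 latent); the variable names are made up for illustration.

```python
# Sizes quoted in the video: a 512x512 RGB image and a 4x64x64 latent.
pixel_values = 512 * 512 * 3    # height * width * colour channels
latent_values = 4 * 64 * 64     # latent channels * height * width

# How many times smaller the latent is than the raw pixel data.
ratio = pixel_values / latent_values

print(pixel_values)   # 786432
print(latent_values)  # 16384
print(ratio)          # 48.0
```

So the latent holds roughly one fiftieth as many numbers as the raw image.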

[20:13] Now that's about 1/50th of the original amount

[20:15] of data. So instead of directly adding

[20:18] noise to images in their pixel space and

[20:21] then denoising those images, the images

[20:23] are first encoded into this latent space

[20:27] and then we noise and denoise that,

[20:30] and then when it's decoded we roughly

[20:32] get the original image again. This is

[20:35] called a latent diffusion model and it's

[20:38] one of the key improvements to the basic

[20:40] diffusion model because it's so many

[20:42] times faster than running denoising on

[20:45] the raw uncompressed data. Now, up until

[20:47] this point, we still haven't addressed

[20:49] something very important. How do we make

[20:51] it generate images based on a text

[20:54] prompt? Not going to lie, I think this

[20:56] might be where it gets really hard to

[20:58] understand. So, if you don't get

[21:00] anything from here on out, don't stress

[21:01] about it. Because if you made it this

[21:02] far in the video, that's already pretty

[21:04] impressive. But anyways, here's where we

[21:06] use that embedding concept from earlier.

[21:09] All these words, which are discrete

[21:11] variables, have to be encoded into

[21:13] vectors, just like the sequence position

[21:15] numbers representing how noisy the

[21:17] images are. The way that people found

[21:19] good embeddings for words is this method

[21:22] called word2vec. I won't go into too

[21:24] much detail, but basically they had a

[21:27] list of a bunch of vectors, one for each

[21:30] word in the English language. And

[21:32] actually, they had two of these lists.

[21:35] Then they used a huge amount of text

[21:37] that's been written by humans, from

[21:39] books and the internet and whatnot, to

[21:41] try to adjust these two lists of word

[21:43] vectors so that the vector of a word in

[21:46] one list would be similar to vectors of

[21:49] words that it often appears next to in

[21:52] the other list. Similar meaning it has a

[21:55] larger dot product. So for example, the

[21:57] words tall building are more likely to

[22:00] appear together than the words tall

[22:02] electricity. So if you look at the

[22:04] vector for tall in one list, it'll be

[22:07] similar to the vector for building in

[22:10] the other list, but not similar to the

[22:13] vector for electricity in the other

[22:15] list. So once you've trained it enough

[22:17] the relationship between word vectors in

[22:19] opposite lists is that the more likely

[22:22] they are to appear next to each other

[22:24] the more similar they are. But what does

[22:27] this mean for the relationships between

[22:29] words in the same list? In the same

[22:31] list, the more likely they appear in

[22:34] similar contexts, the more similar they

[22:36] are. So, if we just take one of the

[22:39] lists as the embedding vectors for all

[22:41] the words and graph it out, we'll find

[22:44] that words that are used in similar

[22:46] contexts are grouped closer together.
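
Here is a tiny sketch of the "two lists" idea, using the tall / building / electricity example. The three-number vectors are hand-written assumptions purely for illustration; real word2vec vectors are learned from text and have hundreds of dimensions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical vectors, NOT real word2vec values.
# One list holds the word we are looking at, the other holds its neighbours.
first_list = {"tall": [0.9, 0.1, 0.0]}
second_list = {
    "building": [0.8, 0.2, 0.1],
    "electricity": [-0.1, 0.0, 0.9],
}

tall = first_list["tall"]
score_building = dot(tall, second_list["building"])
score_electricity = dot(tall, second_list["electricity"])

# A larger dot product means "more likely to appear next to each other".
print(score_building > score_electricity)  # True
```

Training nudges the numbers in both lists until scores like these line up with how often the words really do appear together.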

[22:49] Here's a cool visualization on

[22:51] projector.tensorflow.org

[22:53] where if you click a dot representing a

[22:55] word, it shows you the closest words

[22:57] around it. Of course, it seems as though

[22:59] they're kind of spread out and they're

[23:00] not really the closest words, but that's

[23:03] because this only visualizes three

[23:04] dimensions, while the word vectors

[23:06] actually have 300 dimensions, meaning

[23:09] each word is represented by 300 numbers.

[23:11] With all those dimensions, it turns out

[23:13] these word embeddings can actually

[23:15] capture some of the more nuanced

[23:17] relationships between words. And the

[23:19] most famous example is if you take the

[23:21] vector for king and you subtract the

[23:24] vector for man but then you add the

[23:26] vector for woman you end up with the

[23:30] vector for queen. Another example is you

[23:32] can take London subtract England add

[23:36] Japan and then you end up with Tokyo. So

[23:39] I hope you can see how genius this word

[23:42] embedding vector space is. Now remember

[23:44] how earlier I said there's two types of

[23:46] network layers that are really important

[23:47] to stable diffusion. And the first one

[23:49] is the convolutional layer. Well, it's

[23:52] time to introduce the second one which

[23:54] is called the self attention layer.
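
Before moving on, the king / man / woman / queen arithmetic from a moment ago can be sketched with toy vectors. These two-number vectors are assumptions invented for the example (first number roughly "royalty", second roughly "gender"); real embeddings are learned and only land near the answer, not exactly on it.

```python
import math

# Hypothetical 2-D embeddings, hand-picked so the arithmetic is easy to follow.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def nearest_word(target, exclude):
    """Return the word whose vector is closest to `target` (Euclidean distance)."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

# king - man + woman, computed coordinate by coordinate.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

answer = nearest_word(target, exclude={"king", "man", "woman"})
print(answer)  # queen
```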

[23:56] Let's think back to convolutions for a

[23:58] second. Convolutional layers extract

[24:00] features from an image using

[24:02] relationships between pixels where the

[24:05] amount that each pixel influences

[24:07] another pixel is dependent on their

[24:09] relative spatial position. So, a self-

[24:12] attention layer extracts features from a

[24:14] phrase using the relationships between

[24:15] the words where the amount of influence

[24:18] words have on each other is determined

[24:20] by their embedding vectors. To build up

[24:22] the simplest possible self attention

[24:25] layer, it's kind of like a fully

[24:26] connected layer, but each input and

[24:29] output is a vector instead of a single

[24:30] number. And the weights of connections

[24:32] are not parameters that it tries to

[24:34] learn. Rather the weight of the

[24:37] connection between A and B is determined

[24:40] by the dot product between A and B. So in

[24:43] the simplest model the output is

[24:44] entirely dependent on the input since

[24:46] there's no parameters that we can

[24:48] control. But we do want to control it

[24:50] obviously like if we have a convolution

[24:52] where the kernel just has all the same

[24:54] numbers then yeah sure it helps us

[24:56] understand how a convolution works but

[24:58] all it does is just blur the image and

[25:00] it's not that useful. So how can we

[25:02] control this self attention layer so

[25:04] that just like we make a convolution

[25:06] detect edges for example, we make it

[25:09] detect, I don't know, words that negate

[25:11] or emphasize certain adjectives.
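
The simplest, parameter-free version described above can be sketched like this. One assumption that is not in the video: the raw dot products are passed through a softmax so the connection weights are positive and sum to 1, which is the usual convention. The input vectors are made up.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(vectors):
    """Each output is a weighted mix of all inputs; weights come from dot products."""
    outputs = []
    for b in vectors:
        raw_scores = [dot(b, a) for a in vectors]  # no learned parameters anywhere
        weights = softmax(raw_scores)
        mixed = [
            sum(w * a[i] for w, a in zip(weights, vectors))
            for i in range(len(b))
        ]
        outputs.append(mixed)
    return outputs

# Three hypothetical 2-D "word" vectors.
words = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs = simple_self_attention(words)

# The raw score between two words is the same in both directions.
print(dot(words[0], words[2]) == dot(words[2], words[0]))  # True
```

Nothing here can be trained, which is exactly the limitation the next part addresses.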

[25:14] Let's break this attention process down

[25:16] into its components. In our simple

[25:18] attention layer, the amount that A's

[25:21] input influences B's output is

[25:25] determined by the dot product. And the

[25:26] amount that B's input influences A's

[25:29] output is also determined by the dot

[25:32] product. So it's the same. Now let's

[25:34] focus on the part where A influences B.

[25:37] I'm going to characterize this process

[25:38] as a conversation between A and B, which

[25:42] sounds hella goofy, but I promise it'll

[25:44] make sense. So B goes up to A and says,

[25:47] "Hello, I'm B. Here is my ID. Show me

[25:51] your ID so that we can compare it and

[25:54] decide how much you influence my output."

[25:57] And then A says, "Yeah, I'm A. Here's my

[25:59] ID and here's my data which I'll pass

[26:03] over to your output after the

[26:04] comparison." So B's ID is called the

[26:07] query vector. A's ID is called the key

[26:11] vector. The process of comparison is the

[26:13] dot product between them, and A's data is

[26:16] called the value vector. There I just

[26:19] explained query, key, and value in self

[26:22] attention. In our simple self attention,

[26:24] of course, the query is just vector B

[26:26] itself and both the key and the value

[26:29] are just vector A. In other words, each

[26:32] vector has to serve a total of three

[26:34] purposes over the whole process even

[26:37] though it's the same vector the whole

[26:38] time. So here's how we introduce

[26:40] parameters to control this whole self

[26:42] attention process. How do we manipulate

[26:45] vectors using matrices?

[26:51] So, we have three matrices called, can

[26:54] you guess what they're called?

[26:56] That's right, the query matrix, the key

[26:59] matrix, and the value matrix. And these

[27:01] are applied to each vector before they

[27:04] go to serve their purpose as a query,

[27:06] key, or value. Now, the amount that A

[27:09] influences B is no longer the same as

[27:12] the amount that B influences A. This way

[27:15] it's not just similar vectors that

[27:16] influence each other anymore due to the

[27:19] simple dot product. Now we can use any

[27:21] feature in each word's massive 300

[27:25] dimension vector which is what makes

[27:26] self attention layers so powerful in

[27:29] extracting features in the relationships

[27:31] between words. One more thing though

[27:33] right now the output is still not

[27:35] affected by the relative position of

[27:37] each word in a phrase and that's really

[27:39] important in determining the meaning of

[27:41] the phrase. So in order to encode the

[27:43] position of each word, we once again use

[27:45] the positional encoding method covered

[27:47] earlier and just add on those positional

[27:50] embedding vectors to the word embedding

[27:52] vectors. Wow, that was another

[27:54] information overload. You need a break?

[27:58] Okay, let's take a break.
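
While we're paused, here is a minimal sketch of the self-attention layer just described, with query, key, and value matrices plus positional encodings added to the word vectors. Several details are assumptions that are not in the video: the softmax, the division by the square root of the key size, and the 10000 base in the sine/cosine formula all follow the common Transformer convention. The matrices and vectors are hand-written; in a real network they are learned.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def positional_encoding(position, dim):
    """Sine/cosine encoding of an integer position (assumed Transformer-style formula)."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def score(x_query, x_key, w_q, w_k):
    """How strongly x_key's word influences x_query's output, before softmax."""
    return dot(matvec(w_q, x_query), matvec(w_k, x_key))

def self_attention(vectors, w_q, w_k, w_v):
    queries = [matvec(w_q, x) for x in vectors]
    keys = [matvec(w_k, x) for x in vectors]
    values = [matvec(w_v, x) for x in vectors]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

# Hypothetical 2x2 parameter matrices.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 1.0], [0.0, 1.0]]
w_v = [[1.0, 0.0], [0.0, 1.0]]

a = [1.0, 0.0]
b = [0.0, 2.0]

# With separate query and key matrices, influence is no longer symmetric.
a_on_b = score(b, a, w_q, w_k)  # B asks, A answers
b_on_a = score(a, b, w_q, w_k)  # A asks, B answers
print(a_on_b != b_on_a)  # True

# Add each word's position to its embedding before attention.
words = [a, b]
with_positions = [
    [x + p for x, p in zip(word, positional_encoding(pos, len(word)))]
    for pos, word in enumerate(words)
]
outputs = self_attention(with_positions, w_q, w_k, w_v)
```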

[28:00] All right, let's resume. Think about

[28:02] this. Using convolutional layers, we can

[28:04] encode an image into a small embedding

[28:06] vector. And using attention layers, we

[28:10] can also encode a text phrase into a

[28:12] small embedding vector. But look at the

[28:14] image and the text that I've selected to

[28:16] display here. The caption perfectly

[28:19] describes the image. So imagine if the

[28:22] two encoders could come up with the same

[28:25] embedding vector even though they're

[28:28] dealing with two different types of

[28:29] data. So that's exactly what OpenAI

[28:33] tried to do with their CLIP text model,

[28:35] where CLIP stands for Contrastive

[28:38] Language-Image Pre-training. This CLIP

[28:40] text model has both an image encoder and

[28:43] a text encoder and they trained it on

[28:46] 400 million image-caption pairs so that images and

[28:49] captions that match are supposed to come

[28:52] out with very similar embeddings and the

[28:54] ones that don't match are supposed to

[28:56] come out with really different

[28:57] embeddings. So it kind of makes sense

[28:59] that the text embeddings that come out

[29:02] of this CLIP text model, which are

[29:04] already matched to encoded images are

[29:07] perfect to stick into our denoising UNet,

[29:10] which also encodes and decodes images as

[29:12] a part of how it works with the

[29:14] convolutions and scaling and all that.
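
The matching idea behind CLIP can be sketched with toy numbers: matching image and caption embeddings should score high, mismatched ones low. Everything below is an assumption for illustration. The embeddings are hand-written (a real model produces them with its two encoders), the names are hypothetical, and cosine similarity is used as the comparison.

```python
import math

def cosine(u, v):
    """Cosine similarity: the dot product of the two vectors after normalising length."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)

# Hypothetical outputs of an image encoder and a text encoder.
image_embeddings = {
    "photo_of_sofa": [0.9, 0.1, 0.2],
    "photo_of_ship": [0.1, 0.95, 0.1],
}
text_embeddings = {
    "a blue sofa in a white room": [0.85, 0.2, 0.1],
    "a ship at sea": [0.05, 0.9, 0.2],
}

def best_caption(image_name):
    """Pick the caption whose embedding is most similar to the image's embedding."""
    image_vec = image_embeddings[image_name]
    return max(
        text_embeddings,
        key=lambda caption: cosine(image_vec, text_embeddings[caption]),
    )

print(best_caption("photo_of_sofa"))  # a blue sofa in a white room
print(best_caption("photo_of_ship"))  # a ship at sea
```

Contrastive training is what pushes the matching pairs together and the mismatched pairs apart in the first place.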

[29:15] So yes in stable diffusion we just take

[29:17] text embeddings generated by CLIP and

[29:20] inject them into the UNet multiple times

[29:23] using attention layers. Well, this time

[29:25] a slightly different type of attention.

[29:28] This time it's not self attention which

[29:31] just operates on one set of input

[29:32] vectors. We're adding the text

[29:34] information into the image. So obviously

[29:36] it's two sets of input data. So instead

[29:39] it's a process called cross attention

[29:42] which is literally like self attention

[29:44] except the image is going to be the

[29:47] query and the text is going to be the

[29:49] key and value. That's it. These cross

[29:52] attention layers in the middle of the

[29:53] UNet are going to extract relationships

[29:55] between the image and the text. So that

[29:58] whatever features in the image can get

[30:00] influenced by the most important and

[30:02] relevant features in the text. And this

[30:04] is how we eventually train the network

[30:05] to generate images based on the text

[30:07] captions we give it. So there we have

[30:10] it. Convolutional layers learn images.

[30:12] Self attention layers learn text. And

[30:14] then when you combine the two, you can

[30:16] generate images based on text. Pretty

[30:19] interesting, is it
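
The cross-attention step described above can be sketched as follows: queries come from image features, while keys and values come from the text embeddings. This is a simplified, single-head sketch under stated assumptions. The matrices, sizes, and numbers are made up, and the softmax and square-root scaling follow the common convention rather than anything stated in the video.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_features, text_embeddings, w_q, w_k, w_v):
    """Image positions ask the questions; text tokens supply keys and values."""
    queries = [matvec(w_q, x) for x in image_features]
    keys = [matvec(w_k, t) for t in text_embeddings]
    values = [matvec(w_v, t) for t in text_embeddings]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

# Hypothetical data: 2 image positions (2 numbers each), 3 text tokens (3 numbers each).
image_features = [[1.0, 0.0], [0.0, 1.0]]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical parameters: queries map 2 -> 2, keys and values map 3 -> 2.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
w_v = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

outputs = cross_attention(image_features, text_embeddings, w_q, w_k, w_v)
print(len(outputs), len(outputs[0]))  # 2 2
```

There is one output per image position, so the text information gets blended into the image features without changing their shape.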
