[0:00] We live in a world where artists are
[0:02] losing their jobs because you can
[0:03] generate whatever piece of art you want
[0:05] with a simple text prompt within a few
[0:08] seconds that looks incredibly good. More
[0:11] than that, you can generate an image of
[0:13] anything, even things that don't exist
[0:16] in real life, just by using the right
[0:18] descriptions. What happened? Just last
[0:20] week, I spent 2 hours trying to connect
[0:22] to a wireless printer. How did the
[0:24] computers get here? This video will be
[0:26] highly technical and try to explain how
[0:28] Stable Diffusion works, which is
[0:29] currently the best method of image
[0:31] generation that we have, beating out
[0:32] older technology like generative
[0:35] adversarial networks, or GANs. Now,
[0:37] you've seen the length of the video.
[0:39] It's a long video, but if you go search
[0:40] up other machine learning videos online
[0:42] this will probably still be the least
[0:44] technical out of all of them because
[0:46] I've tried to cut out all the math to
[0:48] make everything conceptually easier to
[0:50] understand while keeping the information
[0:52] mostly accurate. Look, I know you're
[0:54] passionate and curious about this
[0:55] technology. So, I want you to try to pay
[0:57] attention. Even though, if I'm going to
[0:58] be honest, you probably won't understand
[1:00] a lot of it the first time you watch it
[1:02] through. If you can grasp the intuition
[1:03] and concepts well, and then you decide
[1:05] to look more at the math and derivations
[1:07] or to pursue a career in this field
[1:10] everything will be a lot easier to
[1:11] understand. And that's part of why I
[1:13] made this video. AI is the future. And
[1:15] you know what they say, when there's a
[1:17] gold rush, sell shovels. Now, a lot of
[1:19] people are worried about AI safety
[1:20] thinking AI is going to take over the
[1:22] world. Me personally, I'm not too
[0:24] pressed because I can't even get
[0:26] ChatGPT to solve a simple math problem
[1:28] correctly. But what I can say you should
[1:29] be worried about is cybersecurity,
[1:31] which our video partner today, NordVPN,
[1:34] wants you to learn about. The process of
[1:35] making this video involved a lot of
[1:36] research and developing neural networks
[1:38] on Google Colab, all of which uses the
[1:41] internet, and sometimes I'd be away from
[1:42] home at a public library or cafe using
[1:45] the free Wi-Fi. Now, you'd be surprised
[1:47] how easy it is to compromise these
[1:49] public networks or make a fake network
[1:51] just to steal your data. And this is
[1:53] called a man-in-the-middle attack. So
[1:55] to make sure my bank account information
[1:56] and passwords don't get stolen by a
[1:58] hacker, I used NordVPN to make sure my
[2:00] internet connection was securely
[2:02] encrypted at all times. And other than
[2:04] that, NordVPN also has a bunch of other
[2:06] features like their threat protection
[2:07] and dark web monitor features to protect
[2:09] you against phishing attacks, password
[2:12] leaks, malware, ransomware, and the
[2:14] like. I mainly use NordVPN for security
[2:16] but sometimes I can also use it to watch
[2:17] shows that are only available in certain
[2:19] other countries or get plane tickets for
[2:21] cheaper prices in other areas of the
[2:23] world. For example, if I want to connect
[2:25] to Japan's internet, I just click on
[2:27] Tokyo and I'm there. So, NordVPN is
[2:30] offering an exclusive deal if you go to
[2:32] nordvpn.com/gonkey
[2:34] where you can get a 2-year plan with
[2:35] extra months for free with a 30-day
[2:37] money back guarantee. I'll leave that
[2:39] link in the description and pinned
[2:40] comment below. Again, that's
[2:42] nordvpn.com/gonkey.
[2:44] Now, let's get started. Deep learning is
[2:46] all about neural networks. So, of
[2:48] course, there's many different ways that
[2:49] neurons can be connected to each other.
[2:51] The most basic type of neural network
[2:53] consists of what are known as fully
[2:55] connected layers, where every neuron in
[2:57] each layer is connected to every neuron
[3:00] in the next layer. But in this video
[3:02] we'll find that the process of image
[3:04] generation with Stable Diffusion is
[3:06] largely dependent on two special types
[3:08] of network layers, each serving a very
[3:11] important role. Here, I'm going to
[3:13] introduce the first one called the
[3:14] convolutional layer. Pay attention
[3:16] because the second type of layer will
[3:18] come along much later in the video. And
[3:19] the way that it relates to convolutional
[3:22] layers is kind of amazing. You see
[3:23] basic fully connected layers work well
[3:26] for many different types of data, but
[3:27] not images because images have way too
[3:30] many pixels. Imagine you wanted to do an
[3:32] operation on a 100x100 image, which
[3:35] outputs a new 100x100 image. Even if the
[3:39] image was black and white and only had
[3:40] one channel, that's 100 * 100 equals
[3:43] 10,000 pixels. So, it's 10,000 inputs
[3:46] and 10,000 outputs, which means there's
[3:49] 10,000 * 10,000 equals 100 million
[3:52] neuron connections just for a 100x100
[3:55] image. And in addition, in a fully
[3:56] connected layer, each input contributes
[3:59] equally to each output. Which means the
[4:01] relative spatial position of each pixel
[4:03] is irrelevant, which kind of doesn't
[4:06] make sense because obviously in an
[4:07] image, pixels that are closer to each
[4:09] other are more important in making up
[4:11] features such as an edge compared to two
[4:13] random pixels that are really far away
[4:14] from each other. So for images, a better
[4:16] type of layer is the convolutional layer
[4:19] where each output pixel is determined by
[4:21] a grid of all the surrounding input
[4:23] pixels. And this is done with a 2D grid
[4:26] of numbers called a kernel. Usually with
[4:28] a size of like 3x3 or 5x5 where the
[4:32] output pixel is determined by
[4:34] multiplying the surrounding input pixels
[4:36] by the corresponding number in the
[4:37] kernel and then adding everything up.
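The multiply-and-add operation just described can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the code used in the video: the function name `convolve2d`, the toy 6x6 image, and the particular vertical-edge kernel values are assumptions. Like most deep learning libraries, the sketch does not flip the kernel, and it only produces output pixels where the kernel fits entirely inside the image.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`; each output pixel is the sum of the
    surrounding input pixels multiplied by the matching kernel numbers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)
    return out

# Toy grayscale image: dark left half (0), bright right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# One common choice of vertical-edge kernel (assumed values).
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

response = convolve2d(image, vertical_edge)
print(response)
```

The response is zero in the flat regions and nonzero only where the 3x3 window straddles the dark-to-bright boundary, which is what makes this kernel an edge detector.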
[4:40] For example, here's a vertical edge
[4:42] detection kernel and here's a horizontal
[4:45] edge detection kernel. You should start
[4:47] to see why convolutions work so well for
[4:49] images. If we have a 5x5 kernel instead
[4:51] of a fully connected layer for a 100 by
[4:53] 100 image, that's only 25 parameters
[4:56] that we can reuse over and over again
[4:58] instead of 100 million. So now that you
[5:00] understand how convolutions work, it's
[5:02] time to talk about their significance to
[5:04] computer vision, which is basically the
[5:06] field of identifying what's in an image.
[5:10] Level one of computer vision is simply
[5:12] image classification, where the network
[5:14] just labels what is in an image. We have
[5:17] to assume there's only one object in the
[5:19] image and we don't know where exactly it
[5:21] is but we know what it is. Now level two
[5:24] is classification with localization
[5:27] where we can also only have one object
[5:29] but the network also gives us a bounding
[5:31] box which tells us where it is in the
[5:33] image. Level three is object detection.
[5:36] So now the image can have multiple
[5:38] objects and we get multiple bounding
[5:40] boxes and labels around each of them.
[5:42] But it's still a very rough estimate of
[5:44] which pixels in the image are that object
[5:46] because all the bounding boxes are just
[5:48] rectangles. So it's at level four which
[5:51] is semantic segmentation that each pixel
[5:54] in the image gets labeled for what it
[5:56] is. Now we can have the exact shape of
[5:58] whatever it is in the image that we want
[6:00] to identify and this is good for things
[6:02] like background removal. Level five is
[6:04] instance segmentation where not only
[6:07] does the program classify what thing
[6:08] each pixel is, it can also identify
[6:11] multiple instances of that thing, like
[6:13] if there are multiple people in a picture.
[6:16] The invention of Stable Diffusion starts
[6:18] with level four, semantic segmentation,
[6:21] and specifically for biomedical images. So
[6:24] we're talking about images of cells,
[6:26] neurons, blood vessels, organs, and
[6:28] whatnot. This is helpful for
[6:30] diagnosing diseases, researching anatomy,
[6:33] stuff. Okay, I'm going to be honest.
[6:35] It's not that important why biomedical
[6:37] image segmentation is important. All
[6:38] that you need to know is that people
[6:40] were trying to segment images of cells.
[6:42] And if you're thinking, what on earth
[6:43] does this have to do with image
[6:45] generation? Well, I promise you it's
[6:47] going to make sense in a bit. And that's
[6:48] when the genius comes. For a while
[6:50] image segmentation was inefficient and
[6:52] required thousands and thousands of
[6:54] training samples. For biomedical image
[6:57] tasks, a lot of times there weren't
[6:58] enough images. Or at least that was the
[7:00] case until 2015 when a group of computer
[7:03] scientists submitted a paper proposing a
[7:05] new network architecture which would
[7:07] then go on to be cited over 60,000
[7:10] times. This is definitely one of the
[7:12] more influential breakthroughs in
[7:14] machine learning. Let's talk about the
[7:16] U-Net.
[7:18] A U-Net is full of convolutional layers
[7:20] in order to do semantic segmentation on
[7:22] cell images. But it's kind of weird in
[7:24] how it does it because it first scales
[7:26] down the image to a really low
[7:28] resolution and then it scales it back up
[7:31] to its original resolution. That sounds
[7:33] counterintuitive at first, but it's kind
[7:35] of genius. And I'm going to demonstrate
[7:37] with a U-Net that I wrote myself. I wrote
[7:39] this U-Net for this fish data set on
[7:41] Kaggle from which I acquired 500 images
[7:44] of different fish at a fish market along
[7:47] with their corresponding black and white
[7:48] masks of what pixels in the image make
[7:51] up the fish. If you don't know, Kaggle
[7:53] is a website dedicated to data science
[7:55] and machine learning. And I'll leave a
[7:56] link to the data set in the description.
[7:58] So, the black and white masks are what's
[8:00] known as the ground truth because that's
[8:02] what we are comparing the network's
[8:04] outputs against and training the network
[8:06] to try to achieve. At first, the U-Net's
[8:09] output when you give it this fish is not
[8:11] so meaningful, but after a bit of
[8:13] training,
[8:18] it's able to identify the shape a lot
[8:20] better. The colors are like this because
[8:22] the values are outside the range of 0 to
[8:25] 1. So the image is rendered in what's
[8:26] known as pseudo color. But if we clamp
[8:29] it to between 0 and 1, the final
[8:31] output is essentially the same as the
[8:32] provided masks. Now that we have a
[8:34] trained network, it's time to open it up
[8:36] to see what's inside and figure out how
[8:38] a U-Net segments images so
[8:40] efficiently. Remember, prior methods
[8:42] required thousands of sample images, but
[8:44] I've only given this one 500 images and
[8:47] it's doing pretty well. So when this RGB
[8:49] image of a fish gets input into the
[8:51] U-Net, it's represented in computer
[8:53] memory as a 3D grid of numbers because
[8:56] it has a width, height, and three
[8:58] channels. So this is a three-dimensional
[9:00] tensor in machine learning language. Now
[9:03] at the start, the image only has three
[9:05] channels to represent redness,
[9:07] greenness, and blueness. But what if it
[9:10] could have more channels to represent
[9:12] more information like what part of the
[9:15] image corresponds to the body of the
[9:17] fish, what part is the cutting board,
[9:19] what part's the shadow, what part's the
[9:20] highlights, and so on. So that's
[9:22] essentially the whole point of
[9:23] convolutions. It's to extract features
[9:26] from an image from how the pixels relate
[9:28] to each other. And what makes
[9:30] convolutions even more powerful is when
[9:32] there's more channels in the image than
[9:34] just one channel, because then the
[9:36] kernel is a 3D grid instead of just a 2D
[9:38] one.
[9:40] The first half of the U-Net has all these
[9:41] convolutional blocks that make the
[9:43] number of channels in the image go from
[9:45] 3 to 64 to 128 to 256 to 512 and finally
[9:51] to 1,024.
[9:53] In the convolution from 64 to 128
[9:56] channels, for example, each kernel is 64
[9:59] layers deep and there's 128 of those
[10:01] kernels. That's how the network can
[10:04] extract more and more complex features
[10:06] from the image. Slight issue though.
[10:08] Even though the kernels get deeper and
[10:09] deeper, they still have a fixed field of
[10:12] view on the image. In this case, a 3x3
[10:15] field of view. In order to better
[10:17] extract features from the image,
[10:18] obviously the kernels are going to have
[10:20] to see more of the image. So, how can we
[10:22] make the field of view bigger? Well,
[10:25] just making the kernels bigger rapidly
[10:27] increases the number of parameters that
[10:29] we have, which makes it inefficient. So,
[10:30] the U-Net uses a really smart and
[10:32] efficient alternative. If we can't make
[10:35] the kernels bigger, then just make the
[10:38] image smaller. So after every two
[10:40] convolutional blocks, the image gets
[10:42] scaled down before it goes into the next
[10:44] two convolutional blocks. This increased
[10:46] field of view is how the network can
[10:48] capture more context within the image to
[10:51] better understand it. So let's see what
[10:53] our fish has turned into in the middle
[10:55] of the U-Net, where there's the largest
[10:56] number of channels, but the resolution
[10:58] is the smallest. Out of the 1,024
[11:01] channels, we can see that some of them
[11:03] highlight the body of the fish, some of
[11:05] them the background, some of them
[11:07] highlight the brighter area above the
[11:08] fish, and some of them the darker area
[11:11] below. Just as we said before. So, at
[11:14] this point, the network has learned all
[11:16] the information on what is in the image,
[11:18] but the downscaling has made it lose
[11:20] information on where it is in the image.
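The bookkeeping in the first half of the U-Net can be traced with a rough sketch in plain Python. The channel counts (3, 64, 128, 256, 512, 1,024) and the 3x3 kernels come from the description above; the 256x256 input size, the assumption that each downscale halves the width and height, and the helper name `conv_weight_count` are illustrative choices, and only the first convolution of each block (the one that changes the channel count) is counted. No actual network is built here.

```python
def conv_weight_count(in_channels, out_channels, kernel_size=3):
    """Number of kernel weights (biases ignored): one kernel per output
    channel, each kernel in_channels deep and kernel_size x kernel_size wide."""
    return out_channels * in_channels * kernel_size * kernel_size

channels = [3, 64, 128, 256, 512, 1024]
resolution = 256  # assumed input width and height

trace = []
for level, (c_in, c_out) in enumerate(zip(channels, channels[1:])):
    if level > 0:
        resolution //= 2  # assumed: each downscale halves width and height
    trace.append((c_in, c_out, resolution, conv_weight_count(c_in, c_out)))

for c_in, c_out, res, weights in trace:
    print(f"{c_in:>4} -> {c_out:<4} channels at {res}x{res}: {weights:,} weights")
```

Under these assumptions, the 64-to-128 convolution has 128 kernels that are each 64 layers deep, and the image reaches the middle of the network at one sixteenth of its original width and height.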
[11:23] So in the second half of the U-Net, we
[11:25] start scaling it back up again and
[11:27] decrease the number of channels using
[11:28] these convolutional blocks to kind of
[11:31] consolidate and summarize up all that
[11:33] information that we gathered in the
[11:34] first half. But how do we get back all
[11:36] the lost detail from the downsampling in
[11:38] the first half? The answer is what's
[11:40] known as residual (or skip) connections, where
[11:42] every time the resolution is increased,
[11:45] the information from the previous time
[11:47] the image was that resolution is
[11:49] literally just slapped onto the back and
[11:50] combined with it. And then the
[11:52] convolutional layers mix the information
[11:54] back in. If we compare the fish image at
[11:56] its highest resolution in the beginning
[11:59] to where it's at its highest resolution
[12:01] in the end, we can see that the
[12:02] different parts of the image are much
[12:04] better segmented. And that's how through
[12:07] one final convolution, we get this very
[12:09] clean mask. Yeah. So, U-Nets are really
[12:13] good at segmenting images. There was
[12:14] this international image segmenting
[12:16] competition where the people who
[12:18] invented the U-Net just went in there and
[12:20] demolished everyone. Here's them getting
[12:21] the award for it. What a bunch of nerds
[12:23] to be honest. No, I'm just kidding. I
[12:25] mean that in an endearing way. But
[12:27] anyways, okay. When are we actually
[12:29] going to get to the image generation?
[12:31] We're getting there. Listen up. The U-Net
[12:33] is so good at identifying things within
[12:36] an image that people started using it
[12:38] for stuff other than semantic
[12:40] segmentation. Specifically, it could be
[12:42] used to denoise an image. If a noisy
[12:44] image is just the sum of the original
[12:46] image plus some noise, then if you
[12:49] identify the noise in the image, then
[12:52] you can just subtract it to get the
[12:54] original. In fact, that's exactly what
[12:56] we're going to try to do. So, allow me
[12:58] to demonstrate with another image of a
[13:00] fish. This time with a resolution of
[13:03] 64x64. This time, there is no black and
[13:06] white ground truth mask to go with it.
[13:09] Instead, we generate a bunch of noise to
[13:10] be our ground truth because that's what
[13:13] we're trying to train the network to
[13:14] identify. It's important that during
[13:16] training, we train on many copies of the
[13:18] image with different amounts of noise
[13:20] added in so that it's able to denoise
[13:22] really noisy images as well as
[13:25] not so noisy ones. And here's where an
[13:28] interesting challenge arises. How do we
[13:30] provide the network with the knowledge
[13:32] of how noisy each image sample is?
[13:35] Because that's obviously going to affect
[13:36] the outcome. So if you imagine all the
[13:39] possible noise levels placed in a
[13:41] sequence, the information here of how
[13:44] noisy any sample is, is basically a
[13:47] number: that sample's position in that
[13:50] sequence. So this is called positional
[13:52] encoding. So let's say for this
[13:54] particular image, its noise level
[13:56] corresponds to the 10th position in the
[13:58] sequence. Now we've got a 64x64 image
[14:01] with three channels, meaning there's
[14:03] 12,288
[14:05] numbers in total. Do we just slap a 10
[14:08] on the end making it 12,289 numbers? Is
[14:11] that going to work? No. So, here's how
[14:13] positional encoding works. And I get it.
[14:15] You might be thinking, okay, this seems
[14:17] like not such a significant detail. Why
[14:20] do we need to go through it? This might
[14:21] be like the fifth time I've said this
[14:23] but it's going to come up later again.
[14:25] It's going to be important. Positional
[14:26] encoding is a type of embedding which is
[14:29] when you take discrete variables like
[14:32] words (hint: later on) or, in this case,
[14:35] positions in a sequence and turn them into
[14:38] a vector of continuous numbers to feed
[14:41] to the network as a more digestible form
[14:44] of information that it can then use. The
[14:47] way that our 10 gets converted into a
[14:49] vector of continuous numbers is using
[14:51] these sine and cosine equations here, so
[14:53] that the vector of numbers always stays
[14:55] within a fixed range. But each position
[14:57] is encoded by a unique combination of
[15:00] numbers in the vector since the
[15:02] different elements of the vector are
[15:03] given by sine and cosine functions of
[15:05] different frequencies. And then it gets
[15:07] added onto the image data repeatedly at
[15:10] every point in the U-Net where it changes
[15:12] resolution to really drill in the
[15:14] information of how much noise is in the
[15:17] image just to help the network get it
[15:19] right. Okay, that was an information
[15:22] overload, but we can finally start the
[15:24] training process. And you can see that
[15:25] at first the noise that the network
[15:27] predicts is obviously way off. It looks
[15:29] nothing like the actual noise that we
[15:30] gave it. But after a while, it looks
[15:32] pretty similar to the ground truth
[15:33] noise. And we can't really notice any
[15:36] improvements anymore. So let's instead
[15:38] show the denoised version where we subtract
[15:41] this prediction to get an image and see
[15:44] how that improves.
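The sine and cosine encoding of the noise level described a little earlier can be sketched as follows. The exact equations shown on screen are not reproduced in this transcript, so this sketch assumes the widely used sinusoidal formulation from Transformer models; the function name `positional_encoding`, the base of 10,000, and the vector length of 8 are assumed values chosen for illustration.

```python
import numpy as np

def positional_encoding(position, dim=8, base=10000.0):
    """Encode an integer position as `dim` numbers in [-1, 1], using
    sine/cosine pairs whose frequency decreases along the vector.
    `dim` is assumed to be even."""
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))
    angles = position * freqs
    encoding = np.empty(dim)
    encoding[0::2] = np.sin(angles)  # even slots: sine
    encoding[1::2] = np.cos(angles)  # odd slots: cosine
    return encoding

enc_10 = positional_encoding(10)
enc_11 = positional_encoding(11)
print(np.round(enc_10, 3))
print(np.round(enc_11, 3))
```

Every element stays between -1 and 1 no matter how large the position is, while neighboring positions such as 10 and 11 still get different vectors because each sine/cosine pair runs at its own frequency.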
[15:50] So yes, as you can see, we do end up
[15:53] with the original fish image, but it is
[15:55] kind of low quality and blurry. This is
[15:57] because trying to go from pure noise to
[16:00] the original image all in one step is
[16:02] too hard. So instead, what we're going
[16:04] to do is we don't get rid of all the
[16:07] noise at once. We only get rid of some
[16:09] of it. And then we feed it back into the
[16:11] network again and get rid of a little
[16:13] bit more. And then again, and then
[16:15] again. And as you can see, removing the
[16:17] noise in small baby steps like this
[16:19] eventually gives us the clear,
[16:22] high-quality original image. And this is why
[16:24] we've had to feed the network images of
[16:26] varying degrees of noise to train on
[16:28] because that's how it can work for the
[16:30] whole process of denoising.
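The "remove a little noise, feed it back in, repeat" loop can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `predict_noise` is an oracle that peeks at the clean image and merely stands in for the trained network, and the rule for how much noise to remove per step is a simple choice, not the schedule used by real diffusion samplers (which also handle noise levels more carefully).

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.random((8, 8))             # stand-in for the original image
noise = rng.normal(size=clean.shape)   # the noise we add
x = clean + noise                      # noisy image = original + noise

def predict_noise(noisy):
    """Oracle stand-in for the trained network (assumption): it returns the
    noise still present in `noisy` by peeking at the clean image."""
    return noisy - clean

num_steps = 10
errors = []
for step in range(num_steps):
    remaining = num_steps - step
    # Remove only a fraction of the predicted noise, then feed the result back in.
    x = x - predict_noise(x) / remaining
    errors.append(float(np.abs(x - clean).max()))

print([round(e, 4) for e in errors])
```

The printed errors shrink step by step until the image matches the original; with a real, imperfect network the point of the small steps is that each prediction only has to be approximately right.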
[16:33] Now, it's obvious that the network's
[16:34] going to give us the same fish all the
[16:36] time because we only trained it on one
[16:38] image. But allow me to demonstrate what
[16:40] happens when I take 5,000 32x32 images
[16:44] of ships from the famous CIFAR-10 data
[16:47] set. The CIFAR-10 data set has 10
[16:50] classes corresponding to 10 different
[16:52] objects, each with thousands of 32x32
[16:55] images. Now, I'm not going to lie, at
[16:57] first I accidentally put all 10 classes
[16:59] of images into the network, so it was
[17:01] going 10 times slower than it needed to
[17:02] be, and I stopped it early. But then I
[17:04] looked at some of the results and while
[17:06] most of it was nonsense, here is what I
[17:09] believe to be a red panda wearing
[17:11] sunglasses and using a green tent as a
[17:14] turtle shell. And this I think is an
[17:17] orange boat wearing ice skating shoes
[17:20] with a mohawk.
[17:22] I don't know how this happens. Maybe
[17:23] that's the magic of AI. But anyways, I
[17:25] started training it again on the ship
[17:27] images. And at first it was giving us
[17:29] rubbish. But eventually we can see that
[17:30] it comes up with completely new images
[17:32] of ships using the knowledge that it's
[17:34] gathered. And that, my friend, is called a
[17:38] diffusion model. Now maybe it's come to
[17:40] your attention that what we have so far
[17:42] is not very efficient. I mean it took
[17:44] forever to train on these 32x32 images.
[17:47] Imagine how long it's going to take for
[17:49] an HD or a 4K image. And the reason is
[17:51] we're doing the noise prediction and
[17:53] denoising directly on the pixels right
[17:55] now. And there's a lot of pixels meaning
[17:57] a lot of data. So let's think about
[17:59] whether there's a way we can reduce the
[18:01] amount of data that we have to work with
[18:02] to speed up this process. Imagine this.
[18:05] I show you this image right here and you
[18:07] have to tell your friend what's in the
[18:09] image. Are you going to read out all the
[18:11] values of the individual pixels to your
[18:13] friend in order to transfer the
[18:15] information over? No. You'll tell them a
[18:17] description of a blue sofa in a white room
[18:20] with a cactus to its right and a coffee
[18:22] table in front. And then they can use
[18:24] their life experience and knowledge of
[18:26] different objects to kind of imagine and
[18:28] reconstruct what it's roughly supposed
[18:30] to look like. It's not going to look
[18:32] exactly the same, but it's going to be
[18:34] good enough. Let's use another example
[18:37] more relevant to computers. Usually, we
[18:39] don't store images as just their raw
[18:41] uncompressed pixel values, and instead
[18:43] we use a file format like JPEG, which
[18:45] can reduce the amount of data by many,
[18:47] many times. And then when the file gets
[18:49] decoded to display on your screen, it's
[18:52] a bit lower quality than the original,
[18:53] but again, it's good enough. Now, notice
[18:56] how in both of these examples, there's a
[18:58] process of encoding, which is you coming
[19:01] up with a phrase to describe the image
[19:03] or the JPEG compression. And then
[19:05] there's a process of decoding, which is
[19:07] your friend imagining what the image is
[19:09] supposed to look like, or the JPEG
[19:12] decompression. So what people invented
[19:14] is a neural network equivalent of this
[19:17] known as autoencoders, which are
[19:19] basically trained to encode data into
[19:22] what's known as a latent space and then
[19:25] decode it as best they can back to the
[19:28] original data. Here's a demo of a latent
[19:31] space that's been trained for the MNIST
[19:33] digit data set. In this case, the 28x28
[19:37] equals 784 pixels got encoded into just
[19:42] two numbers, which means we can
[19:43] visualize it as a two-dimensional space
[19:46] and drag around this point to see what
[19:47] the different areas correspond to when
[19:49] it's decoded. Now, of course, if it's
[19:52] five or 10 numbers in the latent space
[19:53] instead of two, you can get higher
[19:55] fidelity reconstructions. In Stable
[19:58] Diffusion, RGB images which are 512x512,
[20:00] corresponding to about 786,000
[20:04] numbers are encoded into a latent space
[20:07] that's 4x64x64, or about 16,000 numbers.
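The sizes quoted here are rounded, and they can be checked with a few lines of arithmetic. The variable names are just labels chosen for this sketch.

```python
pixel_values = 512 * 512 * 3    # RGB image: width x height x 3 channels
latent_values = 4 * 64 * 64     # latent: 4 channels at 64x64

print(pixel_values)                   # 786432
print(latent_values)                  # 16384
print(pixel_values // latent_values)  # 48
```

So the latent representation holds exactly 48 times fewer numbers than the pixel data, which is where the "roughly one-fiftieth" figure comes from.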
[20:13] Now that's about one-fiftieth of the original amount
[20:15] of data. So instead of directly adding
[20:18] noise to images in their pixel space and
[20:21] then denoising those images, the images
[20:23] are first encoded into this latent space
[20:27] and then we noise and denoise that
[20:30] and then when it's decoded we roughly
[20:32] get the original image again. This is
[20:35] called a latent diffusion model and it's
[20:38] one of the key improvements to the basic
[20:40] diffusion model because it's so many
[20:42] times faster than running denoising on
[20:45] the raw uncompressed data. Now, up until
[20:47] this point, we still haven't addressed
[20:49] something very important. How do we make
[20:51] it generate images based on a text
[20:54] prompt? Not going to lie, I think this
[20:56] might be where it gets really hard to
[20:58] understand. So, if you don't get
[21:00] anything from here on out, don't stress
[21:01] about it. Because if you made it this
[21:02] far in the video, that's already pretty
[21:04] impressive. But anyways, here's where we
[21:06] use that embedding concept from earlier.
[21:09] All these words, which are discrete
[21:11] variables, have to be encoded into
[21:13] vectors, just like the sequence position
[21:15] numbers representing how noisy the
[21:17] images are. The way that people found
[21:19] good embeddings for words is this method
[21:22] called word2vec. I won't go into too
[21:24] much detail, but basically they had a
[21:27] list of a bunch of vectors, one for each
[21:30] word in the English language. And
[21:32] actually, they had two of these lists.
[21:35] Then they used a huge amount of text
[21:37] that's been written by humans, from
[21:39] books and the internet and whatnot to
[21:41] try to adjust these two lists of word
[21:43] vectors so that the vector of a word in
[21:46] one list would be similar to vectors of
[21:49] words that it often appears next to in
[21:52] the other list. Similar, meaning it has a
[21:55] larger dot product. So for example, the
[21:57] words tall building are more likely to
[22:00] appear together than the words tall
[22:02] electricity. So if you look at the
[22:04] vector for tall in one list, it'll be
[22:07] similar to the vector for building in
[22:10] the other list, but not similar to the
[22:13] vector for electricity in the other
[22:15] list. So once you've trained it enough
[22:17] the relationship between word vectors in
[22:19] opposite lists is that the more likely
[22:22] they are to appear next to each other
[22:24] the more similar they are. But what does
[22:27] this mean for the relationships between
[22:29] words in the same list? In the same
[22:31] list, the more likely they appear in
[22:34] similar contexts, the more similar they
[22:36] are. So, if we just take one of the
[22:39] lists as the embedding vectors for all
[22:41] the words and graph it out, we'll find
[22:44] that words that are used in similar
[22:46] contexts are grouped closer together.
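The "similar means larger dot product" idea from the two-list setup can be shown with a toy sketch. The three-number vectors below are hand-picked, hypothetical values chosen only so the example works out; real word2vec vectors are learned from text and have hundreds of dimensions. The names `word_list` and `context_list` are labels for the two lists, not names from any library.

```python
import numpy as np

# Hypothetical hand-picked vectors; real word2vec vectors are learned from text.
word_list = {"tall": np.array([0.9, 0.1, 0.2])}
context_list = {
    "building":    np.array([0.8, 0.2, 0.1]),
    "electricity": np.array([-0.3, 0.9, 0.1]),
}

# Compare "tall" from one list against each word in the other list.
scores = {w: float(np.dot(word_list["tall"], v)) for w, v in context_list.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With these made-up numbers, "tall" scores higher against "building" than against "electricity", which is the relationship training is meant to produce for words that often appear next to each other.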
[22:49] Here's a cool visualization on
[22:51] projector.tensorflow.org
[22:53] where if you click a dot representing a
[22:55] word, it shows you the closest words
[22:57] around it. Of course, it seems as though
[22:59] they're kind of spread out and they're
[23:00] not really the closest words, but that's
[23:03] because this only visualizes three
[23:04] dimensions, while the word vectors
[23:06] actually have 300 dimensions, meaning
[23:09] each word is represented by 300 numbers.
[23:11] With all those dimensions, it turns out
[23:13] these word embeddings can actually
[23:15] capture some of the more nuanced
[23:17] relationships between words. And the
[23:19] most famous example is if you take the
[23:21] vector for king and you subtract the
[23:24] vector for man but then you add the
[23:26] vector for woman you end up with the
[23:30] vector for queen. Another example is you
[23:32] can take London subtract England add
[23:36] Japan and then you end up with Tokyo. So
[23:39] I hope you can see how genius this word
[23:42] embedding vector space is. Now remember
[23:44] how earlier I said there's two types of
[23:46] network layers that are really important
[23:47] to stable diffusion. And the first one
[23:49] is the convolutional layer. Well, it's
[23:52] time to introduce the second one which
[23:54] is called the self-attention layer.
[23:56] Let's think back to convolutions for a
[23:58] second. Convolutional layers extract
[24:00] features from an image using
[24:02] relationships between pixels where the
[24:05] amount that each pixel influences
[24:07] another pixel is dependent on their
[24:09] relative spatial position. So, a
[24:12] self-attention layer extracts features from a
[24:14] phrase using the relationships between
[24:15] the words where the amount of influence
[24:18] words have on each other is determined
[24:20] by their embedding vectors. To build up
[24:22] the simplest possible self-attention
[24:25] layer, it's kind of like a fully
[24:26] connected layer, but each input and
[24:29] output is a vector instead of a single
[24:30] number. And the weights of connections
[24:32] are not parameters that it tries to
[24:34] learn. Rather the weight of the
[24:37] connection between A and B is determined
[24:40] by the dot product between A and B. So in
[24:43] the simplest model the output is
[24:44] entirely dependent on the input since
[24:46] there's no parameters that we can
[24:48] control. But we do want to control it
[24:50] obviously. Like, if we have a convolution
[24:52] where the kernel just has all the same
[24:54] numbers then yeah sure it helps us
[24:56] understand how a convolution works but
[24:58] all it does is just blur the image and
[25:00] it's not that useful. So how can we
[25:02] control this self-attention layer so
[25:04] that just like we make a convolution
[25:06] detect edges for example, we make it
[25:09] detect, I don't know, words that negate
[25:11] or emphasize certain adjectives.
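The parameter-free self-attention layer described above can be sketched in NumPy. The function name `simple_self_attention` and the toy word vectors are assumptions for illustration. One step is not mentioned in the video: the sketch passes the dot products through a softmax so that each output's connection weights are positive and sum to 1, which is the usual way to turn raw scores into weights.

```python
import numpy as np

def simple_self_attention(x):
    """x has one row per word vector. Connection weights come from dot
    products between the inputs; there are no learned parameters."""
    scores = x @ x.T                                       # scores[i, j] = x[i] . x[j]
    scores = scores - scores.max(axis=1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax (assumed step)
    return weights @ x, weights

# Three toy word vectors: the first two are similar, the third is not.
words = np.array([[1.0, 0.0, 1.0],
                  [0.9, 0.1, 0.8],
                  [-1.0, 1.0, 0.0]])
output, weights = simple_self_attention(words)
print(np.round(weights, 3))
print(np.round(output, 3))
```

Each output row is a weighted mix of all the input vectors, and because the weights depend only on the inputs themselves, similar vectors influence each other the most and there is nothing to train.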
[25:14] Let's break this attention process down
[25:16] into its components. In our simple
[25:18] attention layer, the amount that A's
[25:21] input influences B's output is
[25:25] determined by the dot product. And the
[25:26] amount that B's input influences A's
[25:29] output is also determined by the dot
[25:32] product. So it's the same. Now let's
[25:34] focus on the part where A influences B.
[25:37] I'm going to characterize this process
[25:38] as a conversation between A and B, which
[25:42] sounds hella goofy, but I promise it'll
[25:44] make sense. So B goes up to A and says,
[25:47] "Hello, I'm B. Here is my ID. Show me
[25:51] your ID so that we can compare it and
[25:54] decide how much you influence my output."
[25:57] And then A says, "Yeah, I'm A. Here's my
[25:59] ID and here's my data which I'll pass
[26:03] over to your output after the
[26:04] comparison." So B's ID is called the
[26:07] query vector. A's ID is called the key
[26:11] vector. The process of comparison is the
[26:13] dot product between them, and A's data is
[26:16] called the value vector. There, I just
[26:19] explained query, key, and value in self
[26:22] attention. In our simple self attention,
[26:24] of course, the query is just vector B
[26:26] itself and both the key and the value
[26:29] are just vector A. In other words, each
[26:32] vector has to serve a total of three
[26:34] purposes over the whole process even
[26:37] though it's the same vector the whole
[26:38] time. So here's how we introduce
[26:40] parameters to control this whole self
[26:42] attention process. How do we manipulate
[26:45] vectors using matrices?
[26:51] So, we have three matrices called, can
[26:54] you guess what they're called?
[26:56] That's right, the query matrix, the key
[26:59] matrix, and the value matrix. And these
[27:01] are applied to each vector before they
[27:04] go to serve their purpose as a query,
[27:06] key, or value. Now, the amount that A
[27:09] influences B is no longer the same as
[27:12] the amount that B influences A. This way
[27:15] it's not just similar vectors that
[27:16] influence each other anymore due to the
[27:19] simple dot product. Now we can use any
[27:21] feature in each word's massive
[27:25] 300-dimensional vector, which is what makes
[27:26] self attention layers so powerful in
[27:29] extracting features in the relationships
[27:31] between words. One more thing though
[27:33] right now the output is still not
[27:35] affected by the relative position of
[27:37] each word in a phrase and that's really
[27:39] important in determining the meaning of
[27:41] the phrase. So in order to encode the
[27:43] position of each word, we once again use
[27:45] the positional encoding method covered
[27:47] earlier and just add on those positional
[27:50] embedding vectors to the word embedding
[27:52] vectors. Wow, that was another
[27:54] information overload. You need a break?
[27:58] Okay, let's take a break.
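Here is a minimal NumPy sketch of the query, key, and value matrices described above. The matrices are random stand-ins for weights that would normally be learned, the names (`self_attention`, `w_q`, `w_k`, `w_v`) are made up for illustration, and the softmax and the division by the square root of the dimension are assumptions borrowed from the standard transformer formulation rather than something stated in the video.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x has shape (num_words, dim); w_q, w_k, w_v have shape (dim, dim)."""
    q = x @ w_q  # each word's "ID" when it asks for information
    k = x @ w_k  # each word's "ID" when it is being asked
    v = x @ w_v  # the data each word hands over
    # scores[i, j]: how much word j influences the output of word i.
    # Assumption: scale by sqrt(dim), as in the standard transformer.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, scores

rng = np.random.default_rng(0)
dim = 8
words = rng.normal(size=(4, dim))  # 4 stand-in word vectors
# Random stand-ins for the three matrices that training would learn.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
out, scores = self_attention(words, w_q, w_k, w_v)

print(out.shape)                      # (4, 8)
print(np.allclose(scores, scores.T))  # False: A's influence on B no longer equals B's on A
```

Because the query and key go through different matrices, the score table stops being symmetric, which is the change described above.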
[28:00] All right, let's resume. Think about
[28:02] this. Using convolutional layers, we can
[28:04] encode an image into a small embedding
[28:06] vector. And using attention layers, we
[28:10] can also encode a text phrase into a
[28:12] small embedding vector. But look at the
[28:14] image and the text that I've selected to
[28:16] display here. The caption perfectly
[28:19] describes the image. So imagine if the
[28:22] two encoders could come up with the same
[28:25] embedding vector even though they're
[28:28] dealing with two different types of
[28:29] data. So that's exactly what OpenAI
[28:33] tried to do with their CLIP model,
[28:35] where CLIP stands for Contrastive
[28:38] Language-Image Pre-training. This CLIP
[28:40] model has both an image encoder and
[28:43] a text encoder, and they trained it on
[28:46] 400 million image-caption pairs so that images and
[28:49] captions that match are supposed to come
[28:52] out with very similar embeddings and the
[28:54] ones that don't match are supposed to
[28:56] come out with really different
[28:57] embeddings. So it kind of makes sense
[28:59] that the text embeddings that come out
[29:02] of this CLIP text encoder, which are
[29:04] already matched to encoded images, are
[29:07] perfect to stick into our denoising U-Net,
[29:10] which also encodes and decodes images as
[29:12] a part of how it works with the
[29:14] convolutions and scaling and all that.
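The training idea described above, where matching images and captions get similar embeddings and mismatched ones get different embeddings, can be sketched as a contrastive loss. This is a toy NumPy version under loud assumptions: random vectors stand in for the outputs of the two encoders, the function names are made up, and the fixed `temperature=0.07` is an illustrative choice. It is not OpenAI's actual training code.

```python
import numpy as np

def normalize(x):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Row i of image_emb and row i of text_emb are assumed to be a matching pair."""
    logits = normalize(image_emb) @ normalize(text_emb).T / temperature
    idx = np.arange(len(logits))
    # For each image, the matching caption should score highest (and vice versa).
    loss_images = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_texts = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_images + loss_texts) / 2

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(5, 16))                     # stand-in image embeddings
matched = image_emb + 0.01 * rng.normal(size=(5, 16))    # captions that agree with the images
mismatched = rng.normal(size=(5, 16))                    # unrelated captions

# Well-matched pairs should give a lower loss than unrelated ones.
print(contrastive_loss(image_emb, matched) < contrastive_loss(image_emb, mismatched))
```

Training pushes this loss down, which pulls matching image and caption embeddings together and pushes mismatched ones apart.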
[29:15] So yes, in Stable Diffusion we just take
[29:17] text embeddings generated by CLIP and
[29:20] inject them into the U-Net multiple times
[29:23] using attention layers. Well, this time
[29:25] a slightly different type of attention.
[29:28] This time it's not self attention which
[29:31] just operates on one set of input
[29:32] vectors. We're adding the text
[29:34] information into the image. So obviously
[29:36] it's two sets of input data. So instead
[29:39] it's a process called cross attention
[29:42] which is literally like self attention
[29:44] except the image is going to be the
[29:47] query and the text is going to be the
[29:49] key and value. That's it. These cross
[29:52] attention layers in the middle of the
[29:53] U-Net are going to extract relationships
[29:55] between the image and the text, so that
[29:58] whatever features are in the image can get
[30:00] influenced by the most important and
[30:02] relevant features in the text. And this
[30:04] is how we eventually train the network
[30:05] to generate images based on the text
[30:07] captions we give it. So there we have
[30:10] it. Convolutional layers learn images.
[30:12] Self attention layers learn text. And
[30:14] then when you combine the two, you can
[30:16] generate images based on text. Pretty
[30:19] interesting, is it