[0:00] We live in a world where artists are [0:02] losing their jobs because you can [0:03] generate whatever piece of art you want [0:05] with a simple text prompt within a few [0:08] seconds that looks incredibly good. More [0:11] than that, you can generate an image of [0:13] anything, even things that don't exist [0:16] in real life, just by using the right [0:18] descriptions. What happened? Just last [0:20] week, I spent 2 hours trying to connect [0:22] to a wireless printer. How did the [0:24] computers get here? This video will be [0:26] highly technical and try to explain how [0:28] Stable Diffusion works, which is [0:29] currently the best method of image [0:31] generation that we have, beating out [0:32] older technology like generative [0:35] adversarial networks, or GANs. Now, [0:37] you've seen the length of the video. [0:39] It's a long video, but if you go search [0:40] up other machine learning videos online, [0:42] this will probably still be the least [0:44] technical out of all of them because [0:46] I've tried to cut out all the math to [0:48] make everything conceptually easier to [0:50] understand while keeping the information [0:52] mostly accurate. Look, I know you're [0:54] passionate and curious about this [0:55] technology. So, I want you to try to pay [0:57] attention, even though, if I'm going to [0:58] be honest, you probably won't understand [1:00] a lot of it the first time you watch it [1:02] through. If you can grasp the intuition [1:03] and concepts well, and then you decide [1:05] to look more at the math and derivations [1:07] or to pursue a career in this field, [1:10] everything will be a lot easier to [1:11] understand. And that's part of why I [1:13] made this video. AI is the future. And [1:15] you know what they say, when there's a [1:17] gold rush, sell shovels. Now, a lot of [1:19] people are worried about AI safety, [1:20] thinking AI is going to take over the [1:22] world.
Me personally, I'm not too [1:24] pressed because I can't even get [1:26] ChatGPT to solve a simple math problem [1:28] correctly. But what I can say you should [1:29] be worried about is cybersecurity, [1:31] which our video partner today, NordVPN, [1:34] wants you to learn about. The process of [1:35] making this video involved a lot of [1:36] research and developing neural networks [1:38] on Google Colab, all of which uses the [1:41] internet, and sometimes I'd be away from [1:42] home at a public library or cafe using [1:45] the free Wi-Fi. Now, you'd be surprised [1:47] how easy it is to compromise these [1:49] public networks or make a fake network [1:51] just to steal your data. And this is [1:53] called a man-in-the-middle attack. So, [1:55] to make sure my bank account information [1:56] and passwords don't get stolen by a [1:58] hacker, I used NordVPN to make sure my [2:00] internet connection was securely [2:02] encrypted at all times. And other than [2:04] that, NordVPN also has a bunch of other [2:06] features like their threat protection [2:07] and dark web monitor features to protect [2:09] you against phishing attacks, password [2:12] leaks, malware, ransomware, and the [2:14] like. I mainly use NordVPN for security, [2:16] but sometimes I can also use it to watch [2:17] shows that are only available in certain [2:19] other countries or get plane tickets for [2:21] cheaper prices in other areas of the [2:23] world. For example, if I want to connect [2:25] to Japan's internet, I just click on [2:27] Tokyo and I'm there. So, NordVPN is [2:30] offering an exclusive deal if you go to [2:32] nordvpn.com/gonkey [2:34] where you can get a 2-year plan with [2:35] extra months for free with a 30-day [2:37] money-back guarantee. I'll leave that [2:39] link in the description and pinned [2:40] comment below. Again, that's [2:42] nordvpn.com/gonkey. [2:44] Now, let's get started. Deep learning is [2:46] all about neural networks.
So, of [2:48] course, there's many different ways that [2:49] neurons can be connected to each other. [2:51] The most basic type of neural network [2:53] consists of what are known as fully [2:55] connected layers, where every neuron in [2:57] each layer is connected to every neuron [3:00] in the next layer. But in this video, [3:02] we'll find that the process of image [3:04] generation with Stable Diffusion is [3:06] largely dependent on two special types [3:08] of network layers, each serving a very [3:11] important role. Here, I'm going to [3:13] introduce the first one, called the [3:14] convolutional layer. Pay attention, [3:16] because the second type of layer will [3:18] come along much later in the video. And [3:19] the way that it relates to convolutional [3:22] layers is kind of amazing. You see, [3:23] basic fully connected layers work well [3:26] for many different types of data, but [3:27] not images, because images have way too [3:30] many pixels. Imagine you wanted to do an [3:32] operation on a 100x100 image, which [3:35] outputs a new 100x100 image. Even if the [3:39] image was black and white and only had [3:40] one channel, that's 100 x 100 = [3:43] 10,000 pixels. So, it's 10,000 inputs [3:46] and 10,000 outputs, which means there's [3:49] 10,000 x 10,000 = 100 million [3:52] neuron connections just for a 100x100 [3:55] image. And in addition, in a fully [3:56] connected layer, each input contributes [3:59] equally to each output, which means the [4:01] relative spatial position of each pixel [4:03] is irrelevant, which kind of doesn't [4:06] make sense because obviously in an [4:07] image, pixels that are closer to each [4:09] other are more important in making up [4:11] features such as an edge compared to two [4:13] random pixels that are really far away [4:14] from each other.
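The connection count above can be checked with a few lines of Python. This is only an arithmetic sketch: `dense_connections` is a name made up for this illustration, and it ignores bias terms.

```python
def dense_connections(height, width, channels=1):
    """Weights in a fully connected layer mapping an image to a same-sized image.

    Hypothetical helper for illustration; bias terms are ignored.
    """
    n_values = height * width * channels
    return n_values * n_values  # every input value connects to every output value


connections_100 = dense_connections(100, 100)
connections_200 = dense_connections(200, 200)
print(f"{connections_100:,}")  # 100,000,000
print(f"{connections_200:,}")  # 1,600,000,000
```

Doubling the image's side length multiplies the connection count by 16, which is why this approach stops being practical for images very quickly.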
So for images, a better [4:16] type of layer is the convolutional layer, [4:19] where each output pixel is determined by [4:21] a grid of all the surrounding input [4:23] pixels. And this is done with a 2D grid [4:26] of numbers called a kernel, usually with [4:28] a size of like 3x3 or 5x5, where the [4:32] output pixel is determined by [4:34] multiplying the surrounding input pixels [4:36] by the corresponding number in the [4:37] kernel and then adding everything up. [4:40] For example, here's a vertical edge [4:42] detection kernel and here's a horizontal [4:45] edge detection kernel. You should start [4:47] to see why convolutions work so well for [4:49] images. If we have a 5x5 kernel instead [4:51] of a fully connected layer for a 100 by [4:53] 100 image, that's only 25 parameters [4:56] that we can reuse over and over again [4:58] instead of 100 million. So now that you [5:00] understand how convolutions work, it's [5:02] time to talk about their significance to [5:04] computer vision, which is basically the [5:06] field of identifying what's in an image. [5:10] Level one of computer vision is simply [5:12] image classification, where the network [5:14] just labels what is in an image. We have [5:17] to assume there's only one object in the [5:19] image, and we don't know where exactly it [5:21] is, but we know what it is. Now level two [5:24] is classification with localization, [5:27] where we can also only have one object, [5:29] but the network also gives us a bounding [5:31] box which tells us where it is in the [5:33] image. Level three is object detection. [5:36] So now the image can have multiple [5:38] objects and we get multiple bounding [5:40] boxes and labels around each of them. [5:42] But it's still a very rough estimate of [5:44] which pixels in the image belong to that object [5:46] because all the bounding boxes are just [5:48] rectangles.
So it's at level four, which [5:51] is semantic segmentation, that each pixel [5:54] in the image gets labeled for what it [5:56] is. Now we can have the exact shape of [5:58] whatever it is in the image that we want [6:00] to identify, and this is good for things [6:02] like background removal. Level five is [6:04] instance segmentation, where not only [6:07] does the program classify what thing [6:08] each pixel is, it can also identify [6:11] multiple instances of that thing, like [6:13] if there's multiple people in a picture. [6:16] The invention of Stable Diffusion starts [6:18] with level four, semantic segmentation, [6:21] and specifically for biomedical images. So [6:24] we're talking about images of cells, [6:26] neurons, blood vessels, organs and [6:28] whatnot. This is helpful for [6:30] diagnosing diseases, researching anatomy [6:33] stuff. Okay, I'm going to be honest. [6:35] It's not that important why biomedical [6:37] image segmentation is important. All [6:38] that you need to know is that people [6:40] were trying to segment images of cells. [6:42] And if you're thinking, what on earth [6:43] does this have to do with image [6:45] generation? Well, I promise you it's [6:47] going to make sense in a bit. And that's [6:48] when the genius comes. For a while, [6:50] image segmentation was inefficient and [6:52] required thousands and thousands of [6:54] training samples. For biomedical image [6:57] tasks, a lot of times there weren't [6:58] enough images. Or at least, that was the [7:00] case until 2015, when a group of computer [7:03] scientists submitted a paper proposing a [7:05] new network architecture which would [7:07] then go on to be cited by over 60,000 [7:10] people. This is definitely one of the [7:12] more influential breakthroughs in [7:14] machine learning. Let's talk about the [7:16] U-Net. [7:18] A U-Net is full of convolutional layers [7:20] in order to do semantic segmentation on [7:22] cell images.
But it's kind of weird in [7:24] how it does it because it first scales [7:26] down the image to a really low [7:28] resolution and then it scales it back up [7:31] to its original resolution. That sounds [7:33] counterintuitive at first, but it's kind [7:35] of genius. And I'm going to demonstrate [7:37] with a U-Net that I wrote myself. I wrote [7:39] this U-Net for this fish data set on [7:41] Kaggle, from which I acquired 500 images [7:44] of different fish at a fish market along [7:47] with their corresponding black and white [7:48] masks of what pixels in the image make [7:51] up the fish. If you don't know, Kaggle [7:53] is a website dedicated to data science [7:55] and machine learning. And I'll leave a [7:56] link to the data set in the description. [7:58] So, the black and white masks are what's [8:00] known as the ground truth because that's [8:02] what we are comparing the network's [8:04] outputs against and training the network [8:06] to try to achieve. At first, the U-Net's [8:09] output when you give it this fish is not [8:11] so meaningful, but after a bit of [8:13] training, [8:18] it's able to identify the shape a lot [8:20] better. The colors are like this because [8:22] the values are outside the range of 0 to [8:25] 1. So the image is rendered in what's [8:26] known as pseudo color. But if we clamp [8:29] it to between 0 and 1, the final [8:31] output is essentially the same as the [8:32] provided masks. Now that we have a [8:34] trained network, it's time to open it up [8:36] to see what's inside and figure out how [8:38] a U-Net segments images so [8:40] efficiently. Remember, prior methods [8:42] required thousands of sample images, but [8:44] I've only given this one 500 images and [8:47] it's doing pretty well. So when this RGB [8:49] image of a fish gets input into the [8:51] U-Net, it's represented in computer [8:53] memory as a 3D grid of numbers because [8:56] it has a width, height, and three [8:58] channels.
So this is a three-dimensional [9:00] tensor in machine learning language. Now, [9:03] at the start, the image only has three [9:05] channels to represent redness, [9:07] greenness, and blueness. But what if it [9:10] could have more channels to represent [9:12] more information, like what part of the [9:15] image corresponds to the body of the [9:17] fish, what part is the cutting board, [9:19] what part's the shadow, what part's the [9:20] highlights, and so on. So that's [9:22] essentially the whole point of [9:23] convolutions. It's to extract features [9:26] from an image from how the pixels relate [9:28] to each other. And what makes [9:30] convolutions even more powerful is when [9:32] there's more channels in the image than [9:34] just one channel, because then the [9:36] kernel is a 3D grid instead of just a 2D [9:38] one. [9:40] The first half of the U-Net has all these [9:41] convolutional blocks that make the [9:43] number of channels in the image go from [9:45] 3 to 64 to 128 to 256 to 512 and finally [9:51] to 1,024. [9:53] In the convolution from 64 to 128 [9:56] channels, for example, each kernel is 64 [9:59] layers deep and there's 128 of those [10:01] kernels. That's how the network can [10:04] extract more and more complex features [10:06] from the image. Slight issue though. [10:08] Even though the kernels get deeper and [10:09] deeper, they still have a fixed field of [10:12] view on the image. In this case, a 3x3 [10:15] field of view. In order to better [10:17] extract features from the image, [10:18] obviously the kernels are going to have [10:20] to see more of the image. So, how can we [10:22] make the field of view bigger? Well, [10:25] just making the kernels bigger rapidly [10:27] increases the number of parameters that [10:29] we have, which makes it inefficient. So [10:30] the U-Net uses a really smart and [10:32] efficient alternative. If we can't make [10:35] the kernels bigger, then just make the [10:38] image smaller.
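The 64-to-128-channel step described above can be sketched in NumPy to make the shapes concrete. This is a minimal illustration, not the author's network: the function names `conv2d` and `downscale2x` are made up here, the 8x8 input size is arbitrary, the weights are random rather than trained, and 2x2 max pooling is assumed as the downscaling step (the transcript does not say which method was used).

```python
import numpy as np


def conv2d(image, kernels):
    """Naive unpadded convolution (cross-correlation, as deep learning libraries do it).

    image:   (in_channels, H, W)
    kernels: (out_channels, in_channels, k, k)
    returns: (out_channels, H - k + 1, W - k + 1)
    """
    out_ch, in_ch, k, _ = kernels.shape
    _, h, w = image.shape
    out = np.zeros((out_ch, h - k + 1, w - k + 1))
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            patch = image[:, y:y + k, x:x + k]  # (in_ch, k, k) window
            # Multiply the window by every kernel and sum over channels and window.
            out[:, y, x] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


def downscale2x(image):
    """2x2 max pooling: halves height and width, keeps the channel count."""
    c, h, w = image.shape
    cropped = image[:, :h - h % 2, :w - w % 2]
    return cropped.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


rng = np.random.default_rng(0)
image = rng.standard_normal((64, 8, 8))          # 64 channels, tiny 8x8 resolution
kernels = rng.standard_normal((128, 64, 3, 3))   # 128 kernels, each 64 layers deep

features = conv2d(image, kernels)
pooled = downscale2x(features)
print(features.shape, pooled.shape, kernels.size)
```

Each 3x3 kernel still only sees a 3x3 window, but after the downscale a 3x3 window on the smaller image covers roughly a 6x6 area of the original, without adding a single parameter.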
So after every two [10:40] convolutional blocks, the image gets [10:42] scaled down before it goes into the next [10:44] two convolutional blocks. This increased [10:46] field of view is how the network can [10:48] capture more context within the image to [10:51] better understand it. So let's see what [10:53] our fish has turned into in the middle [10:55] of the U-Net, where there's the most [10:56] channels, but the resolution [10:58] is the smallest. Out of the 1,024 [11:01] channels, we can see that some of them [11:03] highlight the body of the fish, some of [11:05] them the background, some of them [11:07] highlight the brighter area above the [11:08] fish, and some of them the darker area [11:11] below, just as we said before. So, at [11:14] this point, the network has learned all [11:16] the information on what is in the image, [11:18] but the downscaling has made it lose [11:20] information on where it is in the image. [11:23] So in the second half of the U-Net, we [11:25] start scaling it back up again and [11:27] decrease the number of channels using [11:28] these convolutional blocks to kind of [11:31] consolidate and summarize all that [11:33] information that we gathered in the [11:34] first half. But how do we get back all [11:36] the lost detail from the downsampling in [11:38] the first half? The answer is what's [11:40] known as residual connections, where [11:42] every time the resolution is increased, [11:45] the information from the previous time [11:47] the image was that resolution is [11:49] literally just slapped onto the back and [11:50] combined with it. And then the [11:52] convolutional layers mix the information [11:54] back in. If we compare the fish image at [11:56] its highest resolution in the beginning [11:59] to where it's at its highest resolution [12:01] in the end, we can see that the [12:02] different parts of the image are much [12:04] better segmented.
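The "slapped onto the back" step can be sketched as a channel-wise concatenation. This is a simplified illustration with made-up names (`upscale2x`, `skip_concat`) and arbitrary sizes; it assumes nearest-neighbour upscaling, whereas a real U-Net typically uses a learned up-convolution that also changes the channel count.

```python
import numpy as np


def upscale2x(x):
    """Nearest-neighbour upsampling: (C, H, W) -> (C, 2H, 2W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def skip_concat(decoder_features, encoder_features):
    """Upscale the decoder features, then append the saved encoder features as extra channels."""
    up = upscale2x(decoder_features)
    return np.concatenate([up, encoder_features], axis=0)


encoder_saved = np.ones((64, 32, 32))   # saved on the way down, still has fine detail
decoder_low = np.zeros((128, 16, 16))   # coming back up, half the resolution

merged = skip_concat(decoder_low, encoder_saved)
print(merged.shape)  # (192, 32, 32)
```

The following convolution then sees both the "what" channels from the bottom of the network and the "where" detail from the earlier, higher-resolution stage.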
And that's how, through [12:07] one final convolution, we get this very [12:09] clean mask. Yeah. So, U-Nets are really [12:13] good at segmenting images. There was [12:14] this international image segmenting [12:16] competition where the people who [12:18] invented the U-Net just went in there and [12:20] demolished everyone. Here's them getting [12:21] the award for it. What a bunch of nerds, [12:23] to be honest. No, I'm just kidding. I [12:25] mean that in an endearing way. But [12:27] anyways, okay. When are we actually [12:29] going to get to the image generation? [12:31] We're getting there. Listen up. The U-Net [12:33] is so good at identifying things within [12:36] an image that people started using it [12:38] for other stuff other than semantic [12:40] segmentation. Specifically, it could be [12:42] used to denoise an image. If a noisy [12:44] image is just the sum of the original [12:46] image plus some noise, then if you [12:49] identify the noise in the image, then [12:52] you can just minus it away to get the [12:54] original. In fact, that's exactly what [12:56] we're going to try to do. So, allow me [12:58] to demonstrate with another image of a [13:00] fish, this time with a resolution of 64x [13:03] 64. This time, there is no black and [13:06] white ground truth mask to go with it. [13:09] Instead, we generate a bunch of noise to [13:10] be our ground truth because that's what [13:13] we're trying to train the network to [13:14] identify. It's important that during [13:16] training, we train on many copies of the [13:18] image with different amounts of noise [13:20] added in so that it's able to [13:22] denoise really noisy images as well as [13:25] not so noisy ones. And here's where an [13:28] interesting challenge arises. How do we [13:30] provide the network with the knowledge [13:32] of how noisy each image sample is? [13:35] Because that's obviously going to affect [13:36] the outcome.
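The training setup just described (noisy image in, noise as ground truth) can be sketched as follows. This is a simplified illustration: `make_training_pair` and the ten noise levels are made up for this example, a random array stands in for the fish image, and real diffusion models also scale the image down as noise is added, which is skipped here.

```python
import numpy as np


def make_training_pair(image, noise_level, rng):
    """Return (noisy_image, added_noise); the added noise is what the network should predict."""
    added_noise = noise_level * rng.standard_normal(image.shape)
    return image + added_noise, added_noise


rng = np.random.default_rng(0)
image = rng.random((3, 64, 64))            # stand-in for the 64x64 RGB fish image
noise_levels = np.linspace(0.1, 1.0, 10)   # hypothetical sequence of noise levels

pairs = [make_training_pair(image, level, rng) for level in noise_levels]

# If the network predicted the noise perfectly, subtracting it recovers the image.
noisiest_input, noisiest_target = pairs[-1]
recovered = noisiest_input - noisiest_target
print(np.allclose(recovered, image))
```

The position of each sample in `noise_levels` is the extra piece of information the network needs to be told about.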
So if you imagine all the [13:39] possible noise levels placed in a [13:41] sequence, the information here of how [13:44] noisy any sample is is basically a [13:47] number giving that sample's position in that [13:50] sequence. So this is called positional [13:52] encoding. So let's say for this [13:54] particular image, its noise level [13:56] corresponds to the 10th position in the [13:58] sequence. Now we've got a 64x64 image [14:01] with three channels, meaning there's [14:03] 12,288 [14:05] numbers in total. Do we just slap a 10 [14:08] on the end, making it 12,289 numbers? Is [14:11] that going to work? No. So, here's how [14:13] positional encoding works. And I get it. [14:15] You might be thinking, okay, this seems [14:17] like not such a significant detail. Why [14:20] do we need to go through it? This might [14:21] be like the fifth time I've said this, [14:23] but it's going to come up later again. [14:25] It's going to be important. Positional [14:26] encoding is a type of embedding, which is [14:29] when you take discrete variables like [14:32] words (hint: later on) or, in this case, [14:35] positions in a sequence and turn them into [14:38] a vector of continuous numbers to feed [14:41] to the network as a more digestible form [14:44] of information that it can then use. The [14:47] way that our 10 gets converted into a [14:49] vector of continuous numbers is using [14:51] these sine and cosine equations here, so [14:53] that the vector of numbers always stays [14:55] within a fixed range. But each position [14:57] is encoded by a unique combination of [15:00] numbers in the vector since the [15:02] different elements of the vector are [15:03] given by sine and cosine functions of [15:05] different frequencies.
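The equations shown on screen are not captured in the transcript, so the sketch below assumes the standard sinusoidal formulation from the Transformer literature, where pairs of elements use sin and cos at geometrically spaced frequencies. The function name, the 8-element vector size and the base of 10000 are illustrative choices.

```python
import math


def positional_encoding(position, dim, base=10000.0):
    """Sinusoidal embedding of an integer position as `dim` numbers in [-1, 1]."""
    if dim % 2 != 0:
        raise ValueError("dim must be even (one sin/cos pair per frequency)")
    vec = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))  # lower frequency for later pairs
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec


emb_10 = positional_encoding(10, 8)   # noise level at the 10th position
emb_11 = positional_encoding(11, 8)   # its neighbour gets a different vector
print([round(v, 3) for v in emb_10])
print([round(v, 3) for v in emb_11])
```

Every element stays between -1 and 1 no matter how large the position is, which is what makes this friendlier to the network than appending the raw number 10.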
And then it gets [15:07] added onto the image data repeatedly at [15:10] every point in the U-Net where it changes [15:12] resolution to really drill in the [15:14] information of how much noise is in the [15:17] image, just to help the network get it [15:19] right. Okay, that was an information [15:22] overload, but we can finally start the [15:24] training process. And you can see that [15:25] at first the noise that the network [15:27] predicts is obviously way off. It looks [15:29] nothing like the actual noise that we [15:30] gave it. But after a while, it looks [15:32] pretty similar to the ground truth [15:33] noise. And we can't really notice any [15:36] improvements anymore. So let's instead [15:38] show the denoised version where we minus [15:41] this prediction to get an image and see [15:44] how that improves. [15:50] So yes, as you can see, we do end up [15:53] with the original fish image, but it is [15:55] kind of low quality and blurry. This is [15:57] because trying to go from pure noise to [16:00] the original image all in one step is [16:02] too hard. So instead, what we're going [16:04] to do is we don't get rid of all the [16:07] noise at once. We only get rid of some [16:09] of it. And then we feed it back into the [16:11] network again and get rid of a little [16:13] bit more. And then again, and then [16:15] again. And as you can see, removing the [16:17] noise in small baby steps like this [16:19] eventually gives us the clear, high [16:22] quality original image. And this is why [16:24] we've had to feed the network images of [16:26] varying degrees of noise to train on, [16:28] because that's how it can work for the [16:30] whole process of denoising. [16:33] Now, it's obvious that the network's [16:34] going to give us the same fish all the [16:36] time because we only trained it on one [16:38] image. But allow me to demonstrate what [16:40] happens when I take 5,000 32x32 images [16:44] of ships from the famous CIFAR-10 data [16:47] set.
The CIFAR-10 data set has 10 [16:50] classes corresponding to 10 different [16:52] objects, each with thousands of 32x32 [16:55] images. Now, I'm not going to lie, at [16:57] first I accidentally put all 10 classes [16:59] of images into the network, so it was [17:01] going 10 times slower than it needed to [17:02] be, and I stopped it early. But then I [17:04] looked at some of the results, and while [17:06] most of it was nonsense, here is what I [17:09] believe to be a red panda wearing [17:11] sunglasses and using a green tent as a [17:14] turtle shell. And this, I think, is an [17:17] orange boat wearing ice skating shoes [17:20] with a mohawk. [17:22] I don't know how this happens. Maybe [17:23] that's the magic of AI. But anyways, I [17:25] started training it again on the ship [17:27] images. And at first it was giving us [17:29] rubbish. But eventually we can see that [17:30] it comes up with completely new images [17:32] of ships using the knowledge that it's [17:34] gathered. And that, my friend, is called a [17:38] diffusion model. Now maybe it's come to [17:40] your attention that what we have so far [17:42] is not very efficient. I mean, it took [17:44] forever to train on these 32x32 images. [17:47] Imagine how long it's going to take for [17:49] an HD or a 4K image. And the reason is [17:51] we're doing the noise prediction and [17:53] denoising directly on the pixels right [17:55] now. And there's a lot of pixels, meaning [17:57] a lot of data. So let's think about [17:59] whether there's a way we can reduce the [18:01] amount of data that we have to work with [18:02] to speed up this process. Imagine this. [18:05] I show you this image right here and you [18:07] have to tell your friend what's in the [18:09] image. Are you going to read out all the [18:11] values of the individual pixels to your [18:13] friend in order to transfer the [18:15] information over? No.
You'll tell them a [18:17] description of a blue sofa in a white room [18:20] with a cactus to its right and a coffee [18:22] table in front. And then they can use [18:24] their life experience and knowledge of [18:26] different objects to kind of imagine and [18:28] reconstruct what it's roughly supposed [18:30] to look like. It's not going to look [18:32] exactly the same, but it's going to be [18:34] good enough. Let's use another example [18:37] more relevant to computers. Usually, we [18:39] don't store images as just their raw [18:41] uncompressed pixel values, and instead [18:43] we use a file format like JPEG, which [18:45] can reduce the amount of data by many, [18:47] many times. And then when the file gets [18:49] decoded to display on your screen, it's [18:52] a bit lower quality than the original, [18:53] but again, it's good enough. Now, notice [18:56] how in both of these examples, there's a [18:58] process of encoding, which is you coming [19:01] up with a phrase to describe the image [19:03] or the JPEG compression. And then [19:05] there's a process of decoding, which is [19:07] your friend imagining what the image is [19:09] supposed to look like, or the JPEG [19:12] decompression. So what people invented [19:14] is a neural network equivalent of this [19:17] known as autoencoders, which are [19:19] basically trained to encode data into [19:22] what's known as a latent space and then [19:25] decode it the best that they can back to the [19:28] original data. Here's a demo of a latent [19:31] space that's been trained for the MNIST [19:33] digit data set. In this case, the 28x28 [19:37] = 784 pixels got encoded into just [19:42] two numbers, which means we can [19:43] visualize it as a two-dimensional space [19:46] and drag around this point to see what [19:47] the different areas correspond to when [19:49] it's decoded.
Now, of course, if it's [19:52] five or 10 numbers in the latent space [19:53] instead of two, you can get higher [19:55] fidelity reconstructions. In Stable [19:58] Diffusion, RGB images which are 512 x [20:00] 512, corresponding to about 786,000 [20:04] numbers, are encoded into a latent space [20:07] that's 4x64x64, or about 16,000 numbers. [20:13] Now that's about 1/50th of the original amount [20:15] of data. So instead of directly adding [20:18] noise to images in their pixel space and [20:21] then denoising those images, the images [20:23] are first encoded into this latent space, [20:27] and then we noise and denoise [20:30] that, and then when it's decoded we roughly [20:32] get the original image again. This is [20:35] called a latent diffusion model, and it's [20:38] one of the key improvements to the basic [20:40] diffusion model because it's so many [20:42] times faster than running denoising on [20:45] the raw uncompressed data. Now, up until [20:47] this point, we still haven't addressed [20:49] something very important. How do we make [20:51] it generate images based on a text [20:54] prompt? Not going to lie, I think this [20:56] might be where it gets really hard to [20:58] understand. So, if you don't get [21:00] anything from here on out, don't stress [21:01] about it. Because if you made it this [21:02] far in the video, that's already pretty [21:04] impressive. But anyways, here's where we [21:06] use that embedding concept from earlier. [21:09] All these words, which are discrete [21:11] variables, have to be encoded into [21:13] vectors, just like the sequence position [21:15] numbers representing how noisy the [21:17] images are. The way that people found [21:19] good embeddings for words is this method [21:22] called word2vec. I won't go into too [21:24] much detail, but basically they had a [21:27] list of a bunch of vectors, one for each [21:30] word in the English language. And [21:32] actually, they had two of these lists.
[21:35] Then they used data of all the text [21:37] that's ever been written by humans from [21:39] books and the internet and whatnot to [21:41] try to adjust these two lists of word [21:43] vectors so that the vector of a word in [21:46] one list would be similar to vectors of [21:49] words that it often appears next to in [21:52] the other list. Similar meaning it has a [21:55] larger dot product. So for example, the [21:57] words "tall building" are more likely to [22:00] appear together than the words "tall [22:02] electricity". So if you look at the [22:04] vector for "tall" in one list, it'll be [22:07] similar to the vector for "building" in [22:10] the other list, but not similar to the [22:13] vector for "electricity" in the other [22:15] list. So once you've trained it enough, [22:17] the relationship between word vectors in [22:19] opposite lists is that the more likely [22:22] they are to appear next to each other, [22:24] the more similar they are. But what does [22:27] this mean for the relationships between [22:29] words in the same list? In the same [22:31] list, the more likely they appear in [22:34] similar contexts, the more similar they [22:36] are. So, if we just take one of the [22:39] lists as the embedding vectors for all [22:41] the words and graph it out, we'll find [22:44] that words that are used in similar [22:46] contexts are grouped closer together. [22:49] Here's a cool visualization on [22:51] projector.tensorflow.org [22:53] where if you click a dot representing a [22:55] word, it shows you the closest words [22:57] around it. Of course, it seems as though [22:59] they're kind of spread out and they're [23:00] not really the closest words, but that's [23:03] because this only visualizes three [23:04] dimensions, while the word vectors [23:06] actually have 300 dimensions, meaning [23:09] each word is represented by 300 numbers.
[23:11] With all those dimensions, it turns out [23:13] these word embeddings can actually [23:15] capture some of the more nuanced [23:17] relationships between words. And the [23:19] most famous example is if you take the [23:21] vector for "king" and you subtract the [23:24] vector for "man" but then you add the [23:26] vector for "woman", you end up with the [23:30] vector for "queen". Another example is you [23:32] can take London, subtract England, add [23:36] Japan, and then you end up with Tokyo. So [23:39] I hope you can see how genius this word [23:42] embedding vector space is. Now remember [23:44] how earlier I said there's two types of [23:46] network layers that are really important [23:47] to Stable Diffusion, and the first one [23:49] is the convolutional layer? Well, it's [23:52] time to introduce the second one, which [23:54] is called the self-attention layer. [23:56] Let's think back to convolutions for a [23:58] second. Convolutional layers extract [24:00] features from an image using [24:02] relationships between pixels, where the [24:05] amount that each pixel influences [24:07] another pixel is dependent on their [24:09] relative spatial position. So, a [24:12] self-attention layer extracts features from a [24:14] phrase using the relationships between [24:15] the words, where the amount of influence [24:18] words have on each other is determined [24:20] by their embedding vectors. To build up [24:22] the simplest possible self-attention [24:25] layer, it's kind of like a fully [24:26] connected layer, but each input and [24:29] output is a vector instead of a single [24:30] number. And the weights of connections [24:32] are not parameters that it tries to [24:34] learn. Rather, the weight of the [24:37] connection between A and B is determined [24:40] by the dot product between A and B. So in [24:43] the simplest model, the output is [24:44] entirely dependent on the input since [24:46] there's no parameters that we can [24:48] control.
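The parameter-free layer just described can be sketched in a few lines of NumPy. One assumption goes beyond the transcript: the dot products are passed through a softmax so that each output is a weighted average of the inputs, which is the usual convention. The three 2-number "word vectors" are toy values made up for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Turn scores into positive weights that sum to 1 along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def simple_self_attention(x):
    """x: (n_words, dim). No learned parameters: weights come from dot products alone."""
    scores = x @ x.T              # dot product between every pair of word vectors
    weights = softmax(scores)     # row i: how much each word influences output i
    return weights @ x, weights


# Toy embeddings: the first two vectors are similar, the third is different.
x = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
out, weights = simple_self_attention(x)
print(np.round(weights, 3))
```

Because the raw score between A and B is the same dot product in both directions, the layer has nothing to tune: similar vectors always attend to each other and that is all it can do.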
But we do want to control it, [24:50] obviously. Like if we have a convolution [24:52] where the kernel just has all the same [24:54] numbers, then yeah, sure, it helps us [24:56] understand how a convolution works, but [24:58] all it does is just blur the image and [25:00] it's not that useful. So how can we [25:02] control this self-attention layer so [25:04] that, just like we make a convolution [25:06] detect edges for example, we make it [25:09] detect, I don't know, words that negate [25:11] or emphasize certain adjectives? [25:14] Let's break this attention process down [25:16] into its components. In our simple [25:18] attention layer, the amount that A's [25:21] input influences B's output is [25:25] determined by the dot product. And the [25:26] amount that B's input influences A's [25:29] output is also determined by the dot [25:32] product. So it's the same. Now let's [25:34] focus on the part where A influences B. [25:37] I'm going to characterize this process [25:38] as a conversation between A and B, which [25:42] sounds hella goofy, but I promise it'll [25:44] make sense. So B goes up to A and says, [25:47] "Hello, I'm B. Here is my ID. Show me [25:51] your ID so that we can compare it and [25:54] decide how much you influence my output." [25:57] And then A says, "Yeah, I'm A. Here's my [25:59] ID and here's my data, which I'll pass [26:03] over to your output after the [26:04] comparison." So B's ID is called the [26:07] query vector. A's ID is called the key [26:11] vector. The process of comparison is the [26:13] dot product between them, and A's data is [26:16] called the value vector. There, I just [26:19] explained query, key, and value in self- [26:22] attention. In our simple self-attention, [26:24] of course, the query is just vector B [26:26] itself, and both the key and the value [26:29] are just vector A.
In other words, each [26:32] vector has to serve a total of three [26:34] purposes over the whole process, even [26:37] though it's the same vector the whole [26:38] time. So here's how we introduce [26:40] parameters to control this whole [26:42] self-attention process. How do we manipulate [26:45] vectors using matrices? [26:51] So, we have three matrices called, can [26:54] you guess what they're called? [26:56] That's right: the query matrix, the key [26:59] matrix, and the value matrix. These [27:01] are applied to each vector before it [27:04] goes to serve its purpose as a query, [27:06] key, or value. Now, the amount that A [27:09] influences B is no longer the same as [27:12] the amount that B influences A. This way, [27:15] it's no longer just similar vectors that [27:16] influence each other because of the [27:19] simple dot product. Now we can use any [27:21] feature in each word's massive [27:25] 300-dimensional vector, which is what makes [27:26] self-attention layers so powerful at [27:29] extracting features from the relationships [27:31] between words. One more thing, though: [27:33] right now, the output is still not [27:35] affected by the relative position of [27:37] each word in a phrase, and that's really [27:39] important in determining the meaning of [27:41] the phrase. So in order to encode the [27:43] position of each word, we once again use [27:45] the positional encoding method covered [27:47] earlier and just add those positional [27:50] embedding vectors to the word embedding [27:52] vectors. Wow, that was another [27:54] information overload. You need a break? [27:58] Okay, let's take a break. [28:00] All right, let's resume. Think about [28:02] this. Using convolutional layers, we can [28:04] encode an image into a small embedding [28:06] vector. And using attention layers, we [28:10] can also encode a text phrase into a [28:12] small embedding vector.
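Here is a rough NumPy sketch of self-attention with the query, key, and value matrices and positional vectors described above. Several things are assumptions for illustration: the sinusoidal form of the positional encoding, the softmax and the square-root scaling (standard in practice but not mentioned in the video), the random matrices standing in for learned weights, and the small sizes used instead of 300 dimensions.

```python
import numpy as np

def sinusoidal_positions(n_pos, dim):
    """Sinusoidal positional vectors (assumed form of the encoding)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    # Even columns use sine, odd columns use cosine.
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Self-attention where each vector is projected before serving each role."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaling is an assumption
    return softmax(scores) @ v, scores

rng = np.random.default_rng(0)
n_words, dim = 5, 8  # toy sizes
words = rng.normal(size=(n_words, dim))         # stand-in word embeddings
x = words + sinusoidal_positions(n_words, dim)  # add position information
# Random stand-ins for the learned query, key, and value matrices.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, scores = self_attention(x, w_q, w_k, w_v)
print(out.shape)
```

Because the query and key projections differ, `scores` is generally no longer symmetric, which is the point made above about A's influence on B differing from B's influence on A.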
But look at the [28:14] image and the text that I've selected to [28:16] display here. The caption perfectly [28:19] describes the image. So imagine if the [28:22] two encoders could come up with the same [28:25] embedding vector even though they're [28:28] dealing with two different types of [28:29] data. That's exactly what OpenAI [28:33] tried to do with their CLIP model, [28:35] where CLIP stands for Contrastive [28:38] Language-Image Pre-training. This CLIP [28:40] model has both an image encoder and [28:43] a text encoder, and they trained it on [28:46] 400 million captioned images so that images and [28:49] captions that match are supposed to come [28:52] out with very similar embeddings, and the [28:54] ones that don't match are supposed to [28:56] come out with really different [28:57] embeddings. So it kind of makes sense [28:59] that the text embeddings that come out [29:02] of CLIP's text encoder, which are [29:04] already matched to encoded images, are [29:07] perfect to stick into our denoising U-Net, [29:10] which also encodes and decodes images as [29:12] part of how it works, with the [29:14] convolutions and scaling and all that. [29:15] So yes, in Stable Diffusion we just take [29:17] text embeddings generated by CLIP and [29:20] inject them into the U-Net multiple times [29:23] using attention layers. Well, this time [29:25] it's a slightly different type of attention. [29:28] This time it's not self-attention, which [29:31] just operates on one set of input [29:32] vectors. We're adding the text [29:34] information into the image, so [29:36] there are two sets of input data. Instead, [29:39] it's a process called cross-attention, [29:42] which is just like self-attention [29:44] except the image provides the [29:47] queries and the text provides the [29:49] keys and values. That's it. These [29:52] cross-attention layers in the middle of the [29:53] U-Net are going to extract relationships [29:55] between the image and the text.
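Cross-attention as just described can be sketched in NumPy along these lines. The function name, all array sizes, the softmax, the scaling, and the random matrices standing in for learned weights are assumptions for illustration; this is not Stable Diffusion's actual implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(image_feats, text_embs, w_q, w_k, w_v):
    """Image positions ask the questions; text tokens supply keys and values."""
    q = image_feats @ w_q  # queries come from the image
    k = text_embs @ w_k    # keys come from the text
    v = text_embs @ w_v    # values come from the text
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Each image position receives a weighted mix of text values.
    return weights @ v, weights

rng = np.random.default_rng(0)
n_pixels, img_dim = 16, 32  # e.g. a flattened 4x4 feature map (made-up sizes)
n_tokens, txt_dim = 6, 24   # made-up number of text tokens and embedding size
d_attn = 8

image_feats = rng.normal(size=(n_pixels, img_dim))
text_embs = rng.normal(size=(n_tokens, txt_dim))
w_q = rng.normal(size=(img_dim, d_attn))
w_k = rng.normal(size=(txt_dim, d_attn))
w_v = rng.normal(size=(txt_dim, d_attn))

out, weights = cross_attention(image_feats, text_embs, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The weight matrix has one row per image position and one column per text token, so each row shows how strongly that position attends to each word.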
That way, [29:58] whatever features are in the image can get [30:00] influenced by the most important and [30:02] relevant features in the text. This [30:04] is how we eventually train the network [30:05] to generate images based on the text [30:07] captions we give it. So there we have [30:10] it. Convolutional layers learn images. [30:12] Self-attention layers learn text. And [30:14] then when you combine the two, you can [30:16] generate images based on text. Pretty [30:19] interesting, is it