But how do AI images and videos actually work? | Guest video by Welch Labs

0h 37m video Published Jul 25, 2025 Transcribed Jun 14, 2026 3Blue1Brown


Advanced

20 min read

For: Machine learning practitioners and researchers with a background in deep learning and some familiarity with generative models.

AI Trust Score 95/100

✅ Highly Legit

"Title accurately describes the content: a deep dive into how AI images/videos work, with clear explanations and examples."

AI Summary

This video explains how AI image and video generation models work, focusing on diffusion models and their connection to physics. It covers key concepts like CLIP embeddings, the diffusion process, and guidance techniques, using a toy 2D dataset to build intuition.

Chapters

1 Introduction and Motivation 00:03
2 CLIP: Learning Shared Text-Image Embeddings 02:22
3 Diffusion Models: From Noise to Images 08:16
4 DDIM and Faster Generation 21:59
5 Combining CLIP and Diffusion: Text-Conditioned Generation 25:59
6 Conclusion and Reflections 34:40

[00:03]

AI video generation and physics

AI image/video models use a process called diffusion, analogous to Brownian motion but run backwards in a high-dimensional space.

[00:42]

WAN 2.1 open-source model

The video uses WAN 2.1, an open-source text-to-video model, to demonstrate generation. It starts with random noise and iteratively denoises it.

[02:22]

Three parts of the video

The video covers: 1) CLIP model for shared text-image embeddings, 2) diffusion process and its physics connection, 3) combining CLIP and diffusion for text-guided generation.

[03:58]

CLIP model architecture

CLIP consists of two models (text and image encoder) trained on 400 million image-text pairs to produce similar embeddings for matching pairs.

[05:28]

Contrastive learning in CLIP

CLIP uses a contrastive objective: maximize similarity between matching image-text pairs and minimize similarity between non-matching pairs, using cosine similarity.
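The contrastive objective can be sketched as a symmetric cross-entropy over a matrix of cosine similarities. This is a minimal illustration, not CLIP's actual training code: the function names, the toy 2D embeddings, and the fixed temperature of 0.07 are assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Pair i of images and texts is the matching pair; all others are negatives."""
    n = len(image_embs)
    sims = [[cosine_similarity(img, txt) / temperature for txt in text_embs]
            for img in image_embs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_img = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    loss_txt = sum(cross_entropy_row(cols[j], j) for j in range(n)) / n
    return 0.5 * (loss_img + loss_txt)

# Toy embeddings (assumed values): matching pairs point in similar directions.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [texts[1], texts[0]]

loss_aligned = clip_style_loss(images, texts)
loss_swapped = clip_style_loss(images, swapped_texts)
```

With these toy values the loss is low when matching pairs sit on the diagonal and much higher when the captions are swapped, which is the signal that pulls matching embeddings together during training.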

[06:16]

Properties of CLIP embedding space

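The embedding space supports vector arithmetic. For example, subtracting the embedding of an image of a person without a hat from the embedding of the same person wearing a hat yields a difference vector whose closest text embedding is 'hat'.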
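The lookup behind this kind of embedding arithmetic can be sketched as a nearest-neighbor search by cosine similarity. The 3D vectors below are made up purely for illustration (real CLIP embeddings have hundreds of dimensions), and `nearest_word` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_word(query, word_embeddings):
    """Return the word whose embedding has the highest cosine similarity to query."""
    return max(word_embeddings,
               key=lambda word: cosine_similarity(query, word_embeddings[word]))

# Made-up embeddings: the two images differ only along the third axis.
with_hat = [0.8, 0.6, 0.9]
no_hat = [0.8, 0.6, 0.1]
difference = [a - b for a, b in zip(with_hat, no_hat)]

# Made-up text embeddings for a tiny vocabulary.
words = {
    "hat": [0.1, 0.0, 1.0],
    "dog": [1.0, 0.1, 0.0],
    "car": [0.0, 1.0, 0.1],
}
closest = nearest_word(difference, words)
```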

[08:20]

DDPM paper introduction

The DDPM paper (2020) showed high-quality image generation by reversing a diffusion process that adds noise step by step.

[09:48]

Key details of DDPM training

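During training, random noise is added to clean images, and the model learns to predict the total noise added to the original image rather than the noise from a single step. During generation, fresh random noise is also added after each denoising step.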
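A training example of this kind can be sketched on a 2D point instead of an image. The sketch assumes the linear noise schedule from the DDPM paper (1000 steps, beta from 1e-4 to 0.02); the function names are illustrative, and no network is trained here.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    return [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
            for t in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to and including step t."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars

def make_training_example(x0, alpha_bars, rng):
    """Jump straight to a random noise level t and return (x_t, t, eps)."""
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # total noise: the regression target
    signal = math.sqrt(alpha_bars[t])            # how much of x0 survives
    noise = math.sqrt(1.0 - alpha_bars[t])       # how much noise is mixed in
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, t, eps

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [1.0, -0.5]
x_t, t, eps = make_training_example(x0, alpha_bars, rng)
```

A model would take `(x_t, t)` as input and be trained with a squared-error loss against `eps`. Because `eps` is the total noise, the clean point can be recovered from `x_t` in one step if the prediction is exact.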

[11:56]

Diffusion models learn a time-varying vector field

Diffusion models can be viewed as learning a vector field that points toward the data manifold. The field is conditioned on time so that it can capture coarse structure at high noise levels and fine structure at low noise levels.

[14:57]

Why predicting total noise works

Predicting total noise reduces variance compared to step-by-step denoising, making training more efficient while preserving the same learning objective.

[17:34]

Why adding noise during generation helps

Adding noise prevents all points from collapsing to the mean of the data distribution, which would produce blurry images. It allows sampling from the learned distribution.
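The sketch below shows where the noise enters a DDPM-style sampling loop. It does not reproduce the blurring effect itself. It assumes the DDPM linear schedule with step variance equal to beta, and it replaces the trained network with `oracle_noise`, a hypothetical stand-in that is exact only for a dataset consisting of a single point.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the running product of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return betas, alpha_bars

def oracle_noise(x_t, alpha_bar_t, data_point):
    """Stand-in for a trained model: the exact total noise for a one-point dataset."""
    s, n = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - s * d) / n for x, d in zip(x_t, data_point)]

def ddpm_sample(data_point, num_steps, rng):
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in data_point]        # start from random noise
    for t in reversed(range(num_steps)):
        eps = oracle_noise(x, alpha_bars[t], data_point)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(1.0 - betas[t])
                for xi, e in zip(x, eps)]
        if t > 0:
            # Fresh noise is injected after every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

target = [2.0, -1.0]
sample = ddpm_sample(target, 1000, random.Random(0))
```

With a real dataset the model's prediction at high noise levels points roughly toward an average of many data points, and the injected noise is what keeps samples spread across the distribution instead of settling on that average.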

[21:59]

DDIM: deterministic generation

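DDIM replaces the stochastic differential equation (SDE) behind DDPM sampling with an ordinary differential equation (ODE). The Fokker-Planck equation shows that the ODE produces the same final distribution of points as the SDE, which enables high-quality generation without random noise steps and with far fewer steps.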
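A deterministic DDIM-style update (the zero-noise case) can be sketched as follows. As an assumption for the example, the trained network is again replaced by `oracle_noise`, a hypothetical stand-in that is exact only for a one-point dataset, and the sampler visits 20 of the 1000 training noise levels.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Running product of (1 - beta) for a linear beta schedule."""
    alpha_bars, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars

def oracle_noise(x_t, alpha_bar_t, data_point):
    """Stand-in for a trained model: the exact total noise for a one-point dataset."""
    s, n = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - s * d) / n for x, d in zip(x_t, data_point)]

def ddim_sample(x_start, data_point, alpha_bars, timesteps):
    """Deterministic sampling over a (possibly sparse) decreasing list of timesteps."""
    x = list(x_start)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        # After the last listed timestep, step all the way to the clean point.
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = oracle_noise(x, ab_t, data_point)
        # Estimate the clean point implied by the predicted total noise.
        x0_pred = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                   for xi, e in zip(x, eps)]
        # Re-noise that estimate to the next, lower noise level with the same eps.
        x = [math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * e
             for p, e in zip(x0_pred, eps)]
    return x

alpha_bars = make_alpha_bars(1000)
timesteps = list(range(999, -1, -50))      # 20 steps instead of 1000
rng = random.Random(0)
x_start = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
target = [2.0, -1.0]
sample_a = ddim_sample(x_start, target, alpha_bars, timesteps)
sample_b = ddim_sample(x_start, target, alpha_bars, timesteps)
```

No random numbers are drawn inside the loop, so the same starting noise always maps to the same output.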

[25:59]

Combining CLIP and diffusion: unCLIP / DALL-E 2

OpenAI's unCLIP (DALL-E 2) trains a diffusion model to reverse the CLIP image encoder, using CLIP text embeddings to guide generation.

[26:43]

Conditioning diffusion models on text

Text embeddings can be passed as additional input to the diffusion model (conditioning). Various methods exist: cross-attention, addition, concatenation.

[28:25]

Classifier-free guidance

Guidance amplifies the difference between conditional and unconditional model outputs, steering generation toward the desired class or text prompt.
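The guidance step itself is a small piece of arithmetic on two model outputs. The sketch below uses one common convention, in which a scale of 1 returns the conditional output unchanged; other write-ups parameterize the scale differently. The function name and the example vectors are assumptions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Start at the unconditional output and move along (cond - uncond), scaled."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Made-up model outputs for a single 2D point.
eps_uncond = [0.2, -0.1]   # model run without the prompt
eps_cond = [0.5, 0.3]      # model run with the prompt

same_as_cond = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
amplified = classifier_free_guidance(eps_uncond, eps_cond, 3.0)
```

Scales above 1 push the result past the conditional output, in the direction that separates prompted from unprompted behavior.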

[33:39]

WAN 2.1 uses negative prompting

WAN 2.1 uses a negative prompt (e.g., 'extra fingers, walking backwards') to steer away from unwanted features. The model output conditioned on the negative prompt is subtracted from the output conditioned on the positive prompt, and the difference is amplified.
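In common classifier-free guidance implementations, a negative prompt is handled at the level of model outputs: the output conditioned on the negative prompt takes the place of the unconditional output. The sketch below assumes that formulation; the function name and numbers are illustrative and not taken from the WAN 2.1 code.

```python
import math

def guide_with_negative_prompt(eps_negative, eps_positive, guidance_scale):
    """Move away from the negative-prompt output, toward and past the positive one."""
    return [n + guidance_scale * (p - n)
            for n, p in zip(eps_negative, eps_positive)]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Made-up model outputs for a single 2D point.
eps_negative = [0.4, 0.4]    # model conditioned on the negative prompt
eps_positive = [0.1, 0.8]    # model conditioned on the desired prompt

mild = guide_with_negative_prompt(eps_negative, eps_positive, 1.0)
strong = guide_with_negative_prompt(eps_negative, eps_positive, 5.0)
gap_mild = distance(mild, eps_negative)
gap_strong = distance(strong, eps_negative)
```

Raising the scale increases the distance from the negative-prompt output along the line through the two predictions.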

Diffusion models, combined with text embeddings from models like CLIP and guidance techniques, enable high-quality text-to-image and text-to-video generation. The field has advanced rapidly since DDPM, with open-source models like WAN 2.1 and Stable Diffusion making the technology accessible.

Mentioned in this Video

WAN 2.1

tool

CLIP

tool

Stable Diffusion 2

tool

DALL-E 2

tool

Stephen Welch

person

Welch Labs YouTube channel

link

Study Flashcards (12)

What physical process is analogous to the forward process in diffusion models?

Difficulty: easy

Brownian motion.