Diffusion Models
Diffusion models generate data by learning to reverse a gradual noise-adding process. Starting from pure noise, the model iteratively denoises to produce realistic images. They have surpassed GANs in image quality and now power Stable Diffusion, DALL-E 2 and 3, Midjourney, and Sora. This page derives the forward and reverse processes, implements DDPM, explains classifier-free guidance, details the Stable Diffusion architecture, and covers LoRA and ControlNet.
The Idea
Forward Diffusion Process
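Gradually add Gaussian noise over $T$ timesteps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

where $\beta_t \in (0, 1)$ is the noise variance added at step $t$, set by the noise schedule.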
Direct Sampling at Any Timestep
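Using $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, the marginal has a closed form:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$$

This means we can sample $x_t$ directly from $x_0$ at any timestep, without iterating through the intermediate steps:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

Derivation: By recursion. At step 1: $x_1 = \sqrt{\alpha_1}\,x_0 + \sqrt{1-\alpha_1}\,\epsilon_1$. At step 2, substitute $x_1$ and merge the two independent Gaussian terms (their variances add, $\alpha_2(1-\alpha_1) + (1-\alpha_2) = 1 - \alpha_1\alpha_2$):

$$x_2 = \sqrt{\alpha_2}\,x_1 + \sqrt{1-\alpha_2}\,\epsilon_2 = \sqrt{\alpha_1\alpha_2}\,x_0 + \sqrt{1-\alpha_1\alpha_2}\,\bar\epsilon$$

By induction: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ for every $t$.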
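The closed form can be checked numerically. The sketch below is only an illustration: it assumes a linear schedule from 1e-4 to 0.02 over 1000 steps (any schedule with betas strictly between 0 and 1 would do), and the function names are made up for this example. It tracks the coefficient on $x_0$ and the accumulated noise variance one step at a time, then compares them with $\sqrt{\bar\alpha_t}$ and $1 - \bar\alpha_t$.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (assumed schedule for this check)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def iterate_forward(betas):
    """Apply x_t = sqrt(alpha_t) x_{t-1} + sqrt(beta_t) eps_t step by step.

    Returns the coefficient on x_0 and the accumulated noise variance.
    """
    coef, var = 1.0, 0.0
    for beta in betas:
        alpha = 1.0 - beta
        coef *= math.sqrt(alpha)   # signal shrinks by sqrt(alpha_t)
        var = alpha * var + beta   # old noise shrinks, fresh noise is added
    return coef, var

def closed_form(betas):
    """Closed form: coefficient sqrt(alpha_bar_t), variance 1 - alpha_bar_t."""
    alpha_bar = math.prod(1.0 - b for b in betas)
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

betas = linear_betas()
for t in (1, 10, 100, 1000):
    c_iter, v_iter = iterate_forward(betas[:t])
    c_closed, v_closed = closed_form(betas[:t])
    print(t, round(c_iter, 6), round(c_closed, 6), round(v_iter, 6), round(v_closed, 6))
```

The two pairs of columns agree up to floating-point error, which is what lets training jump straight to a random timestep instead of simulating the whole chain.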
Worked Example — Forward Noise Addition at Different Timesteps
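Setup: 1D signal $x_0 = 1.0$, with the noise draw held fixed at $\epsilon = 0.5$ so the rows can be compared directly.
Let $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, with $T = 1000$ steps and approximate schedule coefficients.
Step-by-step noising:
| $t$ | $\sqrt{\bar\alpha_t}$ | $\sqrt{1-\bar\alpha_t}$ | $x_t$ |
|---|---|---|---|
| 0 | 1.000 | 0.000 | 1.000 (clean) |
| 1 | 0.9995 | 0.0316 | 1.015 |
| 2 | 0.9985 | 0.0548 | 1.026 |
| 3 | 0.9970 | 0.0775 | 1.036 |
| 100 | ~0.951 | ~0.309 | 1.106 |
| 500 | ~0.607 | ~0.795 | 1.004 |
| 1000 | ~0.006 | ~1.000 | 0.506 (mostly noise) |
Result: At $t = 1000$ the signal coefficient is about 0.006, so $x_t \approx \epsilon$ and almost nothing of $x_0$ survives. At small $t$ the sample is still nearly clean.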
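The rows of the table above are consistent with $x_0 = 1.0$ and a fixed noise draw $\epsilon = 0.5$. Under that assumption (inferred from the tabulated values, not a property of any particular schedule), a few lines of Python recompute the last column from the two coefficient columns.

```python
# (t, sqrt(alpha_bar_t), sqrt(1 - alpha_bar_t)) copied from the table
rows = [
    (0, 1.000, 0.000),
    (1, 0.9995, 0.0316),
    (2, 0.9985, 0.0548),
    (3, 0.9970, 0.0775),
    (100, 0.951, 0.309),
    (500, 0.607, 0.795),
    (1000, 0.006, 1.000),
]

X0 = 1.0    # assumed clean signal
EPS = 0.5   # assumed fixed noise draw

def noisy_value(signal_coef, noise_coef, x0=X0, eps=EPS):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    return signal_coef * x0 + noise_coef * eps

for t, signal_coef, noise_coef in rows:
    print(f"t={t:4d}  x_t={noisy_value(signal_coef, noise_coef):.4f}")
```

With a different noise draw the $x_t$ column changes, but the coefficient columns do not: they depend only on the schedule.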
Reverse Process
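The reverse process is also Gaussian (for small $\beta_t$):

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$

The model learns the mean $\mu_\theta(x_t, t)$; in DDPM the variance $\sigma_t^2$ is fixed (for example $\sigma_t^2 = \beta_t$) rather than learned.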
Predicting Noise
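Given $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, we can solve for $x_0$:

$$x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon}{\sqrt{\bar\alpha_t}}$$

Substituting into the posterior mean, with the true noise replaced by the network prediction $\epsilon_\theta(x_t, t)$:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$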
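The substitution can be verified numerically. The sketch below uses scalar values chosen purely for illustration and hypothetical function names. It compares the noise-parameterized mean, $(x_t - \beta_t \epsilon / \sqrt{1-\bar\alpha_t}) / \sqrt{\alpha_t}$, with the standard DDPM posterior mean written in terms of $x_0$ and $x_t$.

```python
import math

def predict_x0(x_t, eps, alpha_bar_t):
    """Invert x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps for x_0."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)

def mean_from_eps(x_t, eps, beta_t, alpha_bar_t):
    """Posterior mean expressed through the noise."""
    alpha_t = 1.0 - beta_t
    return (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)

def mean_from_x0(x_t, x0, beta_t, alpha_bar_t, alpha_bar_prev):
    """Posterior mean of q(x_{t-1} | x_t, x_0) as a weighted sum of x_0 and x_t."""
    alpha_t = 1.0 - beta_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return coef_x0 * x0 + coef_xt * x_t

# Illustrative values only (not taken from any trained model)
beta_t = 0.01
alpha_bar_prev = 0.60
alpha_bar_t = alpha_bar_prev * (1.0 - beta_t)
x0, eps = 0.8, -0.3
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

print(predict_x0(x_t, eps, alpha_bar_t))
print(mean_from_eps(x_t, eps, beta_t, alpha_bar_t))
print(mean_from_x0(x_t, x0, beta_t, alpha_bar_t, alpha_bar_prev))
```

The last two printed values agree up to rounding, so predicting the noise is equivalent to predicting the posterior mean.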
Training Objective
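The simplified DDPM loss:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right) \right\rVert^2\right]$$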
This is just MSE between the true noise and the predicted noise.
Training Algorithm
repeat:
    x_0 ~ q(x_0)                          # Sample clean image
    t ~ Uniform({1, ..., T})              # Random timestep
    ε ~ N(0, I)                           # Random noise
    x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε       # Noisy image
    Take gradient step on ‖ε - ε_θ(x_t, t)‖²
until converged

Sampling Algorithm
x_T ~ N(0, I)                             # Start from noise
for t = T, T-1, ..., 1:
    z ~ N(0, I) if t > 1, else z = 0
    x_{t-1} = (1/√α_t)(x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t, t)) + σ_t · z
return x_0

Noise Schedules
Linear Schedule
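Original DDPM: $\beta_t$ increases linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps:

$$\beta_t = \beta_1 + \frac{t-1}{T-1}\,(\beta_T - \beta_1)$$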
Cosine Schedule
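The cosine schedule (Nichol and Dhariwal, 2021) defines $\bar\alpha_t$ directly:

$$\bar\alpha_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right), \qquad \beta_t = 1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}$$

with a small offset $s = 0.008$. It produces a more gradual noise increase than the linear schedule, which destroys information too quickly near the end of the chain. The benefit was reported mainly at lower resolutions such as 32x32 and 64x64.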
import torch
import numpy as np

def cosine_schedule(T, s=0.008):
    """Cosine noise schedule; returns betas of shape (T,)."""
    steps = torch.linspace(0, T, T + 1)
    f_t = torch.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f_t / f_t[0]                     # normalize so alpha_bar[0] = 1
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]   # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return torch.clamp(betas, 0.0001, 0.999)

U-Net for Diffusion
The denoising network is typically a U-Net with:
- Time embedding: Sinusoidal + MLP, added to each block
- Residual blocks: ConvBlock with GroupNorm
- Self-attention: At lower resolutions (16x16, 8x8)
- Cross-attention: For conditioning (text, class)
import torch
import torch.nn as nn
import math

class SinusoidalTimeEmbedding(nn.Module):
    """Map integer timesteps of shape (B,) to sinusoidal features of shape (B, dim)."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
        args = t[:, None].float() * freqs[None]
        return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class ResBlock(nn.Module):
    """Residual block: two GroupNorm-SiLU-Conv stages with the time embedding added in between."""

    def __init__(self, in_ch, out_ch, time_dim):
        super().__init__()
        self.conv1 = nn.Sequential(
            # The group count must divide the channel count, so fall back to
            # a single group for 1- or 3-channel image inputs.
            nn.GroupNorm(8 if in_ch % 8 == 0 else 1, in_ch),
            nn.SiLU(),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
        )
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_dim, out_ch),
        )
        self.conv2 = nn.Sequential(
            nn.GroupNorm(8 if out_ch % 8 == 0 else 1, out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):
        h = self.conv1(x)
        h = h + self.time_mlp(t_emb)[:, :, None, None]  # broadcast over H and W
        h = self.conv2(h)
        return h + self.skip(x)

class SimpleDiffusionUNet(nn.Module):
    def __init__(self, in_ch=3, dim=64, time_dim=256):
        super().__init__()
        self.time_emb = nn.Sequential(
            SinusoidalTimeEmbedding(time_dim),
            nn.Linear(time_dim, time_dim),
            nn.SiLU(),
            nn.Linear(time_dim, time_dim),
        )
        # Encoder
        self.down1 = ResBlock(in_ch, dim, time_dim)
        self.down2 = ResBlock(dim, dim * 2, time_dim)
        self.pool = nn.MaxPool2d(2)
        # Bottleneck
        self.bot = ResBlock(dim * 2, dim * 2, time_dim)
        # Decoder (input channels include the concatenated skip connection)
        self.up2 = ResBlock(dim * 4, dim, time_dim)
        self.up1 = ResBlock(dim * 2, dim, time_dim)
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.out = nn.Conv2d(dim, in_ch, 1)

    def forward(self, x, t):
        # x: (B, in_ch, H, W) with H and W divisible by 4; t: (B,) integer timesteps
        t_emb = self.time_emb(t)
        d1 = self.down1(x, t_emb)
        d2 = self.down2(self.pool(d1), t_emb)
        b = self.bot(self.pool(d2), t_emb)
        u2 = self.up2(torch.cat([self.upsample(b), d2], dim=1), t_emb)
        u1 = self.up1(torch.cat([self.upsample(u2), d1], dim=1), t_emb)
        return self.out(u1)

Classifier-Free Guidance
Classifier-free guidance (Ho and Salimans, 2022) trades diversity for quality by amplifying the conditional signal.
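During training, randomly drop the conditioning (set it to the null condition $\varnothing$) with probability $p_{\text{uncond}}$, commonly around 0.1 to 0.2, so that one network learns both the conditional and the unconditional prediction.
During sampling, combine the conditional and unconditional predictions:

$$\tilde\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)$$

where $w$ is the guidance scale: $w = 0$ gives the unconditional prediction, $w = 1$ the plain conditional prediction, and $w > 1$ extrapolates beyond the conditional prediction to amplify the effect of the condition.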
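Both halves of classifier-free guidance fit in a few lines. The sketch below is a toy illustration with hypothetical names and made-up numbers: `drop_condition` stands for the training-time conditioning dropout, and `guided_noise` for the sampling-time rule "unconditional prediction plus guidance scale times (conditional minus unconditional)", applied elementwise to plain Python lists rather than tensors.

```python
import random

def drop_condition(cond, p_uncond=0.1, null_token=None, rng=random):
    """Training: replace the condition with the null token with probability p_uncond."""
    return null_token if rng.random() < p_uncond else cond

def guided_noise(eps_uncond, eps_cond, w):
    """Sampling: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Made-up noise predictions for a 3-element "image"
eps_u = [0.10, -0.20, 0.05]
eps_c = [0.30, -0.10, 0.00]

for w in (0.0, 1.0, 7.5):
    print(w, guided_noise(eps_u, eps_c, w))

# About 10% of training examples lose their prompt with p_uncond = 0.1
rng = random.Random(0)
dropped = sum(drop_condition("a cat", 0.1, rng=rng) is None for _ in range(10_000))
print(dropped / 10_000)
```

In a real sampler the two predictions come from two forward passes of the same network (or one batched pass), so guidance roughly doubles the cost of each step.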
Worked Example — Classifier-Free Guidance
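Setup: At one denoising step, the model produces two noise predictions (scalar for simplicity):
- Unconditional prediction: $\epsilon_u = \epsilon_\theta(x_t, \varnothing)$ (generic noise)
- Conditional prediction (prompt "a cat"): $\epsilon_c = \epsilon_\theta(x_t, c)$ (cat-specific noise)
Different guidance scales:
| $w$ | Guided prediction $\tilde\epsilon$ | Effect |
|---|---|---|
| 0 | $\epsilon_u$ | Ignore condition entirely |
| 1 | $\epsilon_c$ | Standard conditional (no amplification) |
| 3 | $\epsilon_u + 3(\epsilon_c - \epsilon_u)$ | Moderately amplified |
| 7.5 | $\epsilon_u + 7.5(\epsilon_c - \epsilon_u)$ | Strongly amplified (typical SD setting) |
| 15 | $\epsilon_u + 15(\epsilon_c - \epsilon_u)$ | Very strongly amplified (oversaturated) |
Result: At $w = 7.5$ the guided prediction lies 7.5 times as far from $\epsilon_u$ as $\epsilon_c$ does, in the same direction, so each denoising step is pushed strongly toward the prompt.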
Stable Diffusion Architecture
Stable Diffusion operates in a compressed latent space rather than pixel space, making it tractable:
Components
- VAE encoder: Compress images from $512 \times 512 \times 3$ pixel space to a $64 \times 64 \times 4$ latent space (8x spatial compression)
- U-Net: Predict noise in latent space (with cross-attention for text conditioning)
- CLIP text encoder: Convert text prompt to embeddings for cross-attention
- VAE decoder: Decompress denoised latent back to pixel space
Why Latent Space?
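Operating at $64 \times 64$ instead of $512 \times 512$ gives:
- $64\times$ fewer spatial positions (4,096 instead of 262,144)
- Self-attention cost of $O((64 \cdot 64)^2)$ instead of $O((512 \cdot 512)^2)$, about $4096\times$ cheaper (quadratic savings)
- Sampling and fine-tuning that fit on a single GPU such as an A100; pretraining the full model still takes a multi-GPU cluster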
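The savings are simple arithmetic. The sketch below assumes a 512x512 image and the 8x spatial compression listed under Components, which gives a 64x64 latent grid, and it counts one attention token per spatial position. The helper name is hypothetical.

```python
def attention_stats(height, width):
    """Return (number of tokens, number of token pairs) for full self-attention."""
    tokens = height * width
    return tokens, tokens * tokens

pixel_tokens, pixel_pairs = attention_stats(512, 512)
latent_tokens, latent_pairs = attention_stats(512 // 8, 512 // 8)

print(pixel_tokens, latent_tokens)         # 262144 4096
print(pixel_tokens // latent_tokens)       # 64
print(pixel_pairs // latent_pairs)         # 4096
```

The token count falls by 8 squared, and because self-attention compares every token with every other token, its cost falls by the square of that again.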
LoRA: Low-Rank Adaptation
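LoRA (Hu et al., 2022) fine-tunes large models by adding small trainable matrices:

$$W' = W + \Delta W = W + BA$$

where $W \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable factors, and the rank satisfies $r \ll \min(d, k)$.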
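A tiny pure-Python sketch makes the update concrete. It assumes the usual LoRA setup (frozen weight plus the product of a tall matrix `B` and a wide matrix `A`, with `B` initialized to zero so training starts from the pretrained behavior). The helper names and the 3x3 numbers are invented for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(W, B, A, scale=1.0):
    """Effective weight W + scale * (B @ A); W itself is never modified."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Frozen 3x3 weight with a rank-1 adapter (d = k = 3, r = 1)
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 1.0]]          # r x k
B_init = [[0.0], [0.0], [0.0]]  # d x r, zero at the start of training
B_trained = [[1.0], [2.0], [0.0]]

print(lora_weight(W, B_init, A))     # identical to W
print(lora_weight(W, B_trained, A))  # W plus a rank-1 correction
```

Only the 6 numbers in `A` and `B` would be trained here, versus 9 in `W`; the gap widens quickly as the layer grows while the rank stays small.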
Worked Example — LoRA Parameter Savings
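Setup: A linear layer $W \in \mathbb{R}^{d \times k}$.
Full fine-tuning: $d \cdot k$ trainable parameters.
LoRA with rank $r = 4$: $B \in \mathbb{R}^{d \times 4}$ and $A \in \mathbb{R}^{4 \times k}$.
LoRA parameters: $r(d + k) = 4(d + k)$.
Savings: the trainable fraction is $r(d + k) / (d \cdot k)$, which equals $2r/d$ for a square layer ($d = k$).
LoRA with rank $r = 16$: $16(d + k)$ parameters, four times as many, since the cost grows linearly with the rank.
Result: LoRA with $r \ll \min(d, k)$ trains only a small fraction of the layer's parameters while $W$ stays frozen.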
Parameter Savings
| Model | Full Fine-Tune | LoRA (r=4) | LoRA (r=16) |
|---|---|---|---|
| Stable Diffusion | 860M | 3.2M (0.4%) | 12.8M (1.5%) |
| LLaMA 7B | 7B | 4.2M (0.06%) | 16.8M (0.24%) |
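The LLaMA 7B row can be reproduced with a short calculation, under assumptions that are not stated in the table: 32 transformer layers, hidden size 4096, and four square attention projections adapted per layer. The function names are hypothetical.

```python
def lora_params(d, k, r):
    """Trainable parameters of one adapter: B is d x r, A is r x k."""
    return r * (d + k)

def lora_total(hidden, layers, modules_per_layer, r):
    """Total adapter parameters when every adapted module is hidden x hidden."""
    return layers * modules_per_layer * lora_params(hidden, hidden, r)

# Assumed LLaMA-7B-like shape: 32 layers, hidden size 4096, 4 adapted projections
for r in (4, 16):
    n = lora_total(hidden=4096, layers=32, modules_per_layer=4, r=r)
    print(f"r={r:2d}  params={n:,}  share of 7B={100 * n / 7e9:.2f}%")
```

Adapting fewer modules scales the count down proportionally; with two projections per layer the totals are half of those shown.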
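# Using PEFT library
from peft import LoraConfig, get_peft_model

# base_model is assumed to be an already-loaded model whose attention
# projections are named "q_proj" and "v_proj" (as in LLaMA-style models).
lora_config = LoraConfig(
    r=16,                                 # Rank
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# With only q_proj and v_proj adapted on a 7B LLaMA-style model (32 layers,
# hidden size 4096), this reports about 8.4M trainable parameters (roughly 0.12%).
# The 16.8M (0.24%) figure in the table matches four adapted projections per layer.

ControlNet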
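ControlNet (Zhang et al., 2023) adds spatial conditioning (edges, depth, pose) to Stable Diffusion by creating a trainable copy of the encoder blocks:

$$y_c = \mathcal{F}(x; \Theta) + \mathcal{Z}\big(\mathcal{F}(x + \mathcal{Z}(c; \Theta_{z1});\ \Theta_c);\ \Theta_{z2}\big)$$

where $\mathcal{F}(\cdot\,; \Theta)$ is the frozen block, $\Theta_c$ is its trainable copy, $c$ is the conditioning input, and $\mathcal{Z}$ is a zero convolution (a $1 \times 1$ convolution whose weights and bias are initialized to zero), so training starts from the unmodified model.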
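The role of the zero-initialized layers can be shown with a toy model. The sketch below replaces the encoder block with a simple elementwise function and the 1x1 convolution with a single scalar weight and bias; every name and number here is a stand-in for illustration, not part of any ControlNet implementation.

```python
def frozen_block(x):
    """Stand-in for the frozen encoder block."""
    return [2.0 * v + 1.0 for v in x]

def trainable_copy(x):
    """Stand-in for the trainable copy; it starts as the same function."""
    return [2.0 * v + 1.0 for v in x]

def zero_conv(x, weight=0.0, bias=0.0):
    """A single-channel 1x1 convolution reduces to weight * x + bias."""
    return [weight * v + bias for v in x]

def controlnet_block(x, cond, w_in=0.0, w_out=0.0):
    """Frozen output plus the zero-conv-gated output of the trainable copy."""
    injected = [a + b for a, b in zip(x, zero_conv(cond, w_in))]
    residual = zero_conv(trainable_copy(injected), w_out)
    return [a + b for a, b in zip(frozen_block(x), residual)]

x = [0.5, -1.0, 2.0]
cond = [1.0, 1.0, 0.0]   # e.g. an edge map

print(controlnet_block(x, cond))                        # same as frozen_block(x)
print(controlnet_block(x, cond, w_in=0.3, w_out=0.1))   # condition now has an effect
```

At initialization the condition contributes nothing, so the first training steps cannot damage the pretrained model; the gates open gradually as their weights move away from zero.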
Conditioning Types
| Condition | Source | Use Case |
|---|---|---|
| Canny edges | Edge detection | Preserve structure |
| Depth map | MiDaS | 3D-aware generation |
| OpenPose | Pose estimation | Character poses |
| Segmentation | Semantic map | Scene layout |
| Scribble | User drawing | Guided creation |
DDIM: Faster Sampling
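DDPM requires $T = 1000$ sequential denoising steps per sample. DDIM (Song et al., 2021) reuses the same trained model with a non-Markovian update that can skip timesteps:

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\ \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t-1}}\ \epsilon_\theta(x_t, t)$$

The fraction is the model's current estimate of $x_0$. This is deterministic ($\eta = 0$, no fresh noise is injected) and gives good samples in far fewer steps, typically 20 to 50, by running over a subsequence of the timesteps.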
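A scalar sketch of the deterministic DDIM update follows. It assumes the common formulation "estimate $x_0$ from the predicted noise, then re-noise that estimate to the earlier timestep with the same predicted noise". The function names, the evenly spaced timestep rule, and the numbers are illustrative choices, not the only option.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (eta = 0) from alpha_bar_t to alpha_bar_prev."""
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred

def strided_timesteps(T, num_steps):
    """Evenly spaced, descending subsequence of 0..T-1 (one simple choice)."""
    stride = T // num_steps
    return list(range(T - 1, -1, -stride))[:num_steps]

# With a perfect noise prediction, one step lands exactly on the forward
# marginal at the earlier timestep, however large the jump is.
x0, eps = 0.7, 0.2
alpha_bar_t, alpha_bar_prev = 0.3, 0.5
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)

print(x_prev)
print(math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * eps)
print(strided_timesteps(1000, 20)[:3])
```

Because no random noise enters the update, the same starting latent always produces the same image, which is also what makes seed-based reproducibility work.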
Minimal DDPM Training Loop
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# ── Noise schedule ───────────────────────────────────────────────────
def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bar

T = 1000
betas, alphas, alpha_bar = linear_schedule(T)

# ── Forward diffusion ────────────────────────────────────────────────
def q_sample(x_0, t, noise=None):
    """Add noise to x_0 at timestep t (t is a batch of 0-indexed timesteps)."""
    if noise is None:
        noise = torch.randn_like(x_0)
    # Reshape to (B, 1, 1, 1) so the coefficients broadcast over C, H, W
    sqrt_alpha_bar = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_alpha_bar * x_0 + sqrt_one_minus * noise, noise

# ── Data ─────────────────────────────────────────────────────────────
transform = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # scale pixels to [-1, 1]
])
dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
loader = DataLoader(dataset, batch_size=128, shuffle=True)
# ── Training ─────────────────────────────────────────────────────────
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SimpleDiffusionUNet(in_ch=1, dim=64).to(device)
alpha_bar = alpha_bar.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

for epoch in range(50):
    total_loss = 0.0
    for images, _ in loader:
        images = images.to(device)
        # One uniformly random (0-indexed) timestep per image
        t = torch.randint(0, T, (images.size(0),), device=device)
        noise = torch.randn_like(images)
        x_t, _ = q_sample(images, t, noise)
        predicted_noise = model(x_t, t)
        loss = nn.functional.mse_loss(predicted_noise, noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}: Loss={total_loss / len(loader):.4f}")

# ── Sampling ─────────────────────────────────────────────────────────
@torch.no_grad()
def p_sample(model, x_t, t):
    """Single reverse step."""
    beta_t = betas[t].to(device)
    alpha_t = alphas[t].to(device)
    alpha_bar_t = alpha_bar[t].to(device)
    eps = model(x_t, torch.tensor([t], device=device).expand(x_t.size(0)))
    mean = (1 / alpha_t.sqrt()) * (x_t - beta_t / (1 - alpha_bar_t).sqrt() * eps)
    if t > 0:
        noise = torch.randn_like(x_t)
        return mean + beta_t.sqrt() * noise  # sigma_t^2 = beta_t
    return mean  # no noise on the final step

@torch.no_grad()
def sample(model, shape=(16, 1, 32, 32)):
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        x = p_sample(model, x, t)
    return x

samples = sample(model)  # values are roughly in [-1, 1], matching the training normalization

Diffusion vs GANs vs VAEs
| Feature | Diffusion | GAN | VAE |
|---|---|---|---|
| Training stability | Excellent | Poor | Good |
| Sample quality | Excellent | Excellent | Blurry |
| Diversity | Excellent | Mode collapse risk | Good |
| Sampling speed | Slow (many steps) | Fast (single pass) | Fast (single pass) |
| Likelihood | Tractable bound | None | Tractable bound (ELBO) |
| Controllability | Excellent (guidance) | Moderate | Moderate |
| Current SOTA (2026) | Yes | No | No |
Practical Tips for Diffusion Models
- Start with pretrained models: Do not train Stable Diffusion from scratch. Use LoRA or DreamBooth for customization.
- Guidance scale matters: $w = 7.5$ is a good default. Higher values follow the prompt more closely but reduce diversity and can oversaturate the image.
- Schedulers affect quality: DDIM for speed (20 steps), DPM-Solver++ for quality (20-50 steps), Euler for balance.
- Negative prompts help: "blurry, low quality, distorted" in the negative prompt improves output.
- Seed control: Set seeds for reproducibility. Same prompt + seed + settings = same image.
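The seed tip comes down to the starting noise being a deterministic function of the seed. The sketch below illustrates this with the standard library only; `initial_noise` is a hypothetical stand-in for the latent that a real pipeline would draw from its own random generator.

```python
import random

def initial_noise(seed, n=4):
    """Draw n standard-normal values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

same_a = initial_noise(42)
same_b = initial_noise(42)
other = initial_noise(43)

print(same_a == same_b)   # True: same seed, same starting noise
print(same_a == other)    # False: different seed, different starting noise
```

With a deterministic sampler and identical settings, the same starting noise leads to the same image, which is why recording the seed is enough to reproduce a result.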
Cross-References
- Autoencoders: Autoencoders --- VAE for latent space
- GANs: GANs --- previous state-of-the-art for generation
- U-Net: Image Segmentation --- U-Net architecture
- Text conditioning: Transformers --- cross-attention
- CLIP: Multimodal Models --- text encoder
- Efficiency: Model Optimization --- LoRA, quantization