Abstract

Diffusion gradually breaks down data over multiple time steps and trains a model to iteratively reconstruct it. In doing so, the model learns to transition samples from pure noise back to the data distribution.

Theory

Diffusion iteratively destroys data with Gaussian noise and learns to reverse time and recreate the original image. If we successfully learn such a model, then we can generate new images from random noise. This idea is closely related to 🧨 Score SDEs.

Forward Process

Diffusion models destroy the training distribution of images by repeatedly applying Gaussian noise for $T$ time-steps, known as the forward process. $x_t$ represents the distribution after applying noise for $t$ steps; $x_0$ is the original image, and $x_T$ is isotropic noise. The forward process equation is

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),$$

where $\beta_t$ is controlled by a variance schedule, which dictates how much noise to add at every time step $t$. $\beta_t$ usually increases as $t$ increases.
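As a concrete sketch (not from any particular library; `T` and the linear `betas` schedule below are illustrative assumptions), one forward step can be implemented as:

```python
import torch

T = 1000
# Hypothetical linear variance schedule: beta_t grows from 1e-4 to 0.02.
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    beta_t = betas[t]  # t is a scalar step index here
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1 - beta_t) * x_prev + torch.sqrt(beta_t) * noise
```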

Reverse Process

Our goal is to learn the reverse process $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$, which is also a Gaussian. The two processes are illustrated below.

Optimization

To find optimal parameters for $p_\theta$, we optimize the log likelihood of $x_0$ using the 🧬 Evidence Lower Bound. After a series of derivations, we can express the ELBO with multiple terms:

$$\log p(x_0) \geq \underbrace{\mathbb{E}_{q(x_1 \mid x_0)}\!\left[\log p_\theta(x_0 \mid x_1)\right]}_{\text{reconstruction}} - \underbrace{D_{\mathrm{KL}}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)}_{\text{prior matching}} - \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)}\!\left[\underbrace{D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)}_{\text{de-noising matching}}\right]$$

Each term has its own function.

  1. The first is a reconstruction term for getting back the original image $x_0$ from the first latent $x_1$.
  2. The second is the distance between our final noisy latent $q(x_T \mid x_0)$ and the standard Gaussian prior $p(x_T)$.
  3. The third is a de-noising matching term $D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)$ within each time step.

De-Noising Objective

The third term contains $q(x_{t-1} \mid x_t, x_0)$, which must be derived from the forward process $q(x_t \mid x_{t-1})$ using Bayes' rule:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}.$$

First, we'll find $q(x_t \mid x_0)$. Let $\alpha_t = 1 - \beta_t$ and all $\epsilon_t \sim \mathcal{N}(0, I)$, and observe that $x_t$ can be expressed in terms of $x_{t-2}$:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1} = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\, \epsilon_{t-2} + \sqrt{1-\alpha_t}\, \epsilon_{t-1}.$$

Info

Note that to combine the two Gaussians $\sqrt{\alpha_t(1-\alpha_{t-1})}\, \epsilon_{t-2}$ and $\sqrt{1-\alpha_t}\, \epsilon_{t-1}$, our new Gaussian's variance is the sum of the two variances. In other words, the combination of $\mathcal{N}(0,\ \alpha_t(1-\alpha_{t-1}) I)$ and $\mathcal{N}(0,\ (1-\alpha_t) I)$ is $\mathcal{N}(0,\ (1-\alpha_t \alpha_{t-1}) I)$.

Applying this pattern all the way down, we get

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon \quad\text{for}\quad \epsilon \sim \mathcal{N}(0, I),$$

where $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. Equivalently, $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t) I\right)$.
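In code, this closed form lets us jump straight from $x_0$ to any $x_t$ without looping. A minimal sketch, reusing the hypothetical `betas` schedule above and assuming flattened image batches of shape `(batch, dim)`:

```python
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # bar{alpha}_t = product of alphas up to t

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in one shot via the reparameterization trick.

    x0: (batch, dim) flattened images; t: (batch,) integer steps in [0, T).
    """
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].unsqueeze(-1)  # (batch, 1) for broadcasting
    return torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * eps, eps
```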

Then, plugging this into Bayes' rule above, we get

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right),$$

where $\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, x_t + \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1-\bar\alpha_t}\, x_0$ and $\tilde\beta_t = \frac{(1-\bar\alpha_{t-1})\, \beta_t}{1-\bar\alpha_t}$.

Plugging this back into the KL term, we get

$$\arg\min_\theta D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right) = \arg\min_\theta \frac{1}{2\sigma_q^2(t)} \left\| \mu_\theta(x_t, t) - \tilde\mu_t(x_t, x_0) \right\|_2^2,$$

where $\sigma_q^2(t) = \tilde\beta_t$ is the scalar variance from our Gaussian above. Thus, we have shown that our objective is to predict the mean $\tilde\mu_t$ of the de-noising transition.

However, we can also replace $\mu_\theta$ with an expression in terms of a predicted image $\hat{x}_\theta(x_t, t)$ to get

$$\arg\min_\theta \frac{1}{2\sigma_q^2(t)} \frac{\bar\alpha_{t-1}\, \beta_t^2}{(1-\bar\alpha_t)^2} \left\| \hat{x}_\theta(x_t, t) - x_0 \right\|_2^2,$$

which says we can also predict the original image directly. Finally, we can express $x_0$ in terms of $x_t$ and $\epsilon$ to get

$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon\right), \qquad \arg\min_\theta \frac{1}{2\sigma_q^2(t)} \frac{(1-\alpha_t)^2}{(1-\bar\alpha_t)\, \alpha_t} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2.$$

Thus, predicting the original image $x_0$, the noise $\epsilon$, and the mean $\tilde\mu_t$ (assuming we set the variance constant) are all equivalent, but the second option, predicting $\epsilon$, generally works best in empirical studies.
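Because $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$, the parameterizations are interchangeable; for instance, a noise prediction converts to an image prediction by inverting that equation. A sketch using the hypothetical helpers above:

```python
def eps_to_x0(x_t, eps_pred, t):
    """Recover the predicted clean image x0_hat from a predicted noise eps_pred."""
    ab = alpha_bars[t].unsqueeze(-1)
    return (x_t - torch.sqrt(1 - ab) * eps_pred) / torch.sqrt(ab)
```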

Model

The model itself takes in $x_t$ and the time-step $t$ to predict noise $\epsilon_\theta(x_t, t)$. This is commonly done using a 👁️ Convolutional Neural Network or 🦿 Vision Transformer.
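Any network mapping $(x_t, t)$ to a tensor with the same shape as $x_t$ can play the role of $\epsilon_\theta$. Below is a deliberately tiny stand-in, for illustration only; a real implementation would use a U-Net or ViT with proper time embeddings:

```python
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Toy epsilon-predictor over flattened images; not a realistic architecture."""
    def __init__(self, dim=784, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t):
        # Condition on the time step by appending the normalized scalar t / T.
        t_feat = (t.float() / T).unsqueeze(-1)  # (batch, 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))
```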

Training

We train our model to minimize a simplified loss defined as

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t\right) \right\|^2\right],$$

where $\epsilon \sim \mathcal{N}(0, I)$, and $t$ is uniformly chosen from $\{1, \ldots, T\}$ at each gradient descent step.
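Putting the pieces together, one gradient step under the assumptions of the earlier sketches might look like:

```python
import torch.nn.functional as F

def training_step(model, optimizer, x0):
    """One step of gradient descent on the simplified loss."""
    t = torch.randint(0, T, (x0.shape[0],))  # t ~ Uniform over time steps
    x_t, eps = q_sample(x0, t)               # noise x0 to step t in one shot
    loss = F.mse_loss(model(x_t, t), eps)    # || eps - eps_theta(x_t, t) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full loop, this would be called once per mini-batch of training images.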

Prediction

To sample an image, we first generate random noise $x_T \sim \mathcal{N}(0, I)$. Then, for $t = T, \ldots, 1$,

  1. Let $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$.
  2. Move one step back in time with

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z,$$

where $\sigma_t^2$ can be either $\beta_t$ or $\tilde\beta_t$ as defined above; empirical tests show little difference in results.

Note

The formula in the second step is derived from the equation for $\tilde\mu_t$ in terms of $x_t$ and $\epsilon$. Specifically, we set

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right).$$

Finally, $x_0$ is our generated image.
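For completeness, a sketch of the whole sampling loop under the same assumptions as the earlier snippets, taking $\sigma_t^2 = \beta_t$:

```python
@torch.no_grad()
def sample(model, shape=(16, 784)):
    """Run the reverse process from x_T ~ N(0, I) down to x_0."""
    x = torch.randn(shape)
    for step in reversed(range(T)):
        t = torch.full((shape[0],), step)
        z = torch.randn_like(x) if step > 0 else torch.zeros_like(x)
        eps_pred = model(x, t)
        alpha_t, ab = alphas[step], alpha_bars[step]
        mu = (x - (1 - alpha_t) / torch.sqrt(1 - ab) * eps_pred) / torch.sqrt(alpha_t)
        x = mu + torch.sqrt(betas[step]) * z  # sigma_t^2 = beta_t here
    return x  # the generated images x_0
```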