Variational autoencoders are generative models that define an intractable density function using a latent variable $z$:

$$p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz$$

Using a structure similar to 🧬 Autoencoders, we use an encoder and a decoder to map $x$ to $z$ and back. However, we use a stochastic process instead of a deterministic one.

Optimization

We choose the prior $p_\theta(z)$ to be a simple distribution like a Gaussian and use our decoder $p_\theta(x \mid z)$ to generate images from samples of the latent space. However, we cannot optimize the likelihood $p_\theta(x)$ directly since integrating over the latent space is intractable. The solution is to use an encoder network $q_\phi(z \mid x)$ that approximates the true posterior $p_\theta(z \mid x)$. In doing so, we can use the 🧬 Evidence Lower Bound,

$$\log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big).$$

The first term can be estimated via 🤔 Monte Carlo Sampling, and the second can be computed analytically since it's the KL divergence between two Gaussians. Thus, we optimize the ELBO as a proxy objective, made up of a reconstruction term and a prior regularization term.
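
To make the analytic term concrete, assuming a diagonal Gaussian posterior $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and the standard normal prior $\mathcal{N}(0, I)$ used for regularization below, the KL divergence has the closed form

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{d} \big( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \big),$$

which can be evaluated directly from the encoder's outputs without any sampling.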

Implementation

Concretely, we implement the network as two 👁️ Convolutional Neural Networks. The first predicts $\mu_{z \mid x}$ and $\Sigma_{z \mid x}$ from $x$, and we sample $z \sim \mathcal{N}(\mu_{z \mid x}, \Sigma_{z \mid x})$. The second takes $z$ and predicts $\mu_{x \mid z}$ and $\Sigma_{x \mid z}$, then samples $\hat{x} \sim \mathcal{N}(\mu_{x \mid z}, \Sigma_{x \mid z})$. To backpropagate through the sampling of $z$, we use the 🪄 Reparameterization Trick.
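
Below is a minimal PyTorch sketch of this setup, assuming $1 \times 28 \times 28$ inputs, a diagonal Gaussian posterior parameterized by a log-variance, and a decoder that outputs only the mean image rather than a full $\Sigma_{x \mid z}$; the layer sizes and `latent_dim` are illustrative choices, not part of the note.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Sketch of a convolutional VAE for 1x28x28 images (e.g. MNIST)."""

    def __init__(self, latent_dim=16):
        super().__init__()
        # Encoder CNN: maps x to features used to predict mu_{z|x} and log sigma^2_{z|x}.
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 7 * 7, latent_dim)
        self.fc_logvar = nn.Linear(64 * 7 * 7, latent_dim)
        # Decoder CNN: maps a latent sample z back to an image (the mean of p(x|z)).
        self.fc_dec = nn.Linear(latent_dim, 64 * 7 * 7)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow through mu and logvar while eps carries the randomness.
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        x_hat = self.dec(self.fc_dec(z).view(-1, 64, 7, 7))
        return x_hat, mu, logvar
```

The stochastic step lives entirely in `reparameterize`, which is why the rest of the network stays differentiable.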

Intuition

We use a probability distribution, rather than a single deterministic point, to force the decoder to recognize that small changes in the sampled latent vector should result in only minor changes to the final image. For example, a single training image can produce multiple slightly different sampled latent vectors, and the decoder must learn that they all correspond to the same image.

However, with multiple training examples, it's still possible for the encoder to generate completely different distributions with low variance and different means, essentially acting as a normal autoencoder. We can mitigate this by regularizing the probability distributions in our loss function, forcing them all to be similar to the standard normal $\mathcal{N}(0, I)$.
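
As a sketch of how this regularization enters the loss, here is a possible negative-ELBO objective for the `ConvVAE` above, assuming pixel values in $[0, 1]$ and a binary cross-entropy reconstruction term (the choice of reconstruction likelihood is an assumption, not specified in this note):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO: reconstruction term plus KL regularization toward N(0, I)."""
    # Reconstruction term: how well the decoded image matches the input.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Analytic KL divergence between N(mu, sigma^2) and N(0, I):
    # 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1), with logvar = log sigma^2.
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
    return recon + kl

# Usage: x_hat, mu, logvar = model(x); loss = vae_loss(x, x_hat, mu, logvar)
```

Minimizing the KL term is what pulls each per-example distribution toward $\mathcal{N}(0, I)$.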

By regularizing the distributions, we force the latent space to be smooth, so that any chosen point in the space can be decoded into a meaningful image, as shown in the third graph below.