The standard Variational Autoencoder generation process can't be controlled: we give no input into the generation. This is because the encoder models $q_\phi(z \mid x)$ directly on $x$ without considering any external information, and our prior only models $p(z)$, the latent distribution.

To remedy this, we can introduce a conditioning term $c$ (a label or prompt, for example) to the encoder and decoder to get $q_\phi(z \mid x, c)$ and $p_\theta(x \mid z, c)$ respectively. Then, our generation can be written as

$$p_\theta(x \mid c) = \int p_\theta(x \mid z, c)\, p_\theta(z \mid c)\, dz,$$

that is, we sample $z \sim p_\theta(z \mid c)$ and then $x \sim p_\theta(x \mid z, c)$,

and we can optimize the modified ELBO

$$\mathcal{L}_{\text{CVAE}}(x, c) = \mathbb{E}_{q_\phi(z \mid x, c)}\!\left[\log p_\theta(x \mid z, c)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\right).$$
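As a concrete toy sketch, here is this conditional ELBO computed for a single example. The linear "network" weights, the dimensions, the unit-variance Gaussians, and the fixed-variance Gaussian decoder are all assumptions for illustration, not part of the original formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, COND, DATA = 4, 3, 8

# Made-up linear weights; a real CVAE would learn these with neural networks.
W_enc = rng.normal(size=(LATENT, DATA + COND)) * 0.1   # encoder q(z | x, c)
W_pri = rng.normal(size=(LATENT, COND)) * 0.1          # conditional prior p(z | c)
W_dec = rng.normal(size=(DATA, LATENT + COND)) * 0.1   # decoder p(x | z, c)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def elbo(x, c):
    # Encoder: q(z | x, c) conditions on both the data and the conditioning term.
    mu_q = W_enc @ np.concatenate([x, c])
    logvar_q = np.zeros(LATENT)            # unit variance, for simplicity
    # Conditional prior: p(z | c).
    mu_p = W_pri @ c
    logvar_p = np.zeros(LATENT)
    # One reparameterized sample z ~ q(z | x, c).
    z = mu_q + np.exp(0.5 * logvar_q) * rng.normal(size=LATENT)
    # Decoder mean of p(x | z, c), with a fixed-variance Gaussian likelihood.
    x_mean = W_dec @ np.concatenate([z, c])
    log_lik = -0.5 * np.sum((x - x_mean) ** 2 + np.log(2 * np.pi))
    # Reconstruction term minus the KL term.
    return log_lik - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

x = rng.normal(size=DATA)
c = np.array([1.0, 0.0, 0.0])              # e.g. a one-hot class label
print(float(elbo(x, c)))                   # single-sample Monte Carlo ELBO estimate
```

Note that both the encoder and the decoder receive $c$, and the KL term pulls the recognition distribution toward the conditional prior rather than a fixed $\mathcal{N}(0, I)$.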

GSNN

One drawback of the formulation above is that the recognition network $q_\phi(z \mid x, c)$ is used during training, whereas the prior $p_\theta(z \mid c)$ is used during generation. Though the KL term closes the gap between the two, the difference can still be significant empirically.

Alternatively, we can set them to be equal, $q_\phi(z \mid x, c) = p_\theta(z \mid c)$, which gives us the Gaussian stochastic neural network (GSNN). The loss for training this CVAE variant is just the first (reconstruction) term of the ELBO, since the KL divergence is $0$.
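A minimal sketch of the GSNN objective under the same toy assumptions (made-up linear prior and decoder, fixed-variance Gaussian likelihood): the latent is drawn from the conditional prior even during training, so only the reconstruction term remains.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, COND, DATA = 4, 3, 8

# Made-up prior and decoder weights (assumptions for illustration).
W_pri = rng.normal(size=(LATENT, COND)) * 0.1          # conditional prior p(z | c)
W_dec = rng.normal(size=(DATA, LATENT + COND)) * 0.1   # decoder p(x | z, c)

def gsnn_loss(x, c, n_samples=10):
    """GSNN training loss: negative reconstruction log-likelihood (up to a
    constant), averaged over samples z ~ p(z | c). The KL term vanishes
    because the recognition network is tied to the conditional prior."""
    total = 0.0
    for _ in range(n_samples):
        z = W_pri @ c + rng.normal(size=LATENT)         # z ~ p(z | c), no encoder
        x_mean = W_dec @ np.concatenate([z, c])
        total += 0.5 * np.sum((x - x_mean) ** 2)        # Gaussian NLL, up to a constant
    return total / n_samples

x = rng.normal(size=DATA)
c = np.array([0.0, 1.0, 0.0])
print(float(gsnn_loss(x, c)))
```

Because the same sampling path $z \sim p_\theta(z \mid c)$ is used at both training and generation time, the train/generation mismatch described above disappears.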