StyleGAN improves on the generator design of a standard 🖼️ Generative Adversarial Network by incorporating ideas from style transfer. Specifically, we introduce an intermediate latent space $\mathcal{W}$ whose vectors $w$ we can interpret as style codes. The motivation behind this change is that by injecting $w$ into the generative process multiple times, we gain finer control over the style of the image than we would by only sampling a seed at the start of generation.

To create this intermediate space, we use two models (trained end-to-end): the mapping network $f$ maps a latent code $z \in \mathcal{Z}$ to $w \in \mathcal{W}$, and the synthesis network $g$ generates the image from $w$.
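As a concrete sketch, the mapping network $f$ is just a multilayer perceptron (eight fully connected layers in the paper). Here is a minimal PyTorch version, assuming the paper's 512-dimensional $z$ and $w$ and omitting details like equalized learning rates and input normalization:

```python
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Sketch of the mapping network f: z -> w (an 8-layer MLP)."""

    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers, in_dim = [], z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # z: (batch, z_dim) -> w: (batch, w_dim)
        return self.net(z)
```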

Notably, the synthesis network doesn't take $w$ directly as input. Rather, it starts from a learned constant input and injects $w$ at multiple stages via adaptive instance normalization (AdaIN). AdaIN first normalizes its input, then manipulates it by a scaling factor $y_s$ and a bias factor $y_b$ that we derive from a learned affine transformation of our style code $w$. Formally, for a feature map $x_i$,

$$\mathrm{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i},$$

where $y = (y_s, y_b)$ is computed from $w$.

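A minimal PyTorch sketch of this operation; the `affine` layer stands in for the paper's learned affine transformation, and the small epsilon is an assumed numerical-stability detail:

```python
import torch.nn as nn

class AdaIN(nn.Module):
    """Sketch of adaptive instance normalization."""

    def __init__(self, w_dim, num_channels):
        super().__init__()
        # Learned affine transform of w -> per-channel (scale, bias)
        self.affine = nn.Linear(w_dim, num_channels * 2)

    def forward(self, x, w):
        # x: (batch, channels, H, W); w: (batch, w_dim)
        y_s, y_b = self.affine(w).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        # Normalize each feature map with its own statistics...
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        # ...then re-scale and re-shift with the style-derived factors.
        return y_s * (x - mu) / sigma + y_b
```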
We also inject per-pixel Gaussian noise maps into the synthesis network, each scaled by a learned per-channel factor, which control stochastic image details like the curls in hair.
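A sketch of that noise injection, assuming (as in the paper) a single-channel Gaussian noise image broadcast to all feature maps, with one learned scaling factor per channel:

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Sketch of StyleGAN-style per-pixel noise injection."""

    def __init__(self, num_channels):
        super().__init__()
        # One learned scaling factor per feature map (channel)
        self.scale = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):
        # Single-channel noise image, broadcast across all channels
        noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3],
                            device=x.device)
        return x + self.scale[None, :, None, None] * noise
```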

Disentanglement

The core contribution of this architecture is successful disentanglement between stochastic features (via noise injections) and style (via style injections). Moreover, the network also disentangles styles at different scales: since AdaIN normalizes its input before applying a style, previously applied styles have no influence on later injections.

For example, in the picture above, we can take the latent codes that generated source A and source B, then mix their styles together, as in the sketch below. Coarse, middle, and fine styles (injected early to late in the network) each capture a different range of scales, from head pose and age down to hair color.
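One way to express this style mixing, assuming a synthesis network that accepts one style code per injection layer; `num_layers`, the crossover point, and the `synthesis_network` call are illustrative choices, not values fixed by the paper:

```python
def mix_styles(w_a, w_b, num_layers=18, crossover=8):
    """Per-layer style codes: source A's w for the early (coarse)
    layers, source B's w for the remaining (finer) layers."""
    return [w_a if i < crossover else w_b for i in range(num_layers)]

# styles = mix_styles(w_a, w_b)
# image = synthesis_network(styles)  # hypothetical call
```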

Another key disentanglement is between different style dimensions, such as age and gender. The authors hypothesize that if our data distribution isn't uniform (missing some combinations of dimensions), the mapping from the latent code to features will be forced to warp, causing entanglement. This problem is illustrated below, where we try to contort the first, L-shaped distribution into the circular uniform distribution.

If we use an intermediate $\mathcal{W}$ instead, the mapping network can learn to "un-warp" this distortion. Crucially, since the mapping to $\mathcal{W}$ is learned, $w$ can follow an arbitrary distribution rather than a fixed one (as with $\mathcal{Z}$). Thus, we can largely avoid the entanglement effects that come from contorting to a fixed circle.
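To make the distinction concrete: $z$ is always drawn from a fixed distribution, while the distribution of $w$ is whatever the trained mapping network produces. A small sketch, reusing the `MappingNetwork` from earlier:

```python
import torch

mapping = MappingNetwork()  # the sketch defined above

# z comes from a fixed distribution (a standard Gaussian here)...
z = torch.randn(16, 512)

# ...but w = f(z) follows whatever distribution the mapping network
# learned, so W is free to "un-warp" gaps in the training data.
w = mapping(z)
```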