Feature-wise linear modulation (FiLM) is a general technique for influencing a neural network's output via some external conditioning input. A FiLM layer injects this conditioning into the network's intermediate activations by applying a conditioning-dependent affine transformation to the features.

Formally, FiLM learns a "generator" consisting of two functions $f$ and $h$ that, for a conditioning input $x_i$, output

$$\gamma_{i,c} = f_c(x_i), \qquad \beta_{i,c} = h_c(x_i)$$

Then, for a network's activations $F_{i,c}$, the FiLM layer performs

$$\mathrm{FiLM}(F_{i,c} \mid \gamma_{i,c}, \beta_{i,c}) = \gamma_{i,c} F_{i,c} + \beta_{i,c}$$

$\gamma_{i,c}$ and $\beta_{i,c}$ are generally unique to each feature map (in the case of CNNs) or feature, so the conditioning can influence the activations with great flexibility.

Notably, empirical results show that FiLM doesn't require $F$ to be normalized pre-transformation. FiLM thus generalizes all prior Conditional Normalization approaches under this simple framework.
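As a concrete illustration, here is a minimal PyTorch sketch of a FiLM layer and its generator. The module and parameter names (`FiLM`, `FiLMGenerator`, `cond_dim`) are my own assumptions for the sketch, not names from the paper:

```python
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise affine modulation: gamma * F + beta, one (gamma, beta) per feature map."""

    def forward(self, features, gamma, beta):
        # features: (batch, channels, H, W); gamma, beta: (batch, channels)
        # Broadcast the per-channel parameters over the spatial dimensions.
        return gamma[:, :, None, None] * features + beta[:, :, None, None]


class FiLMGenerator(nn.Module):
    """Maps a conditioning vector to (gamma, beta) for one FiLM layer."""

    def __init__(self, cond_dim, num_features):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * num_features)

    def forward(self, cond):
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma, beta


# Usage: modulate CNN activations with an external conditioning vector.
film, gen = FiLM(), FiLMGenerator(cond_dim=128, num_features=64)
x = torch.randn(8, 64, 14, 14)   # intermediate CNN activations
cond = torch.randn(8, 128)       # e.g. a question embedding
gamma, beta = gen(cond)
out = film(x, gamma, beta)       # same shape as x: (8, 64, 14, 14)
```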

VQA Model

FiLM is especially effective on vision-language tasks because it provides a natural way to fuse multimodal inputs. Specifically, for visual question answering (VQA) problems, we have a 👁️ Convolutional Neural Network visual pipeline with FiLM layers and a ⛩️ Gated Recurrent Unit generator. The GRU processes the question semantics, and injecting this information into the CNN lets us make accurate answer predictions, as in the sketch below.
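Below is a hypothetical end-to-end sketch of such a pipeline, assuming PyTorch: a GRU encodes the question into a final hidden state, a linear layer maps that state to per-block $(\gamma, \beta)$ pairs, and each residual block is modulated by them. All layer sizes and names are illustrative assumptions; the paper's actual architecture (FiLM-ed residual blocks atop a pretrained feature extractor) differs in detail.

```python
import torch
import torch.nn as nn


class FiLMedResBlock(nn.Module):
    """A residual CNN block whose activations are modulated by FiLM."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # affine=False: the FiLM parameters supply the scale and shift instead.
        self.bn = nn.BatchNorm2d(channels, affine=False)

    def forward(self, x, gamma, beta):
        out = torch.relu(self.conv1(x))
        out = self.bn(self.conv2(out))
        out = gamma[:, :, None, None] * out + beta[:, :, None, None]  # FiLM
        return torch.relu(out) + x


class VQAModel(nn.Module):
    """GRU question encoder generates FiLM parameters for each CNN block."""

    def __init__(self, vocab_size, embed_dim=200, hidden_dim=512,
                 channels=128, num_blocks=4, num_answers=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # One (gamma, beta) pair per block, all from the final GRU state.
        self.film_gen = nn.Linear(hidden_dim, num_blocks * 2 * channels)
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            [FiLMedResBlock(channels) for _ in range(num_blocks)])
        self.head = nn.Linear(channels, num_answers)

    def forward(self, image, question_tokens):
        _, h = self.gru(self.embed(question_tokens))  # h: (1, batch, hidden)
        params = self.film_gen(h.squeeze(0))
        params = params.view(image.size(0), len(self.blocks), 2, -1)
        x = self.stem(image)
        for i, block in enumerate(self.blocks):
            x = block(x, params[:, i, 0], params[:, i, 1])
        x = x.mean(dim=(2, 3))  # global average pool
        return self.head(x)     # answer logits


# Usage (shapes are illustrative):
model = VQAModel(vocab_size=1000)
image = torch.randn(4, 3, 32, 32)
question = torch.randint(0, 1000, (4, 12))  # token ids
logits = model(image, question)             # (4, num_answers)
```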