Feature-wise linear modulation (FiLM) is a general technique for influencing a neural network's output via an external conditioning input. A FiLM layer injects the conditioning into the network's intermediate activations by applying a feature-wise affine transformation whose parameters are predicted from that conditioning input.
Formally, FiLM learns a "generator" consisting of two functions $f$ and $h$ that map the conditioning input $x_i$ to per-feature scaling and shifting parameters:

$$\gamma_{i,c} = f_c(x_i), \qquad \beta_{i,c} = h_c(x_i)$$

Then, for a network's activations $F_{i,c}$ (feature map $c$ for input $i$), FiLM applies the feature-wise affine transformation:

$$\mathrm{FiLM}(F_{i,c} \mid \gamma_{i,c}, \beta_{i,c}) = \gamma_{i,c} \, F_{i,c} + \beta_{i,c}$$
Notably, empirical results show that FiLM doesn't require being coupled with normalization: even when the modulation is applied away from normalization layers, the feature-wise affine conditioning itself remains effective.
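As a concrete illustration, here is a minimal FiLM layer sketch in PyTorch. The class name `FiLM`, the single linear generator, and the tensor shapes are assumptions made for illustration rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Scale and shift feature maps with (gamma, beta) predicted from a conditioning vector."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # The "generator": a single linear map predicts gamma and beta for every channel.
        self.generator = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, H, W); cond: (batch, cond_dim)
        gamma, beta = self.generator(cond).chunk(2, dim=-1)
        gamma = gamma[..., None, None]  # broadcast per-channel parameters over H, W
        beta = beta[..., None, None]
        return gamma * features + beta
```

Usage is a single call per conditioned layer, e.g. `FiLM(cond_dim=256, num_channels=64)(conv_features, question_embedding)`.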
VQA Model
FiLM is especially effective on vision-language tasks because it offers a simple way to combine multimodal inputs. Specifically, for visual question answering (VQA), the visual pipeline is a Convolutional Neural Network with FiLM layers, and the generator is a Gated Recurrent Unit. The GRU processes the question's semantics, and injecting this information into the CNN's features lets the model make accurate answer predictions.
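Below is a minimal sketch of how these pieces fit together, reusing the `FiLM` module above. The layer sizes, block structure, and names (`FiLMedVQA`, `num_answers`, ...) are illustrative assumptions, not the exact architecture from the FiLM paper (which uses residual blocks and a pretrained feature extractor):

```python
class FiLMedVQA(nn.Module):
    """GRU encodes the question; its final hidden state conditions each CNN block via FiLM."""

    def __init__(self, vocab_size=10000, embed_dim=200, hidden_dim=512,
                 channels=128, num_blocks=4, num_answers=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(num_blocks)])
        self.films = nn.ModuleList(
            [FiLM(hidden_dim, channels) for _ in range(num_blocks)])
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_answers))

    def forward(self, image, question_tokens):
        # Encode the question; the final GRU hidden state is the conditioning input.
        _, h = self.gru(self.embed(question_tokens))   # h: (1, batch, hidden_dim)
        cond = h.squeeze(0)
        x = torch.relu(self.stem(image))
        for conv, film in zip(self.blocks, self.films):
            x = torch.relu(film(conv(x), cond))        # inject question info into visual features
        return self.classifier(x)                      # logits over candidate answers
```

The key design point is that the question never enters the CNN as an extra input channel; it only modulates the CNN's features through the per-block $(\gamma, \beta)$ parameters.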