CLIP is a model that relates images to text via contrastive pre-training. Crucially, it embeds images and text into a shared latent space, which enables zero-shot transfer to unseen datasets.
Embeddings
Unlike other classification models, CLIP doesn't associate images with a label from a fixed set. Rather, it's fed image-text pairs scraped from the internet, where the text is natural language (like sentences or captions). Given these pairs, CLIP learns an image encoder and a text encoder (a ResNet or Vision Transformer and a Transformer, respectively) that map onto a shared latent space where the image and text from each pair have similar embeddings. These embeddings are incredibly valuable because they encode the fundamental meaning of both text and images, capturing both semantics and style; as such, CLIP encoders are used in many other works.
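As a minimal sketch of what "shared latent space" means in practice, the snippet below uses the openai/clip package (https://github.com/openai/CLIP) to embed an image and two captions into the same space; the image path and captions are placeholders.

```python
# Sketch: embedding an image and text into CLIP's shared latent space.
# Assumes PyTorch and the openai/clip package are installed; "example.jpg" is a placeholder.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # ViT image encoder + Transformer text encoder

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a watercolor painting of a city"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # shape: (1, 512)
    text_features = model.encode_text(text)     # shape: (2, 512)

# Both live in the same space, so cosine similarity between them is meaningful.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T   # (1, 2): image vs. each caption
```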
Contrastive Objective
This setup rules out the standard predictive objective of classic methods, which predicts an exact label for each image. Instead, CLIP learns through contrastive pre-training, which maximizes the similarity between the image and text embeddings of each pair while minimizing their similarity to the embeddings of all other pairs in the batch. Specifically, this is implemented as a symmetric cross-entropy loss over the cosine similarities between image and text embeddings.
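The following sketch of the symmetric contrastive loss is adapted from the pseudocode in the CLIP paper; it assumes the image and text embeddings are already L2-normalized and uses a fixed temperature for illustration (CLIP learns this scale as a parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Cosine similarity between every image and every text in the batch,
    # scaled by a temperature (fixed here; learned in CLIP).
    logits = image_embeds @ text_embeds.T / temperature   # (N, N)

    # The matching text for image i sits on the diagonal (column i).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: classify the right text for each image
    # and the right image for each text, then average the two losses.
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.T, targets)    # text -> image
    return (loss_i + loss_t) / 2
```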
Zero-Shot Transfer
With the pre-trained encoders, CLIP can adapt to a new dataset distribution and label set without fine-tuning. For a given image and a set of candidate text labels, CLIP simply predicts the label whose embedding has the highest similarity to the image embedding. Crucially, because the contrastive objective works over natural language rather than a fixed label set, CLIP can predict labels it has never seen during training.
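A minimal zero-shot classification sketch, again assuming the openai/clip package; the class names, prompt template, and image path are illustrative.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "airplane"]  # labels CLIP need never have seen as a label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # Similarity of the image to each text prompt, as produced by the CLIP model.
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

predicted = class_names[probs.argmax().item()]  # label with the highest similarity
```

Wrapping each class name in a prompt like "a photo of a {label}" tends to work better than the bare label, since it better matches the caption-like text CLIP saw during pre-training.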