Representation learning studies how we can train models to learn a semantically meaningful embedding of the input data. That is, given some input (most commonly an image), we want to “compress” it into a form that preserves its meaning. Representations learned in this manner can then be adapted to downstream tasks by predicting the desired target from the representation rather than directly from the input.

Representations are generally learned via self-supervised learning, either with “pretext” tasks or contrastive objectives.

Pretext Tasks

Many self-supervised methods train a model in a supervised manner by augmenting the unlabeled dataset to artificially create pretext tasks. Some examples are as follows:

  1. For each image, form distorted variants via translation, rotation, scaling, or color shifts, and learn to classify each distorted variant back to the original image it came from.
  2. Rotate images randomly and learn to predict the rotation that restores the original orientation (see the sketch after this list).
  3. Crop two patches from an image and learn to predict their relative spatial relationship. Similarly, we can divide the image into patches, shuffle them, and learn to un-shuffle them back to their canonical positions.
  4. Recover the original image from a grayscale variant (known as colorization).
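
As a sketch of the rotation task (item 2), here is one way the pretext dataset could be built in PyTorch; `make_rotation_batch` is an illustrative helper name, and the classifier on top is left unspecified.

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """Turn a batch of images (B, C, H, W) into a 4-way rotation-prediction task:
    each image is rotated by 0/90/180/270 degrees and labeled with its rotation id."""
    rotated, labels = [], []
    for k in range(4):  # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Any image classifier with a 4-way output head can then be trained with
# cross-entropy on (rotated images, rotation labels); the backbone's features
# are what we keep as the learned representation.
```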

Other more advanced techniques take inspiration from 🎨 Generative Modeling to learn a model that can generate parts of the input image:

  1. Denoising autoencoder: given an image corrupted by noise, recover the original (a minimal training-step sketch follows this list).
  2. Context encoder: given an image with a missing patch, recover the original.
  3. Split-brain autoencoder (pictured below): predict complementary missing color channels using two sub-networks.
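
To make the denoising autoencoder (item 1) concrete, here is a minimal training-step sketch; the tiny encoder/decoder are placeholders (not a prescribed architecture), assuming RGB images with spatial dimensions divisible by 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoder/decoder; any convolutional architecture works.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
)

def denoising_loss(images: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    """Corrupt the input with Gaussian noise, then reconstruct the *clean* image."""
    noisy = images + noise_std * torch.randn_like(images)
    recon = decoder(encoder(noisy))
    return F.mse_loss(recon, images)
```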

If we have video data, we can take advantage of the temporal relationships between frames for more powerful self-supervised tasks. Many of these tasks resemble the contrastive objectives discussed below.

  1. Given two patches of the same object at different times and a random third patch, learn an embedding under which the two matching patches are closer to each other than to the third.
  2. Determine whether a sequence of frames is in the correct or an incorrect temporal order (see the sketch after this list).
  3. Predict a video’s arrow of time (whether it’s playing forward or backward).
  4. Colorize videos by tracking pixels across frames (via embedding similarity) and coloring a new frame with the colors of the tracked pixels from an earlier reference frame.
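
A loose sketch of the order-verification task (item 2), assuming a clip arrives as a `(T, C, H, W)` tensor with at least three frames; the sampling scheme here is a simplification of published variants.

```python
import torch

def make_order_example(frames: torch.Tensor, shuffle_prob: float = 0.5):
    """Sample three frames from a clip (T, C, H, W); with probability `shuffle_prob`,
    permute them out of order. Label 1 = correct order, 0 = shuffled."""
    idx = torch.sort(torch.randperm(frames.size(0))[:3]).values  # ordered frame indices
    clip = frames[idx]
    if torch.rand(()) < shuffle_prob:
        perm = torch.randperm(3)
        while torch.equal(perm, torch.arange(3)):  # avoid the identity permutation
            perm = torch.randperm(3)
        return clip[perm], torch.tensor(0)
    return clip, torch.tensor(1)
```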

Contrastive Objectives

Rather than learning a pretext task, which might specialize our features toward that task, we can instead learn a general embedding space with a notion of “similarity” between samples. Contrastive objectives learn this space by attracting similar samples and repelling dissimilar ones.

  1. Contrastive loss: the standard contrastive loss seeks to minimize the distance between samples with the same label and maximize the distance between samples with different labels:

$$\mathcal{L}_\text{cont}(x_i, x_j) = \mathbb{1}[y_i = y_j]\, \| f(x_i) - f(x_j) \|_2^2 + \mathbb{1}[y_i \neq y_j]\, \max\big(0,\, \epsilon - \| f(x_i) - f(x_j) \|_2\big)^2.$$

For dissimilar pairs, we define a hyperparameter $\epsilon$ to lower-bound the dissimilarity distance.

  2. Triplet loss: rather than comparing pairs, we can compare an “anchor” $x$ with a “positive” sample $x^+$ and a “negative” sample $x^-$, seeking to attract the positive and repel the negative with margin $\epsilon$:

$$\mathcal{L}_\text{triplet}(x, x^+, x^-) = \max\big(0,\, \| f(x) - f(x^+) \|_2^2 - \| f(x) - f(x^-) \|_2^2 + \epsilon\big).$$
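
A minimal PyTorch sketch of both losses on precomputed embeddings; `margin` plays the role of $\epsilon$, and the helper names are illustrative rather than any particular library’s API.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, same_label, margin: float = 1.0):
    """Pairwise contrastive loss: pull same-label pairs together and push
    different-label pairs at least `margin` apart. `same_label` is a 0/1 tensor."""
    d = F.pairwise_distance(z_i, z_j)              # ||f(x_i) - f(x_j)||_2
    y = same_label.float()
    pos = y * d.pow(2)                             # attract similar pairs
    neg = (1 - y) * F.relu(margin - d).pow(2)      # repel dissimilar pairs
    return (pos + neg).mean()

def triplet_loss(anchor, positive, negative, margin: float = 1.0):
    """Triplet loss: the anchor should be closer to the positive than to the negative."""
    d_pos = F.pairwise_distance(anchor, positive).pow(2)
    d_neg = F.pairwise_distance(anchor, negative).pow(2)
    return F.relu(d_pos - d_neg + margin).mean()
```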

  3. Lifted structured loss: we can check all possible pairs in a batch; for the set of positive pairs $\mathcal{P}$, the set of negative pairs $\mathcal{N}$, and distance $D_{ij} = \| f(x_i) - f(x_j) \|_2$:

$$\mathcal{L}_\text{struct} = \frac{1}{2|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \max\big(0,\, \mathcal{L}_\text{struct}^{(ij)}\big)^2, \quad \text{where} \quad \mathcal{L}_\text{struct}^{(ij)} = D_{ij} + \max\Big( \max_{(i,k) \in \mathcal{N}} (\epsilon - D_{ik}),\; \max_{(j,l) \in \mathcal{N}} (\epsilon - D_{jl}) \Big).$$

In practice, the nested max operations can be relaxed to a soft maximum (log-sum-exp) for smoother optimization,

$$\mathcal{L}_\text{struct}^{(ij)} = D_{ij} + \log\Big( \sum_{(i,k) \in \mathcal{N}} \exp(\epsilon - D_{ik}) + \sum_{(j,l) \in \mathcal{N}} \exp(\epsilon - D_{jl}) \Big).$$

![[20230406130932.png#invert|300]]
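
A sketch of the smooth (log-sum-exp) variant of the lifted structured loss over a batch of embeddings and labels, assuming Euclidean distances and that every sample in the batch has at least one positive and one negative.

```python
import torch

def lifted_structured_loss(embeddings, labels, margin: float = 1.0):
    """Smooth lifted structured loss over a batch: for every positive pair (i, j),
    penalize D_ij plus a soft maximum over margin-violating negatives of i and j."""
    dist = torch.cdist(embeddings, embeddings)                    # D_ij
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask, neg_mask = same & ~eye, ~same

    # log sum_k exp(margin - D_ik) over the negatives k of each sample i
    lse_neg = torch.logsumexp((margin - dist).masked_fill(~neg_mask, float("-inf")), dim=1)

    # combine the negative terms of i and of j for every pair (i, j)
    pair_lse = torch.logsumexp(
        torch.stack([lse_neg.unsqueeze(1).expand_as(dist),
                     lse_neg.unsqueeze(0).expand_as(dist)]), dim=0)

    per_pair = torch.clamp(dist + pair_lse, min=0).pow(2)
    # each unordered positive pair appears twice in pos_mask; the doubled sum and
    # doubled count cancel, matching the 1 / (2|P|) normalization above
    return per_pair[pos_mask].sum() / (2 * pos_mask.sum())
```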

  4. Multi-class $N$-pair loss: we can generalize the triplet loss to push away multiple classes at once. For $N$ pairs of samples $\{ (x_i, x_i^+) \}_{i=1}^N$ drawn from $N$ distinct classes with embeddings $f(\cdot)$, every positive $x_j^+$ with $j \neq i$ acts as a negative for anchor $x_i$, and we have

$$\mathcal{L}_\text{N-pair} = \frac{1}{N} \sum_{i=1}^N \log \Big( 1 + \sum_{j \neq i} \exp\big( f(x_i)^\top f(x_j^+) - f(x_i)^\top f(x_i^+) \big) \Big) = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp\big( f(x_i)^\top f(x_i^+) \big)}{\sum_{j=1}^N \exp\big( f(x_i)^\top f(x_j^+) \big)}.$$

Note that the inner product is essentially the inverse of a distance norm; the more similar two embeddings are, the larger their inner product.
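
The right-hand form is exactly a softmax cross-entropy over inner products, which keeps the implementation short; a minimal sketch, assuming the anchor and positive embeddings are stacked into two `(N, D)` tensors.

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    """Multi-class N-pair loss for N (anchor, positive) embedding pairs, one per
    class; the positives of the other classes act as negatives for each anchor."""
    logits = anchors @ positives.t()          # logits[i, j] = f(x_i)^T f(x_j^+)
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)   # -log softmax of each matching pair
```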

  5. ℹ️ InfoNCE: framing contrastive learning as maximizing the mutual information between similar samples, we can derive the InfoNCE loss over a set $X = \{ x_1, \ldots, x_N \}$ containing one positive sample $x_\text{pos} \sim p(x \mid c)$ and $N - 1$ negative samples drawn from the proposal distribution $p(x)$,

$$\mathcal{L}_\text{InfoNCE} = -\mathbb{E}_X \left[ \log \frac{f(x_\text{pos}, c)}{\sum_{x' \in X} f(x', c)} \right],$$

where the scoring function $f(x, c) \propto \frac{p(x \mid c)}{p(x)}$ estimates the density ratio between a sample $x$ and the context $c$.
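
In practice, $f(x, c)$ is commonly parameterized as an exponentiated similarity between embeddings (a modeling choice, not forced by the derivation), so the loss again becomes a cross-entropy over one positive and $N - 1$ negatives. A minimal sketch with cosine similarity and a temperature, both of which are assumptions here.

```python
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, keys: torch.Tensor, pos_idx: int,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE for one query (context embedding, shape (D,)) against N candidate
    keys (shape (N, D)), exactly one of which (keys[pos_idx]) is the positive."""
    query = F.normalize(query, dim=-1)             # cosine similarity via L2 normalization
    keys = F.normalize(keys, dim=-1)
    logits = keys @ query / temperature            # score of each candidate against the query
    target = torch.tensor([pos_idx], device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)
```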