InfoNCE is a loss that encourages a model to associate some “context” $c$ with samples $x$. That is, we want to learn some embedding for $x$ and $c$ that maximizes their 🤝 Mutual Information,

$$
I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}.
$$

In the original contrastive predictive coding (CPC) paper, $c_t$ was a context used to summarize a sequence history, and $x_{t+k}$ consisted of future predictions in the sequence. However, InfoNCE has since been generalized to many different domains that all share the goal of learning semantics in $x$ or $c$ (or both) that relate them to each other.

More specifically, we’ll model the density ratio within the expectation,

$$
f(x, c) \propto \frac{p(x \mid c)}{p(x)}.
$$

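How $f$ is parameterized is up to us; as one concrete example, the CPC paper uses a simple log-bilinear score on the learned embeddings (in that paper's notation, $z_{t+k}$ is the encoded future sample and $W_k$ is a learned matrix for prediction step $k$),

$$
f_k(x_{t+k}, c_t) = \exp\left( z_{t+k}^{\top} W_k \, c_t \right).
$$
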
Our goal is to make $f(x, c)$ high for context $c$ and samples $x$ drawn from our data distribution. To do so, we’ll take inspiration from 📣 Noise Contrastive Estimation and develop a contrastive loss between positive and negative samples.

Specifically, we’ll form a set $X = \{x_1, \ldots, x_N\}$ consisting of one positive sample drawn from $p(x \mid c)$ and $N - 1$ negative samples drawn from some proposal distribution $p(x)$ that’s independent of $c$. The probability of $x_i$ being our positive sample, indicated by $d = i$, is

$$
p(d = i \mid X, c) = \frac{p(x_i \mid c) \prod_{l \neq i} p(x_l)}{\sum_{j=1}^{N} p(x_j \mid c) \prod_{l \neq j} p(x_l)} = \frac{\frac{p(x_i \mid c)}{p(x_i)}}{\sum_{j=1}^{N} \frac{p(x_j \mid c)}{p(x_j)}},
$$

where the second expression follows from dividing the numerator and denominator by $\prod_{l=1}^{N} p(x_l)$.

From our final expression, we can substitute in our definition for $f$ and get

$$
p(d = i \mid X, c) = \frac{f(x_i, c)}{\sum_{j=1}^{N} f(x_j, c)}.
$$

We want to learn that has high probability for true positive samples, so the InfoNCE loss optimizes the categorical 💧 Cross Entropy loss for classifying correctly,

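Here’s a minimal sketch of this loss in PyTorch (my own illustration, not the CPC reference implementation), assuming we already have paired sample and context embeddings, taking $f$ to be the exponentiated dot product, and letting the other samples in the batch act as the negatives:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_x: torch.Tensor, z_c: torch.Tensor) -> torch.Tensor:
    """InfoNCE over a batch of paired embeddings.

    z_x: (N, D) sample embeddings, z_c: (N, D) context embeddings.
    Row i of z_x is the positive for row i of z_c; the other N - 1
    rows act as negatives drawn from the proposal distribution.
    """
    # Scores log f(x_j, c_i) for every (context, sample) pair: (N, N).
    logits = z_c @ z_x.T
    # The positive for context i sits on the diagonal, so the "class"
    # label for row i is simply i.
    labels = torch.arange(z_x.shape[0], device=z_x.device)
    # Categorical cross entropy = -E[log softmax of the positive score],
    # which is exactly the InfoNCE objective above.
    return F.cross_entropy(logits, labels)

# Usage: in practice z_x and z_c would come from learned encoders.
z_x, z_c = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce_loss(z_x, z_c)
```

Note that `F.cross_entropy` applies the log-softmax internally, so we work with raw dot products (the logs of $f$) rather than exponentiating explicitly, which keeps the computation numerically stable.
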
Optimizing this loss thus gives us an $f$ that maximizes $\frac{p(x \mid c)}{p(x)}$ for $x$ and $c$ sampled from our data distribution, which in turn maximizes mutual information. This objective is actually explicitly related to mutual information as

$$
I(x; c) \geq \log N - \mathcal{L}_N,
$$

so another interpretation is that minimizing $\mathcal{L}_N$ maximizes a lower bound on the mutual information.

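To see where this bound comes from, here’s a sketch of the argument from the CPC paper’s appendix: plug the optimal $f(x, c) = \frac{p(x \mid c)}{p(x)}$ into $\mathcal{L}_N$ (writing $X_{\text{neg}}$ for the $N - 1$ negative samples in $X$),

$$
\mathcal{L}_N^{\text{opt}}
= \mathbb{E}_X \log \left[ 1 + \frac{p(x)}{p(x \mid c)} \sum_{x_j \in X_{\text{neg}}} \frac{p(x_j \mid c)}{p(x_j)} \right]
\approx \mathbb{E}_X \log \left[ 1 + \frac{p(x)}{p(x \mid c)} (N - 1) \right]
\geq \log N - I(x; c),
$$

where the approximation replaces the sum over negatives with its expectation, $N - 1$. Rearranging gives the bound above and shows that it tightens as the number of negatives $N$ grows.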