Noise contrastive estimation¹ (NCE) is a method for optimizing a probability model that predicts probabilities by directly approximating the partition function $Z$. We represent the model's distribution as

$$p_\theta(x) = \exp\big(f_\theta(x) + c\big),$$

where $f_\theta(x)$ is the model's unnormalized log-score and $c$, a parameter learned alongside $\theta$ in the model, approximates $-\log Z = -\log \int \exp f_\theta(x)\,dx$. Note that this resulting distribution is not an exactly valid probability distribution, but it will be close as our estimate for $-\log Z$ improves.
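As a concrete running example, here is a minimal NumPy sketch of such a model; the 1-D score $f_\theta(x) = -(x - \theta)^2 / 2$, whose true normalizer is $Z = \sqrt{2\pi}$, and the helper names `f`, `log_p_model`, and `p_model` are illustrative assumptions, not anything prescribed by NCE itself.

```python
import numpy as np

# Toy unnormalized model: log p_theta(x) = f_theta(x) + c.
# Here f_theta(x) = -(x - theta)^2 / 2, so the true partition function is
# Z = sqrt(2 * pi) and the ideal value of c is -0.5 * log(2 * pi).
def f(x, theta):
    return -0.5 * (x - theta) ** 2

def log_p_model(x, theta, c):
    # Unnormalized log-density plus the learned estimate of -log Z.
    return f(x, theta) + c

def p_model(x, theta, c):
    return np.exp(log_p_model(x, theta, c))
```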
Naively asking our algorithm to approximate $p_{\text{data}}$ with this model doesn't guarantee that $c$ is equal to $-\log Z$: to maximize probabilities, it can simply learn an extremely large $c$ and ignore the probability constraint.
Instead, we'll create a joint distribution consisting of samples from our data distribution $p_{\text{data}}(x)$ and samples from some noise distribution $q(x)$. Specifically, for each sample from $p_{\text{data}}$, we'll have $k$ noise samples; if we let $D = 1$ indicate a true data sample and $D = 0$ indicate noise, we have

$$p(D=1) = \frac{1}{1+k}, \qquad p(D=0) = \frac{k}{1+k},$$

and thus the joint is

$$p(x, D) = \begin{cases} \dfrac{1}{1+k}\, p_{\text{data}}(x) & D = 1, \\[6pt] \dfrac{k}{1+k}\, q(x) & D = 0. \end{cases}$$
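Continuing the toy sketch above, and assuming (purely for illustration) that the data come from $\mathcal{N}(2, 1)$ and the noise distribution $q$ is $\mathcal{N}(0, 2^2)$, sampling from this joint might look like:

```python
rng = np.random.default_rng(0)

def q_pdf(x, noise_std=2.0):
    # Density of the noise distribution q = N(0, noise_std^2).
    return np.exp(-0.5 * (x / noise_std) ** 2) / (noise_std * np.sqrt(2.0 * np.pi))

def sample_joint(n, k, data_mean=2.0, noise_std=2.0):
    # For each of n data samples (D = 1) draw k noise samples (D = 0),
    # matching p(D = 1) = 1 / (1 + k) and p(D = 0) = k / (1 + k).
    x_data = rng.normal(data_mean, 1.0, size=n)
    x_noise = rng.normal(0.0, noise_std, size=n * k)
    x = np.concatenate([x_data, x_noise])
    d = np.concatenate([np.ones(n), np.zeros(n * k)])
    return x, d
```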
Whereas our original goal is to learn some $p_\theta(x)$ that resembles $p_{\text{data}}(x)$, NCE builds a binary classifier for $D$ on top of $p_\theta$,

$$p_\theta(D=1 \mid x) = \frac{p_\theta(x)}{p_\theta(x) + k\,q(x)}, \qquad p_\theta(D=0 \mid x) = \frac{k\,q(x)}{p_\theta(x) + k\,q(x)}.$$
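In the running sketch, this posterior can be computed directly from the unnormalized model and $q$; working in log space here is my own choice, made to avoid overflow when $p_\theta$ is badly scaled.

```python
def log_posterior(x, theta, c, k, noise_std=2.0):
    # Returns (log p(D=1 | x), log p(D=0 | x)) for the NCE classifier
    #   p(D=1 | x) = p_theta(x) / (p_theta(x) + k * q(x)).
    log_pm = log_p_model(x, theta, c)
    log_kq = np.log(k) + np.log(q_pdf(x, noise_std))
    log_denom = np.logaddexp(log_pm, log_kq)
    return log_pm - log_denom, log_kq - log_denom
```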
Our goal is to minimize the KL divergence between the joint defined by our model and the joint data-noise distribution above, which is equivalent to maximizing the NCE objective:

$$\mathcal{L}_{\text{NCE}}(\theta, c) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(D=1 \mid x)\big] + k\,\mathbb{E}_{x \sim q}\big[\log p_\theta(D=0 \mid x)\big].$$
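A Monte Carlo estimate of this objective, written in terms of the hypothetical helpers above, could look like:

```python
def nce_objective(theta, c, x_data, x_noise, k, noise_std=2.0):
    # Estimates E_{x ~ p_data}[log p(D=1 | x)] + k * E_{x ~ q}[log p(D=0 | x)]
    # by averaging over the data samples and the k * n noise samples.
    log_d1, _ = log_posterior(x_data, theta, c, k, noise_std)
    _, log_d0 = log_posterior(x_noise, theta, c, k, noise_std)
    return log_d1.mean() + k * log_d0.mean()
```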
Minimizing this KL divergence (equivalently, maximizing $\mathcal{L}_{\text{NCE}}$) gives us the optimal values for $\theta$ and $c$. To do so, we must compute our probabilities; we'll need to know our noise distribution (which gives us $q(x)$) and also estimate the partition with $c \approx -\log Z$.

Unlike the naive case, where $c$ can be chosen to explode our probabilities to infinity, $p_\theta(x)$ appears in both the numerator and denominator here: if our model predicts a sky-high $p_\theta(x)$ for every input, whether real or noise, then $p_\theta(D=1 \mid x) \to 1$ on real samples, which is good, but $p_\theta(D=0 \mid x) \to 0$ on noise samples, which is wrong. Thus, this objective controls $c$, forcing the model to predict high values of $p_\theta(x)$ for real samples $x \sim p_{\text{data}}$, thereby making $p_\theta(D=1 \mid x) \to 1$, while predicting low values for noise samples $x \sim q$ to make $p_\theta(D=0 \mid x) \to 1$, which maximizes our objective.
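A quick sanity check on the toy sketch illustrates this: holding $\theta$ at the (assumed) true data mean and inflating $c$ far above $-\log Z$ wrecks the noise term, so the overall objective drops rather than rises. The specific numbers depend entirely on my assumed toy distributions.

```python
x_all, d = sample_joint(n=5000, k=10)
x_d, x_n = x_all[d == 1], x_all[d == 0]
for c in (-0.5 * np.log(2.0 * np.pi), 5.0, 20.0):
    # The first c is the "correct" -log Z; the others blow the model up.
    print(f"c = {c:6.2f}  NCE objective = {nce_objective(2.0, c, x_d, x_n, k=10):8.3f}")
```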
Moreover, a closer analysis of the derivative of this objective shows that

$$\nabla_\theta \mathcal{L}_{\text{NCE}} = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\frac{k\,q(x)}{p_\theta(x) + k\,q(x)}\,\nabla_\theta \log p_\theta(x)\right] - k\,\mathbb{E}_{x \sim q}\!\left[\frac{p_\theta(x)}{p_\theta(x) + k\,q(x)}\,\nabla_\theta \log p_\theta(x)\right],$$
which, as $k \to \infty$, approaches the gradient for the standard MLE,

$$\nabla_\theta\, \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big] = \mathbb{E}_{x \sim p_{\text{data}}}\big[\nabla_\theta \log p_\theta(x)\big].$$
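To see why, note that as $k \to \infty$ the weight in the first expectation saturates while the second term reduces to an integral of the model's gradient:

$$\frac{k\,q(x)}{p_\theta(x) + k\,q(x)} \to 1, \qquad k\,\mathbb{E}_{x \sim q}\!\left[\frac{p_\theta(x)}{p_\theta(x) + k\,q(x)}\,\nabla_\theta \log p_\theta(x)\right] \to \int \nabla_\theta\, p_\theta(x)\,dx = \nabla_\theta \int p_\theta(x)\,dx,$$

and the last quantity vanishes once $p_\theta$ is (approximately) normalized, since $\int p_\theta(x)\,dx \approx 1$ no longer depends on $\theta$.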
Thus, with a large enough $k$, optimizing our NCE objective moves us in the same direction as maximizing $\mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big]$.
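Putting the pieces together, here is a sketch of a full NCE training loop for the toy model, using the gradient expression above; for this $f_\theta$, $\nabla_\theta \log p_\theta(x) = x - \theta$, and the same weighted-difference form gives the gradient with respect to $c$, since $\nabla_c \log p_\theta(x) = 1$. The learning rate, batch size, and step count are arbitrary choices of mine.

```python
theta, c, k, lr = 0.0, 0.0, 10, 0.1
for step in range(500):
    x_all, d = sample_joint(n=200, k=k)
    x_d, x_n = x_all[d == 1], x_all[d == 0]
    _, log_d0_data = log_posterior(x_d, theta, c, k)
    log_d1_noise, _ = log_posterior(x_n, theta, c, k)
    w_data = np.exp(log_d0_data)    # p(D=0 | x) on data samples
    w_noise = np.exp(log_d1_noise)  # p(D=1 | x) on noise samples
    # Sample estimates of the NCE gradient for theta and for c (gradient ascent).
    grad_theta = (w_data * (x_d - theta)).mean() - k * (w_noise * (x_n - theta)).mean()
    grad_c = w_data.mean() - k * w_noise.mean()
    theta += lr * grad_theta
    c += lr * grad_c
# theta should drift toward the assumed data mean (2.0) and c toward
# -log Z = -0.5 * log(2 * pi) ~ -0.92 for this toy model.
print(theta, c)
```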