Noise contrastive estimation¹ (NCE) is a method for optimizing a probability model that predicts probabilities by directly approximating the partition function $Z$. We represent the model's distribution as

$$p_\theta(x) = \exp\big(s_\theta(x) + c\big),$$

where $c$, a parameter learned alongside $\theta$ in the model, approximates $-\log Z$ (with $Z = \sum_{x'} \exp(s_\theta(x'))$). Note that this resulting distribution is not an exactly valid probability distribution, but it will be close as our estimate for $-\log Z$ improves.

Naively asking our algorithm to approximate $-\log Z$ doesn't guarantee that $c$ actually ends up equal to it: to maximize probabilities, the model can simply learn an extremely large $c$ and ignore the constraint that probabilities sum to one.
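To make the failure mode concrete, here's a tiny NumPy sketch (the tabular score table and all names are illustrative, not from the original): with the correct $c = -\log Z$ the model's outputs sum to one, while an arbitrarily large $c$ inflates them without bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized model over a small discrete space: s_theta(x) is just a
# table of scores here, standing in for any score network.
num_states = 8
scores = rng.normal(size=num_states)              # s_theta(x) for each x

def log_p_model(scores, c):
    # log p_theta(x) = s_theta(x) + c, which is only a proper distribution
    # when c equals -log sum_x exp(s_theta(x)).
    return scores + c

true_c = -np.log(np.exp(scores).sum())
print(np.exp(log_p_model(scores, true_c)).sum())  # ~1.0: properly normalized
print(np.exp(log_p_model(scores, 10.0)).sum())    # huge: "probabilities" blow up
```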

Contrastive Objective

Instead, we'll create a joint distribution consisting of samples from our data distribution $p_{\text{data}}$ and samples from some noise distribution $q$. Specifically, for each sample from $p_{\text{data}}$, we'll have $k$ noise samples; if we let $D = 1$ indicate a true data sample and $D = 0$ indicate noise, we have

$$P(D=1) = \frac{1}{1+k}, \qquad P(D=0) = \frac{k}{1+k},$$

and thus the joint is

$$p(x, D) = \begin{cases} \dfrac{1}{1+k}\, p_{\text{data}}(x), & D = 1, \\[6pt] \dfrac{k}{1+k}\, q(x), & D = 0. \end{cases}$$
Whereas our original goal is to learn some $p_\theta$ that resembles $p_{\text{data}}$, NCE builds a binary classifier for $D$ on top of $p_\theta$,

$$P_\theta(D=1 \mid x) = \frac{p_\theta(x)}{p_\theta(x) + k\, q(x)}, \qquad P_\theta(D=0 \mid x) = \frac{k\, q(x)}{p_\theta(x) + k\, q(x)}.$$
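A minimal sketch of this classifier in NumPy, assuming we can evaluate the model's log-density $\log p_\theta(x)$ and the noise log-density $\log q(x)$ pointwise (the function and argument names are my own):

```python
import numpy as np

def posterior_real(log_p_model, log_q_noise, k):
    """P_theta(D=1 | x) = p_theta(x) / (p_theta(x) + k q(x)), computed via the log odds."""
    logit = log_p_model - (np.log(k) + log_q_noise)   # log p_theta(x) - log(k q(x))
    return 1.0 / (1.0 + np.exp(-logit))               # sigmoid

# A point the model scores higher than the noise density does: classified as likely real.
print(posterior_real(log_p_model=-1.0, log_q_noise=-3.0, k=5))   # ~0.60
```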
Our goal is to minimize the KL divergence between the joint defined by our model and the joint data-noise distribution above, which is equivalent to maximizing the NCE objective:

$$\mathcal{L}_{\text{NCE}}(\theta, c) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_\theta(x)}{p_\theta(x) + k\, q(x)}\right] + k\, \mathbb{E}_{x \sim q}\!\left[\log \frac{k\, q(x)}{p_\theta(x) + k\, q(x)}\right].$$

Maximizing this objective (equivalently, minimizing that KL divergence) gives us the optimal values for $\theta$ and $c$. To do so, we must be able to compute the probabilities above: we'll need to know our noise distribution $q$ (which gives us $q(x)$) and also estimate the partition function with $c$.
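Here is one way the objective might be estimated from samples, as a NumPy sketch (a Monte Carlo estimate over a data batch and $k$ noise samples per data point; the function name and array layout are assumptions of this sketch):

```python
import numpy as np

def nce_objective(log_p_data, log_q_data, log_p_noise, log_q_noise, k):
    """Monte Carlo estimate of the NCE objective.

    log_p_data, log_q_data:   model / noise log-densities at the data samples
    log_p_noise, log_q_noise: model / noise log-densities at the k-per-datum noise samples
    """
    log_k = np.log(k)
    # log P_theta(D=1|x) = log p_theta(x) - log(p_theta(x) + k q(x))
    log_post_real = log_p_data - np.logaddexp(log_p_data, log_k + log_q_data)
    # log P_theta(D=0|x) = log(k q(x)) - log(p_theta(x) + k q(x))
    log_post_noise = (log_k + log_q_noise) - np.logaddexp(log_p_noise, log_k + log_q_noise)
    return log_post_real.mean() + k * log_post_noise.mean()

# Example with arbitrary numbers: 4 data samples, k = 3 noise samples each.
rng = np.random.default_rng(0)
print(nce_objective(rng.normal(size=4), rng.normal(size=4),
                    rng.normal(size=12), rng.normal(size=12), k=3))
```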

Analysis and Intuition

Unlike the naive case, where $c$ can be chosen to explode our probabilities to infinity, $p_\theta(x)$ appears in both the numerator and denominator here. If our model predicts a sky-high $p_\theta(x)$ for every input, whether real or noise, then $P_\theta(D=1 \mid x) \to 1$, which is good on the data term, but $P_\theta(D=0 \mid x) \to 0$, which is wrong on the noise term. Thus, this objective controls $c$: it forces the model to predict high values of $p_\theta(x)$ for real samples $x \sim p_{\text{data}}$, thereby making $P_\theta(D=1 \mid x) \to 1$, while predicting low values for noise samples to make $P_\theta(D=0 \mid x) \to 1$, which maximizes our objective.
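To see the point about $c$ concretely, here is a toy check (illustrative NumPy, exact expectations on a small discrete space): fixing the scores and sweeping $c$, the objective peaks near the true $-\log Z$ rather than at arbitrarily large $c$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                       # small discrete space {0, ..., 5}
p_data = rng.dirichlet(np.ones(n))          # "true" data distribution
q = np.full(n, 1.0 / n)                     # uniform noise distribution
scores = np.log(p_data)                     # pretend s_theta already matches the data
k = 10

def exact_objective(c):
    p_model = np.exp(scores + c)            # p_theta(x) = exp(s_theta(x) + c)
    post_real = p_model / (p_model + k * q)
    post_noise = (k * q) / (p_model + k * q)
    # E_{p_data}[log P(D=1|x)] + k E_q[log P(D=0|x)], computed exactly
    return np.sum(p_data * np.log(post_real)) + k * np.sum(q * np.log(post_noise))

true_neg_log_Z = -np.log(np.exp(scores).sum())      # here Z = 1, so this is ~0
cs = np.linspace(-3, 3, 121)
best_c = cs[np.argmax([exact_objective(c) for c in cs])]
print(best_c, true_neg_log_Z)               # best_c lands near -log Z, not at +infinity
```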

Moreover, a closer analysis of the derivative of this objective shows that

$$\nabla_\theta\, \mathcal{L}_{\text{NCE}} = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\frac{k\, q(x)}{p_\theta(x) + k\, q(x)}\, \nabla_\theta \log p_\theta(x)\right] - k\, \mathbb{E}_{x \sim q}\!\left[\frac{p_\theta(x)}{p_\theta(x) + k\, q(x)}\, \nabla_\theta \log p_\theta(x)\right],$$

which, as $k \to \infty$, approaches the gradient for the standard MLE,

$$\mathbb{E}_{x \sim p_{\text{data}}}\big[\nabla_\theta \log p_\theta(x)\big] - \mathbb{E}_{x \sim p_\theta}\big[\nabla_\theta \log p_\theta(x)\big].$$

Thus, with a large enough $k$, optimizing our NCE objective moves us in the same direction as maximizing the log-likelihood $\mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$.
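A small numerical check of this limit, on a toy discrete space where both gradients can be computed exactly (a tabular model with one score per state; the closed-form expressions in the comments follow from the equations above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
p_data = rng.dirichlet(np.ones(n))          # data distribution on a 6-state space
q = rng.dirichlet(np.ones(n))               # noise distribution with full support
scores = rng.normal(size=n)                 # tabular model: one score s(x) per state
c = -np.log(np.exp(scores).sum())           # use the true -log Z so p_theta is normalized
p_model = np.exp(scores + c)                # p_theta(x) = exp(s(x) + c)

# Exact MLE gradient of E_{p_data}[log p_theta(x)] w.r.t. the scores:
# E_{p_data}[grad log p_theta] - E_{p_theta}[grad log p_theta] = p_data - p_theta
mle_grad = p_data - p_model

for k in [1, 10, 100, 10_000]:
    # Exact NCE gradient w.r.t. the scores on this discrete space, per state x:
    # k q(x) / (p_theta(x) + k q(x)) * (p_data(x) - p_theta(x))
    nce_grad = (k * q) / (p_model + k * q) * (p_data - p_model)
    print(k, np.abs(nce_grad - mle_grad).max())   # the gap shrinks as k grows
```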

Log Odds Form

Finally, note that in practice, the normalizer is usually just set to $1$ (i.e., $c = 0$), and our model learns the probabilities directly, $p_\theta(x) = \exp(s_\theta(x))$. Also, with this setting, we can rewrite the classifier probability as a sigmoid of a log odds:

$$P_\theta(D=1 \mid x) = \frac{p_\theta(x)}{p_\theta(x) + k\, q(x)} = \sigma\!\big(\log p_\theta(x) - \log k\, q(x)\big).$$
If we have

$$p_\theta(x) = \exp\big(s_\theta(x)\big)$$

(without the partition function, which was assumed to be $1$ with $c = 0$), then this probability simplifies to

$$P_\theta(D=1 \mid x) = \sigma\!\big(s_\theta(x) - \log k\, q(x)\big).$$
Plugging this (and the corresponding $P_\theta(D=0 \mid x) = 1 - P_\theta(D=1 \mid x)$) back into the NCE objective, we recover its log odds form,

$$\mathcal{L}_{\text{NCE}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\Big[\log \sigma\big(s_\theta(x) - \log k\, q(x)\big)\Big] + k\, \mathbb{E}_{x \sim q}\Big[\log\Big(1 - \sigma\big(s_\theta(x) - \log k\, q(x)\big)\Big)\Big].$$
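As a sanity check, here's a NumPy sketch (toy numbers, illustrative names) verifying that this logistic form agrees with the original objective when $p_\theta(x) = \exp(s_\theta(x))$:

```python
import numpy as np

def log_sigmoid(z):
    # numerically stable log(sigmoid(z))
    return -np.logaddexp(0.0, -z)

rng = np.random.default_rng(2)
k = 5
s_data, s_noise = rng.normal(size=4), rng.normal(size=20)            # model scores s_theta(x)
log_q_data, log_q_noise = rng.normal(size=4), rng.normal(size=20)    # noise log-densities q(x)

# Log odds form: a binary logistic loss on the logits s_theta(x) - log(k q(x)).
logit_data = s_data - (np.log(k) + log_q_data)
logit_noise = s_noise - (np.log(k) + log_q_noise)
log_odds_form = log_sigmoid(logit_data).mean() + k * log_sigmoid(-logit_noise).mean()

# Direct form: plug p_theta(x) = exp(s_theta(x)) into the original objective.
p_data_x, q_data_x = np.exp(s_data), np.exp(log_q_data)
p_noise_x, q_noise_x = np.exp(s_noise), np.exp(log_q_noise)
direct_form = (np.log(p_data_x / (p_data_x + k * q_data_x)).mean()
               + k * np.log(k * q_noise_x / (p_noise_x + k * q_noise_x)).mean())

print(np.isclose(log_odds_form, direct_form))   # True: the two forms agree
```

In other words, the log odds form is just a binary cross-entropy loss on the logit $s_\theta(x) - \log k\, q(x)$, which is why NCE is often implemented as a logistic-regression-style loss.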
Footnotes

  1. Awesome explanation and connections with InfoNCE here.