Noise contrastive estimation¹ (NCE) is a method for optimizing a probability model that predicts probabilities by directly approximating the partition function $Z$. We represent the model's distribution as

$$p_\theta(x) = \exp\big(f_\theta(x) + c\big),$$

where $f_\theta(x)$ is the model's unnormalized log-score and $c$, a parameter learned alongside $\theta$ in the model, approximates $-\log Z = -\log \int \exp f_\theta(x)\,dx$. Note that this resulting distribution is not an exactly valid probability distribution, but it will be close as our estimate for $-\log Z$ improves.
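As a concrete running example, here is a minimal NumPy sketch of such a model; the 1-D score $f_\theta(x) = -(x - \theta)^2 / 2$, whose true normalizer is $Z = \sqrt{2\pi}$, and the helper names `f`, `log_p_model`, and `p_model` are illustrative assumptions, not anything prescribed by NCE itself.

```python
import numpy as np

# Toy unnormalized model: log p_theta(x) = f_theta(x) + c.
# Here f_theta(x) = -(x - theta)^2 / 2, so the true partition function is
# Z = sqrt(2 * pi) and the ideal value of c is -0.5 * log(2 * pi).
def f(x, theta):
    return -0.5 * (x - theta) ** 2

def log_p_model(x, theta, c):
    # Unnormalized log-density plus the learned estimate of -log Z.
    return f(x, theta) + c

def p_model(x, theta, c):
    return np.exp(log_p_model(x, theta, c))
```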
Naively asking our algorithm to approximate $p_{\text{data}}$ with this model doesn't guarantee that $c$ is equal to $-\log Z$: to maximize probabilities, it can simply learn an extremely large $c$ and ignore the probability constraint.
Instead, we'll create a joint distribution consisting of samples from our data distribution $p_{\text{data}}(x)$ and samples from some noise distribution $q(x)$. Specifically, for each sample from $p_{\text{data}}$, we'll have $k$ noise samples; if we let $D = 1$ indicate a true data sample and $D = 0$ indicate noise, we have

$$p(D=1) = \frac{1}{1+k}, \qquad p(D=0) = \frac{k}{1+k},$$

and thus the joint is

$$p(x, D) = \begin{cases} \dfrac{1}{1+k}\, p_{\text{data}}(x) & D = 1, \\[6pt] \dfrac{k}{1+k}\, q(x) & D = 0. \end{cases}$$
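Continuing the toy sketch above, and assuming (purely for illustration) that the data come from $\mathcal{N}(2, 1)$ and the noise distribution $q$ is $\mathcal{N}(0, 2^2)$, sampling from this joint might look like:

```python
rng = np.random.default_rng(0)

def q_pdf(x, noise_std=2.0):
    # Density of the noise distribution q = N(0, noise_std^2).
    return np.exp(-0.5 * (x / noise_std) ** 2) / (noise_std * np.sqrt(2.0 * np.pi))

def sample_joint(n, k, data_mean=2.0, noise_std=2.0):
    # For each of n data samples (D = 1) draw k noise samples (D = 0),
    # matching p(D = 1) = 1 / (1 + k) and p(D = 0) = k / (1 + k).
    x_data = rng.normal(data_mean, 1.0, size=n)
    x_noise = rng.normal(0.0, noise_std, size=n * k)
    x = np.concatenate([x_data, x_noise])
    d = np.concatenate([np.ones(n), np.zeros(n * k)])
    return x, d
```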
Whereas our original goal is to learn some $p_\theta(x)$ that resembles $p_{\text{data}}(x)$, NCE builds a binary classifier for $D$ on top of $p_\theta$,

$$p_\theta(D=1 \mid x) = \frac{p_\theta(x)}{p_\theta(x) + k\,q(x)}, \qquad p_\theta(D=0 \mid x) = \frac{k\,q(x)}{p_\theta(x) + k\,q(x)}.$$
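In the running sketch, this posterior can be computed directly from the unnormalized model and $q$; working in log space here is my own choice, made to avoid overflow when $p_\theta$ is badly scaled.

```python
def log_posterior(x, theta, c, k, noise_std=2.0):
    # Returns (log p(D=1 | x), log p(D=0 | x)) for the NCE classifier
    #   p(D=1 | x) = p_theta(x) / (p_theta(x) + k * q(x)).
    log_pm = log_p_model(x, theta, c)
    log_kq = np.log(k) + np.log(q_pdf(x, noise_std))
    log_denom = np.logaddexp(log_pm, log_kq)
    return log_pm - log_denom, log_kq - log_denom
```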
Our goal is to minimize the KL divergence between the joint defined by our model and the joint data-noise distribution above, which is equivalent to maximizing the NCE objective:

$$\mathcal{L}_{\text{NCE}}(\theta, c) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(D=1 \mid x)\big] + k\,\mathbb{E}_{x \sim q}\big[\log p_\theta(D=0 \mid x)\big].$$
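A Monte Carlo estimate of this objective, written in terms of the hypothetical helpers above, could look like:

```python
def nce_objective(theta, c, x_data, x_noise, k, noise_std=2.0):
    # Estimates E_{x ~ p_data}[log p(D=1 | x)] + k * E_{x ~ q}[log p(D=0 | x)]
    # by averaging over the data samples and the k * n noise samples.
    log_d1, _ = log_posterior(x_data, theta, c, k, noise_std)
    _, log_d0 = log_posterior(x_noise, theta, c, k, noise_std)
    return log_d1.mean() + k * log_d0.mean()
```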
Minimizing this KL divergence (equivalently, maximizing $\mathcal{L}_{\text{NCE}}$) gives us the optimal values for $\theta$ and $c$. To do so, we must compute our probabilities; we'll need to know our noise distribution (which gives us $q(x)$) and also estimate the partition with $c \approx -\log Z$.

Unlike the naive case, where $c$ can be chosen to explode our probabilities to infinity, $p_\theta(x)$ appears in both the numerator and denominator here: if our model predicts a sky-high $p_\theta(x)$ for every input, whether real or noise, then $p_\theta(D=1 \mid x) \to 1$ on real samples, which is good, but $p_\theta(D=0 \mid x) \to 0$ on noise samples, which is wrong. Thus, this objective controls $c$, forcing the model to predict high values of $p_\theta(x)$ for real samples $x \sim p_{\text{data}}$, thereby making $p_\theta(D=1 \mid x) \to 1$, while predicting low values for noise samples $x \sim q$ to make $p_\theta(D=0 \mid x) \to 1$, which maximizes our objective.
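A quick sanity check on the toy sketch illustrates this: holding $\theta$ at the (assumed) true data mean and inflating $c$ far above $-\log Z$ wrecks the noise term, so the overall objective drops rather than rises. The specific numbers depend entirely on my assumed toy distributions.

```python
x_all, d = sample_joint(n=5000, k=10)
x_d, x_n = x_all[d == 1], x_all[d == 0]
for c in (-0.5 * np.log(2.0 * np.pi), 5.0, 20.0):
    # The first c is the "correct" -log Z; the others blow the model up.
    print(f"c = {c:6.2f}  NCE objective = {nce_objective(2.0, c, x_d, x_n, k=10):8.3f}")
```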
Moreover, a closer analysis of the derivative of this objective shows that

$$\nabla_\theta \mathcal{L}_{\text{NCE}} = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\frac{k\,q(x)}{p_\theta(x) + k\,q(x)}\,\nabla_\theta \log p_\theta(x)\right] - k\,\mathbb{E}_{x \sim q}\!\left[\frac{p_\theta(x)}{p_\theta(x) + k\,q(x)}\,\nabla_\theta \log p_\theta(x)\right],$$
which, as $k \to \infty$, approaches the gradient for the standard MLE,

$$\nabla_\theta\, \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big] = \mathbb{E}_{x \sim p_{\text{data}}}\big[\nabla_\theta \log p_\theta(x)\big].$$
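To see why, note that as $k \to \infty$ the weight in the first expectation saturates while the second term reduces to an integral of the model's gradient:

$$\frac{k\,q(x)}{p_\theta(x) + k\,q(x)} \to 1, \qquad k\,\mathbb{E}_{x \sim q}\!\left[\frac{p_\theta(x)}{p_\theta(x) + k\,q(x)}\,\nabla_\theta \log p_\theta(x)\right] \to \int \nabla_\theta\, p_\theta(x)\,dx = \nabla_\theta \int p_\theta(x)\,dx,$$

and the last quantity vanishes once $p_\theta$ is (approximately) normalized, since $\int p_\theta(x)\,dx \approx 1$ no longer depends on $\theta$.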
Thus, with a large enough $k$, optimizing our NCE objective moves us in the same direction as maximizing $\mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big]$.
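Putting the pieces together, here is a sketch of a full NCE training loop for the toy model, using the gradient expression above; for this $f_\theta$, $\nabla_\theta \log p_\theta(x) = x - \theta$, and the same weighted-difference form gives the gradient with respect to $c$, since $\nabla_c \log p_\theta(x) = 1$. The learning rate, batch size, and step count are arbitrary choices of mine.

```python
theta, c, k, lr = 0.0, 0.0, 10, 0.1
for step in range(500):
    x_all, d = sample_joint(n=200, k=k)
    x_d, x_n = x_all[d == 1], x_all[d == 0]
    _, log_d0_data = log_posterior(x_d, theta, c, k)
    log_d1_noise, _ = log_posterior(x_n, theta, c, k)
    w_data = np.exp(log_d0_data)    # p(D=0 | x) on data samples
    w_noise = np.exp(log_d1_noise)  # p(D=1 | x) on noise samples
    # Sample estimates of the NCE gradient for theta and for c (gradient ascent).
    grad_theta = (w_data * (x_d - theta)).mean() - k * (w_noise * (x_n - theta)).mean()
    grad_c = w_data.mean() - k * w_noise.mean()
    theta += lr * grad_theta
    c += lr * grad_c
# theta should drift toward the assumed data mean (2.0) and c toward
# -log Z = -0.5 * log(2 * pi) ~ -0.92 for this toy model.
print(theta, c)
```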