Energy

The energy function $F(x, y)$ returns a scalar that measures the compatibility between features of $x$ and $y$. The lower the energy, the "better" the pair $(x, y)$ is.
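
For concreteness, the energy function can be any parametric map from a pair to a scalar. Below is a minimal PyTorch sketch; the `EnergyNet` name and architecture are illustrative, not from the original notes.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Illustrative energy function F(x, y): maps a pair to a single scalar."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Lower output = higher compatibility between x and y.
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)
```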

Info

We can also use the energy function to model $P(y \mid x)$. Unlike feed-forward models that explicitly compute $y$ from $x$, the energy function implicitly models their dependencies. By doing this, it's possible to find multiple $y$ that have high compatibility with $x$, which cannot be done with an explicit model.
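
Assuming an energy function like the `EnergyNet` sketch above, inference can be framed as minimizing the energy over $y$ for a fixed $x$; starting from different initializations can land in different low-energy $y$. A rough sketch:

```python
import torch

def infer_y(energy, x, y_init, steps=100, lr=0.1):
    """Gradient-based inference: find a y with low energy for a fixed x.

    Different y_init values can settle into different low-energy modes,
    which is how one energy function can admit multiple compatible answers.
    """
    y = y_init.clone().requires_grad_(True)
    for _ in range(steps):
        e = energy(x, y).sum()
        (grad,) = torch.autograd.grad(e, y)
        y = (y - lr * grad).detach().requires_grad_(True)
    return y.detach()
```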

Probability Distribution

To convert $F(x, y)$ to a probability, we use the Gibbs-Boltzmann distribution,

$$P(y \mid x) = \frac{\exp(-\beta F(x, y))}{\int_{y'} \exp(-\beta F(x, y'))\,dy'},$$

where the partition function $Z(x) = \int_{y'} \exp(-\beta F(x, y'))\,dy'$ ensures that we get a valid distribution. $\beta$ is the reciprocal of the temperature; we often let $\beta = 1$, but if we tune it manually, the distribution becomes sharper as $\beta$ goes to infinity.
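
As an illustration, if $y$ only ranges over a finite candidate set, the partition function is just a sum and the whole distribution reduces to a softmax over negative energies. The helper below is a sketch, not part of the original notes:

```python
import torch

def gibbs_probs(energy, x, y_candidates, beta=1.0):
    """P(y | x) over a finite candidate set via the Gibbs-Boltzmann distribution.

    Stacks the energies F(x, y') for every candidate y', then normalizes:
    softmax(-beta * F) is exp(-beta * F) divided by the partition function.
    """
    energies = torch.stack([energy(x, y) for y in y_candidates])  # (num_candidates,)
    return torch.softmax(-beta * energies, dim=0)
```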

Info

This choice of distribution is not arbitrary: since we don't have any constraints on the system, we want to use the distribution that has maximum entropy. Solving the corresponding optimization problem gives us the distribution above.
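
One way to make this precise (assuming, beyond what is stated above, that the constraints are normalization and a fixed expected energy; with no energy constraint at all, maximum entropy would only give the uniform distribution):

$$\max_{P}\; -\int P(y \mid x)\,\log P(y \mid x)\,dy \quad \text{s.t.} \quad \int P(y \mid x)\,dy = 1, \qquad \int P(y \mid x)\,F(x, y)\,dy = c.$$

Setting the derivative of the Lagrangian $-P \log P - \lambda_1 P - \lambda_2 P F$ with respect to $P$ to zero gives $P(y \mid x) \propto \exp(-\lambda_2 F(x, y))$, with the multiplier $\lambda_2$ playing the role of $\beta$.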

Optimization

With our parameterization, we cannot directly optimize $P(y \mid x)$ since we can't calculate the partition function. Moreover, maximizing $\exp(-\beta F(x, y))$ alone doesn't guarantee increasing the likelihood of our data since it's not normalized, so any optimization step may be increasing $Z(x)$ more than $\exp(-\beta F(x, y))$.
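
To see the difficulty concretely (with $\beta = 1$), the negative log-likelihood is $-\log P(y \mid x) = F(x, y) + \log Z(x)$, and its gradient splits into two terms,

$$\nabla_\theta \big[-\log P(y \mid x)\big] = \nabla_\theta F(x, y) - \mathbb{E}_{\tilde{y} \sim P(\tilde{y} \mid x)}\big[\nabla_\theta F(x, \tilde{y})\big],$$

where $\theta$ are the parameters of the energy function $F$. The first term is easy to compute, but the expectation in the second term requires sampling from (or integrating over) the model distribution, which is exactly the intractable part.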

The solution to this problem is 🖖 Contrastive Divergence. The core idea is to sample $\tilde{y} \sim P(y \mid x)$ from the model and take a step in the direction of

$$\nabla_\theta F(x, \tilde{y}) - \nabla_\theta F(x, y),$$

which lowers the energy of the observed pair $(x, y)$ and raises the energy of the sampled pair $(x, \tilde{y})$.

Intuitively, this makes the training data more likely than a random sample from the model.
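
A minimal PyTorch sketch of one such update; `sample_negative` is a hypothetical placeholder for whatever sampler approximates a draw from the model (e.g., a few MCMC or Langevin steps), and `optimizer` is any torch.optim optimizer over the energy function's parameters:

```python
def contrastive_divergence_step(energy, optimizer, x, y_data, sample_negative):
    """One contrastive-divergence-style update (sketch).

    Descending on F(x, y_data) - F(x, y_neg) lowers the energy of the
    training pair and raises the energy of the sampled pair, matching the
    step direction described above.
    """
    y_neg = sample_negative(x).detach()  # approximate sample from the model
    loss = energy(x, y_data).mean() - energy(x, y_neg).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```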