The derivative of the log density with respect to its argument, $\nabla_x \log p(x)$, is called the score. Score matching tries to make the score of the model's distribution equal to the score of the data distribution by minimizing the Fisher divergence

$$D_F\bigl(p_{\text{data}} \,\|\, p_\theta\bigr) = \mathbb{E}_{p_{\text{data}}(x)}\left[\frac{1}{2}\bigl\|\nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x)\bigr\|_2^2\right].$$

For a model $p_\theta(x) = \frac{\tilde{p}_\theta(x)}{Z(\theta)}$, the score is $\nabla_x \log p_\theta(x) = \nabla_x \log \tilde{p}_\theta(x) - \nabla_x \log Z(\theta)$. Since $Z(\theta)$ isn't a function of $x$, $\nabla_x \log Z(\theta) = 0$. Thus, we avoid differentiating the partition function altogether.
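
As a concrete illustration, here is a minimal sketch (assuming PyTorch and a hypothetical unnormalized log-density network `log_p_tilde`) of computing the model's score with autograd; the partition function never shows up.

```python
import torch
import torch.nn as nn

# Hypothetical unnormalized model: log p_tilde_theta(x), one scalar per sample.
log_p_tilde = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def score(x):
    """grad_x log p_theta(x) = grad_x log p_tilde_theta(x), since grad_x log Z(theta) = 0."""
    x = x.requires_grad_(True)
    out = log_p_tilde(x).sum()  # summing over the batch keeps per-sample gradients separate
    return torch.autograd.grad(out, x, create_graph=True)[0]

x = torch.randn(8, 2)
print(score(x).shape)  # torch.Size([8, 2]): one score vector per sample
```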

However, it's not possible to directly find the score of the true distribution $p_{\text{data}}(x)$. A series of derivations from our original objective shows that we can minimize the following instead:

$$\mathbb{E}_{p_{\text{data}}(x)}\left[\sum_i \frac{\partial^2 \log p_\theta(x)}{\partial x_i^2} + \frac{1}{2}\left(\frac{\partial \log p_\theta(x)}{\partial x_i}\right)^2\right],$$

also written as

$$\mathbb{E}_{p_{\text{data}}(x)}\left[\mathrm{tr}\bigl(\nabla_x^2 \log p_\theta(x)\bigr) + \frac{1}{2}\bigl\|\nabla_x \log p_\theta(x)\bigr\|_2^2\right],$$

where $\nabla_x^2 \log p_\theta(x)$ is the Hessian. In most cases, we can perform ⛰️ Gradient Descent on this objective.
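
To make the objective concrete, here is a minimal sketch (assuming PyTorch and a `score_fn` like the one above) that estimates it on a batch, computing the Hessian trace one diagonal entry at a time; this loop over dimensions is exactly the cost that the variants below avoid.

```python
import torch

def score_matching_loss(score_fn, x):
    """Estimate E[ tr(Hessian of log p_theta) + 0.5 * ||score||^2 ] on a batch."""
    x = x.requires_grad_(True)
    s = score_fn(x)                          # (batch, dim) score estimates
    norm_term = 0.5 * (s ** 2).sum(dim=1)
    trace_term = torch.zeros_like(norm_term)
    for i in range(x.shape[1]):              # one backprop per input dimension
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0][:, i]
        trace_term = trace_term + grad_i
    return (trace_term + norm_term).mean()
```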

Ratio Matching

Since we're taking derivatives, score matching only works for continuous data. In the discrete case, specifically for binary data, ratio matching seeks to minimize

$$\mathbb{E}_{p_{\text{data}}(x)}\left[\sum_{i=1}^{d} \left(\frac{1}{1 + \frac{p_\theta(x)}{p_\theta(f(x, i))}}\right)^2\right],$$

where $f(x, i)$ returns $x$ with the $i$th bit flipped. This was derived using the same idea as the 🙃 Pseudo-Likelihood: the ratio of two probabilities cancels out the partition function.
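
A minimal sketch (assuming PyTorch and a hypothetical `log_unnorm` function returning the unnormalized log probability of each binary vector) of this objective; the ratio is formed in log space, so the partition function cancels just as in the pseudo-likelihood.

```python
import torch

def ratio_matching_loss(log_unnorm, x):
    """x: (batch, d) tensor of 0/1 bits; log_unnorm(x): (batch,) unnormalized log probabilities."""
    loss = 0.0
    log_px = log_unnorm(x)
    for i in range(x.shape[1]):
        x_flip = x.clone()
        x_flip[:, i] = 1.0 - x_flip[:, i]             # f(x, i): flip the i-th bit
        log_ratio = log_px - log_unnorm(x_flip)       # log [p(x) / p(f(x, i))]; Z cancels
        loss = loss + torch.sigmoid(-log_ratio) ** 2  # equals (1 / (1 + p(x)/p(f(x, i))))^2
    return loss.mean()
```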

Denoising Score Matching

Finding the trace of the Hessian is computationally expensive, especially if $\log p_\theta$ is parameterized by a neural network. We can avoid this computation with denoising score matching, which adds noise to our data distribution. Formally, our noisy distribution is

$$q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx,$$

where $q_\sigma(\tilde{x} \mid x)$ is a corruption process.

Sampling $\tilde{x} \sim q_\sigma$ for our objective instead of $x \sim p_{\text{data}}$, we can derive

$$\mathbb{E}_{p_{\text{data}}(x)}\, \mathbb{E}_{q_\sigma(\tilde{x} \mid x)}\left[\frac{1}{2}\bigl\|\nabla_{\tilde{x}} \log p_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\bigr\|_2^2\right] + C,$$

where $C$ is a constant that we can discard. If we let

$$q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I),$$

then we can easily compute

$$\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}.$$

Note that the main drawback here is that our model is now trained on a different distribution than our data, though it should be decently close if $\sigma$ is small. However, we have also gained the ability to denoise samples via Tweedie's formula,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$
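
A minimal sketch (assuming PyTorch and a hypothetical `score_net` that predicts $\nabla_{\tilde{x}} \log p_\theta(\tilde{x})$) of the Gaussian case above, plus the corresponding Tweedie denoiser.

```python
import torch

def dsm_loss(score_net, x, sigma=0.1):
    """Denoising score matching with q_sigma(x_tilde | x) = N(x_tilde; x, sigma^2 I)."""
    x_tilde = x + sigma * torch.randn_like(x)
    target = -(x_tilde - x) / sigma ** 2       # grad_{x_tilde} log q_sigma(x_tilde | x)
    return 0.5 * ((score_net(x_tilde) - target) ** 2).sum(dim=1).mean()

def denoise(score_net, x_tilde, sigma=0.1):
    """Tweedie's formula, using score_net as an estimate of grad log q_sigma(x_tilde)."""
    return x_tilde + sigma ** 2 * score_net(x_tilde)
```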

Sliced Score Matching

An alternative to denoising score matching is sliced score matching, which trains our model on the original data distribution at a speed comparable to the denoising method. The key insight is that if two scores match, their projections onto any direction also match. Thus, we can choose random projection directions $v$ and aim to minimize the sliced Fisher divergence

$$\mathbb{E}_{p_v}\, \mathbb{E}_{p_{\text{data}}(x)}\left[\frac{1}{2}\bigl(v^\top \nabla_x \log p_{\text{data}}(x) - v^\top \nabla_x \log p_\theta(x)\bigr)^2\right].$$

After some derivations, we have the objective

$$\mathbb{E}_{p_v}\, \mathbb{E}_{p_{\text{data}}(x)}\left[v^\top \nabla_x^2 \log p_\theta(x)\, v + \frac{1}{2}\bigl(v^\top \nabla_x \log p_\theta(x)\bigr)^2\right].$$

Unlike $\mathrm{tr}\bigl(\nabla_x^2 \log p_\theta(x)\bigr)$, computing

$$v^\top \nabla_x^2 \log p_\theta(x)\, v = v^\top \nabla_x\bigl(v^\top \nabla_x \log p_\theta(x)\bigr)$$

is scalable and can be done in a single backpropagation.
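
A minimal sketch (assuming PyTorch and a `score_fn` as before) of this sliced objective: one pass computes the scores, and a single extra backpropagation of $v^\top \nabla_x \log p_\theta(x)$ gives the $v^\top \nabla_x^2 \log p_\theta(x)\, v$ term.

```python
import torch

def sliced_score_matching_loss(score_fn, x):
    """Estimate the sliced objective with one random Gaussian direction per sample."""
    x = x.requires_grad_(True)
    v = torch.randn_like(x)                                 # projection directions, E[v v^T] = I
    s = score_fn(x)                                         # (batch, dim) scores
    vs = (v * s).sum()                                      # sum of v^T s over the batch
    gvs = torch.autograd.grad(vs, x, create_graph=True)[0]  # grad_x (v^T s): one extra backprop
    vhv = (v * gvs).sum(dim=1)                              # v^T Hessian v per sample
    norm_term = 0.5 * (v * s).sum(dim=1) ** 2               # 0.5 (v^T s)^2 per sample
    return (vhv + norm_term).mean()
```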