The derivative of the log density with respect to its argument, $\nabla_x \log p(x)$, is called the score. Score matching tries to make the score of the model's distribution equal to the score of the data distribution by minimizing the Fisher divergence

$$\mathbb{E}_{p_{\text{data}}(x)}\left[\frac{1}{2}\left\|\nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x)\right\|_2^2\right].$$
Since $Z_\theta$ isn't a function of $x$, $\nabla_x \log p_\theta(x) = \nabla_x \log \tilde{p}_\theta(x) - \nabla_x \log Z_\theta = \nabla_x \log \tilde{p}_\theta(x)$. Thus, we avoid differentiating the partition function altogether.
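To see this concretely, here is a minimal JAX sketch using a 1-D standard Gaussian as a toy model (my choice of example, not from the original text): the gradient of the unnormalized log density matches the gradient of the normalized one, since the normalization constant vanishes under differentiation.

```python
import jax
import jax.numpy as jnp

# Toy example (an assumption for illustration): a 1-D standard Gaussian.
def log_p_tilde(x):
    # Unnormalized log density: log p̃(x) = -x²/2, partition function dropped.
    return -0.5 * x**2

def log_p(x):
    # Normalized log density: the extra term is log Z, a constant in x.
    return -0.5 * x**2 - 0.5 * jnp.log(2 * jnp.pi)

x = 1.3
# The two scores agree because ∇ₓ log Z = 0.
print(jax.grad(log_p_tilde)(x), jax.grad(log_p)(x))  # both -1.3
```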
However, it's not possible to directly find the score of the true distribution $p_{\text{data}}(x)$. A series of derivations from our original objective shows that we can minimize the following instead:

$$\mathbb{E}_{p_{\text{data}}(x)}\left[\sum_{i=1}^{d}\left(\frac{\partial^2 \log p_\theta(x)}{\partial x_i^2} + \frac{1}{2}\left(\frac{\partial \log p_\theta(x)}{\partial x_i}\right)^2\right)\right],$$

also written as

$$\mathbb{E}_{p_{\text{data}}(x)}\left[\operatorname{tr}\left(\nabla_x^2 \log p_\theta(x)\right) + \frac{1}{2}\left\|\nabla_x \log p_\theta(x)\right\|_2^2\right],$$
where $\nabla_x^2 \log p_\theta(x)$ is the Hessian. In most cases, we can perform ⛰️ Gradient Descent on this objective.
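As a sketch of what this objective looks like in code, the following JAX snippet estimates the explicit score matching loss by computing the score with `jax.grad` and the Hessian with `jax.jacfwd`; the toy Gaussian-like model is an assumption standing in for a real network.

```python
import jax
import jax.numpy as jnp

def make_score_matching_loss(log_p):
    """Explicit score matching loss for a single point:
    tr(∇ₓ² log pθ(x)) + ½ ‖∇ₓ log pθ(x)‖²."""
    score = jax.grad(log_p)      # ∇ₓ log pθ(x)
    hessian = jax.jacfwd(score)  # ∇ₓ² log pθ(x)

    def loss(x):
        return jnp.trace(hessian(x)) + 0.5 * jnp.sum(score(x) ** 2)

    return loss

# Toy unnormalized model (an assumption): log p̃θ(x) = -‖x‖²/2.
log_p_tilde = lambda x: -0.5 * jnp.sum(x**2)
loss_fn = jax.vmap(make_score_matching_loss(log_p_tilde))

x_batch = jax.random.normal(jax.random.PRNGKey(0), (128, 2))
print(jnp.mean(loss_fn(x_batch)))  # Monte Carlo estimate of the objective
```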
Since we're taking derivatives, score matching only works for continuous data. In the discrete case, specifically for binary data, ratio matching seeks to minimize

$$\mathbb{E}_{p_{\text{data}}(x)}\left[\sum_{j=1}^{n}\left(\frac{1}{1 + \frac{p_\theta(x)}{p_\theta(f(x, j))}}\right)^2\right],$$

where $f(x, j)$ returns $x$ with the $j$th bit flipped. This was derived using the same idea as the Pseudo-Likelihood: the ratio of two probabilities cancels out the partition function.
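A minimal sketch of this objective, assuming a toy fully-visible Boltzmann-machine-style unnormalized model over binary vectors: the probability ratio reduces to a difference of unnormalized log probabilities, so the partition function never appears.

```python
import jax
import jax.numpy as jnp

# Toy unnormalized log-probability over binary vectors (an assumption):
# log p̃θ(x) = xᵀWx + bᵀx.
def log_p_tilde(params, x):
    W, b = params
    return x @ W @ x + b @ x

def ratio_matching_loss(params, x):
    """Σⱼ (1 / (1 + pθ(x) / pθ(f(x, j))))², where f(x, j) flips bit j.
    Only log p̃θ is needed, since Zθ cancels in the ratio."""
    def per_bit(j):
        x_flip = x.at[j].set(1.0 - x[j])
        log_ratio = log_p_tilde(params, x_flip) - log_p_tilde(params, x)
        # 1 / (1 + pθ(x)/pθ(x_flip)) = sigmoid(log p̃θ(x_flip) - log p̃θ(x))
        return jax.nn.sigmoid(log_ratio) ** 2
    return jnp.sum(jax.vmap(per_bit)(jnp.arange(x.shape[0])))

key = jax.random.PRNGKey(0)
params = (0.1 * jax.random.normal(key, (4, 4)), jnp.zeros(4))
x = jnp.array([1.0, 0.0, 1.0, 1.0])
print(ratio_matching_loss(params, x))
```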
Finding the trace of the Hessian is computationally expensive, especially if $\log p_\theta$ is parameterized by a neural network. We can avoid this computation with denoising score matching, which adds noise to our data distribution. Formally, our noisy distribution is

$$q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx,$$

where $q_\sigma(\tilde{x} \mid x)$ is a corruption process.
Sampling $\tilde{x} \sim q_\sigma$ for our objective instead of $x \sim p_{\text{data}}$, we can derive

$$\mathbb{E}_{q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)}\left[\frac{1}{2}\left\|\nabla_{\tilde{x}} \log p_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\right\|_2^2\right] + C,$$
where $C$ is a constant that we can discard. If we let

$$q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x, \sigma^2 I),$$

then we can easily compute

$$\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}.$$
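Putting the pieces together, a minimal denoising score matching loss might look like the sketch below, where `score_fn` is a stand-in for the model's score network and `sigma` is an assumed noise level; note that no Hessian appears anywhere.

```python
import jax
import jax.numpy as jnp

def dsm_loss(score_fn, params, x, key, sigma=0.1):
    """Denoising score matching with Gaussian corruption:
    ½ ‖sθ(x̃) - ∇_x̃ log qσ(x̃|x)‖², with target -(x̃ - x)/σ²."""
    noise = sigma * jax.random.normal(key, x.shape)
    x_tilde = x + noise                 # sample x̃ ~ N(x, σ²I)
    target = -(x_tilde - x) / sigma**2  # closed-form score of the corruption kernel
    return 0.5 * jnp.sum((score_fn(params, x_tilde) - target) ** 2)

# Hypothetical score model (an assumption): sθ(x) = -x / scale².
score_fn = lambda params, x: -x / params**2
params = jnp.array(1.0)
x = jax.random.normal(jax.random.PRNGKey(0), (2,))
print(dsm_loss(score_fn, params, x, jax.random.PRNGKey(1)))
```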
Note that the main drawback here is that our model is now trained on a different distribution than our data, though it should be decently close if we have small $\sigma$. However, we have also gained the ability to denoise samples via Tweedie's formula,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$
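As a small illustration, a trained score model can denoise a sample in one step with Tweedie's formula; the sketch below assumes a `score_fn(params, x)` in the same form as the previous example, with the learned score standing in for the true noisy score.

```python
import jax.numpy as jnp

def tweedie_denoise(score_fn, params, x_tilde, sigma):
    """One-step denoising: E[x | x̃] = x̃ + σ² ∇_x̃ log qσ(x̃),
    approximating the noisy score with the learned score sθ(x̃)."""
    return x_tilde + sigma**2 * score_fn(params, x_tilde)
```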
An alternative to denoising score matching is sliced score matching, which trains our model on the data distribution at a comparable speed to the denoising method. The key insight is that if two scores match, their projections onto any direction also match. Thus, we can choose random directions $v \sim p_v$ and aim to minimize the sliced Fisher divergence

$$\mathbb{E}_{p_v}\,\mathbb{E}_{p_{\text{data}}(x)}\left[\frac{1}{2}\left(v^\top \nabla_x \log p_{\text{data}}(x) - v^\top \nabla_x \log p_\theta(x)\right)^2\right].$$
After some derivations, we have the objective

$$\mathbb{E}_{p_v}\,\mathbb{E}_{p_{\text{data}}(x)}\left[v^\top \nabla_x^2 \log p_\theta(x)\, v + \frac{1}{2}\left(v^\top \nabla_x \log p_\theta(x)\right)^2\right].$$
Unlike $\operatorname{tr}\left(\nabla_x^2 \log p_\theta(x)\right)$, computing

$$v^\top \nabla_x^2 \log p_\theta(x)\, v$$

is scalable and can be done in a single backpropagation.
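Here is a sketch of the per-sample sliced score matching loss in JAX, again with a toy model standing in for a real network: one way to get the Hessian-vector product $v^\top \nabla_x^2 \log p_\theta(x)\, v$ is a single forward-mode JVP through the score function.

```python
import jax
import jax.numpy as jnp

def ssm_loss(log_p, x, v):
    """Sliced score matching for one point and one projection v:
    vᵀ ∇ₓ² log pθ(x) v + ½ (vᵀ ∇ₓ log pθ(x))²."""
    score_fn = jax.grad(log_p)
    # jvp returns (∇ₓ log pθ(x), ∇ₓ² log pθ(x) v) in one pass.
    score, hvp = jax.jvp(score_fn, (x,), (v,))
    return v @ hvp + 0.5 * (v @ score) ** 2

# Toy unnormalized model (an assumption): log p̃θ(x) = -‖x‖²/2.
log_p_tilde = lambda x: -0.5 * jnp.sum(x**2)
x = jax.random.normal(jax.random.PRNGKey(0), (3,))
v = jax.random.normal(jax.random.PRNGKey(1), (3,))  # projection direction, e.g. v ~ N(0, I)
print(ssm_loss(log_p_tilde, x, v))
```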