The natural gradient is a ⛰️ Gradient Descent procedure that uses a different constraint for the optimization step. Gradient descent finds

$$d^* = \arg\min_{d} L(\theta + d) \quad \text{s.t.} \quad \|d\|_2 \le \epsilon.$$

That is, the constraint for standard gradient descent is defined by the Euclidean norm $\|d\|_2$ in parameter space. Sometimes (such as with 🚜 Natural Policy Gradient), we care more about the constraint in the probability space defined by $p_\theta$. That is, we want to solve the following instead:

$$d^* = \arg\min_{d} L(\theta + d) \quad \text{s.t.} \quad D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta + d}\right) \le \epsilon.$$

To solve this, we first need to make an approximation. It's difficult to use the ✂️ KL Divergence constraint directly in our optimization, but a second-order Taylor expansion lets us substitute

$$D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta + d}\right) \approx \frac{1}{2}\, d^\top F\, d,$$

where $F$ is the Fisher information matrix

$$F = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right].$$

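A quick way to see where this quadratic form comes from: Taylor-expand the KL divergence in its second argument around $\theta$. The zeroth-order term is zero, the first-order term vanishes because the expected score is zero, and the Hessian at $\theta$ is exactly $F$:

$$
D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta+d}\right)
\approx \underbrace{D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_\theta\right)}_{=\,0}
+ d^\top \underbrace{\nabla_{\theta'} D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta'}\right)\big|_{\theta'=\theta}}_{=\,0}
+ \frac{1}{2}\, d^\top \underbrace{\nabla_{\theta'}^2 D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta'}\right)\big|_{\theta'=\theta}}_{=\,F}\, d
= \frac{1}{2}\, d^\top F\, d.
$$
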
We can approximate $F$ via samples $x_1, \dots, x_N$ from our distribution:

$$\hat{F} = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(x_i)\, \nabla_\theta \log p_\theta(x_i)^\top.$$

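A minimal NumPy sketch of this estimator, assuming a hypothetical `score_fn(x)` helper that returns the score vector $\nabla_\theta \log p_\theta(x)$ for a single sample:

```python
import numpy as np

def empirical_fisher(samples, score_fn):
    """Estimate the Fisher information matrix from samples of p_theta.

    samples:  iterable of draws x_i ~ p_theta
    score_fn: maps a sample x to its score vector grad_theta log p_theta(x),
              shape (n_params,)  (hypothetical helper, not defined here)
    """
    scores = np.stack([score_fn(x) for x in samples])  # shape (N, n_params)
    # Average of the outer products of each score vector with itself.
    return scores.T @ scores / len(scores)
```
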
Then, solving the Lagrangian, we get the gradient step

$$d^* = -\frac{1}{\lambda}\, F^{-1} \nabla_\theta L(\theta),$$

where $\lambda$ is our Lagrange multiplier. We can pick $\frac{1}{\lambda}$, analogous to the learning rate, or compute it as

$$\frac{1}{\lambda} = \sqrt{\frac{2\epsilon}{\nabla_\theta L(\theta)^\top F^{-1} \nabla_\theta L(\theta)}}$$

if we decide to set $D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta + d^*}\right) = \epsilon$, i.e. take the largest step the constraint allows under the quadratic approximation.
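
Putting the pieces together, here is a minimal sketch of one natural-gradient update. It assumes `grad` is $\nabla_\theta L(\theta)$ and `F` is an estimated Fisher matrix (e.g. from `empirical_fisher` above); solving a linear system instead of explicitly inverting $F$, and adding a small damping term, are practical choices on my part rather than part of the derivation above:

```python
import numpy as np

def natural_gradient_step(theta, grad, F, epsilon=0.01, damping=1e-4):
    """One natural-gradient update under the KL trust-region constraint.

    theta:   current parameters, shape (n_params,)
    grad:    gradient of the loss at theta, grad_theta L(theta)
    F:       estimated Fisher information matrix, shape (n_params, n_params)
    epsilon: KL trust-region radius
    damping: small ridge term so the solve stays well conditioned (practical choice)
    """
    # Natural gradient direction F^{-1} grad, computed via a linear solve.
    nat_grad = np.linalg.solve(F + damping * np.eye(len(grad)), grad)
    # Step size 1/lambda that makes the quadratic KL approximation equal epsilon.
    step_size = np.sqrt(2.0 * epsilon / (grad @ nat_grad))
    return theta - step_size * nat_grad
```

For large parameter vectors one would typically avoid forming $F$ explicitly (for example, by using conjugate gradient with Fisher-vector products), but the dense version keeps the correspondence with the formulas above clear.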