Policy gradients are a class of reinforcement learning algorithms that directly optimize our objective,

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],$$

where $\pi_\theta$ is a policy parameterized by $\theta$, $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory obtained by running $\pi_\theta$ in the environment, and $R(\tau) = \sum_t r_t$ is the total reward collected along $\tau$.
We seek to find the gradient of our objective with respect to the policy parameters $\theta$,

$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big].$$
Unfortunately, directly calculating this gradient is intractable: the distribution over trajectories depends on the environment dynamics, which are unknown and cannot be differentiated through. The Policy Gradient Theorem sidesteps this, telling us that

$$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi_\theta(a \mid s),$$

where $\mu$ is the on-policy state distribution and $q_\pi$ is the action-value function of $\pi_\theta$.
To actually calculate this value, we can further rewrite it with the Log Derivative Trick ($\nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta$), so

$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\pi_\theta}\big[\,G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\,\big],$$

where $G_t$ is the return from time $t$; this expectation can now be estimated from sampled trajectories.
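To see why the trick helps, note that multiplying and dividing by $\pi_\theta$ turns the inner sum over actions into an expectation we can sample:

$$\sum_a q_\pi(s, a)\, \nabla_\theta \pi_\theta(a \mid s) = \sum_a \pi_\theta(a \mid s)\, q_\pi(s, a)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \mathbb{E}_{a \sim \pi_\theta}\big[\,q_\pi(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\,\big].$$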
In the proof of this theorem, the constant of proportionality is the average length of an episode in the episodic case and equal to $1$ in the continuing case.
Intuitively, this is a supervised learning update with the gradients scaled by the reward; thus, if the reward is high, our policy becomes more likely to produce this trajectory again. In a sense, this is simply "trial and error."
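To make the "gradients scaled by reward" view concrete, here is a minimal sketch of a REINFORCE-style update in PyTorch; the network shape, the 4-dimensional observations, 2 discrete actions, and the function name are assumptions for illustration, not from the original.

```python
import torch

# A small sketch policy: 4-dim observations -> logits over 2 actions (assumed dims).
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 2),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """One gradient step on the loss -sum_t G_t * log pi(a_t | s_t).

    states:  (T, 4) float tensor of visited states
    actions: (T,)   long tensor of actions taken
    returns: (T,)   float tensor of returns G_t
    """
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)      # log pi(a_t | s_t), shape (T,)
    loss = -(returns * log_probs).sum()     # supervised-style loss, scaled by reward
    optimizer.zero_grad()
    loss.backward()                         # gradient of -G_t * log pi
    optimizer.step()
```

Note the sign: maximizing reward-weighted log-likelihood is implemented as minimizing its negative, so high-reward trajectories pull the policy toward repeating their actions.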
The above formula is the core equation for many policy gradient algorithms; more generally, most can be expressed in the form

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right],$$

where $\Psi_t$ is some measure of reward for step $t$, e.g., the total return $R(\tau)$, the reward-to-go $\sum_{t' \geq t} r_{t'}$, or an advantage estimate $A(s_t, a_t)$.
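For instance, the reward-to-go choice of $\Psi_t$ can be computed in a single right-to-left pass over an episode's rewards; this is a sketch, with the discount factor and function name chosen for illustration.

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Psi_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed right to left."""
    psi = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        psi[t] = running
    return psi

print(reward_to_go([1.0, 0.0, 1.0], gamma=0.5))  # -> [1.25 0.5  1.  ]
```

Using reward-to-go instead of the full return $R(\tau)$ reduces variance, since the action at step $t$ is only credited with rewards that come after it.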