Policy gradient methods are a class of reinforcement learning algorithms that directly optimize our objective,

$$J(\theta) = \sum_{s \in \mathcal{S}} d^{\pi}(s)\, V^{\pi}(s) = \sum_{s \in \mathcal{S}} d^{\pi}(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s)\, Q^{\pi}(s, a),$$
where

$$d^{\pi}(s) = \lim_{t \to \infty} P(s_t = s \mid s_0, \pi_\theta)$$

is the stationary distribution of states under the policy $\pi_\theta$.
We seek to find the gradient of our objective with respect to $\theta$ and use ⛰️ Gradient Descent (ascent here, since we are maximizing $J$),

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$
Unfortunately, directly calculating $\nabla_\theta J(\theta)$ is not possible in general, since the distribution of trajectories (and hence of states) depends on the environment's dynamics, which are unknown. However, we can derive another form, known as the policy gradient theorem:

$$\nabla_\theta J(\theta) \propto \sum_{s \in \mathcal{S}} d^{\pi}(s) \sum_{a \in \mathcal{A}} Q^{\pi}(s, a)\, \nabla_\theta \pi_\theta(a \mid s).$$
To actually calculate this value, we can further rewrite it with the 🦄 Log Derivative Trick, using the identity $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$ to turn the weighted sum over actions into an expectation we can sample, so

$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}\left[ Q^{\pi}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right].$$
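As a sanity check on the trick itself, here is a minimal NumPy sketch (independent of any RL library) that estimates the gradient of $\mathbb{E}_{x \sim \mathcal{N}(\mu, 1)}[x^2]$ with respect to $\mu$ using only samples and the score function $\nabla_\mu \log p(x \mid \mu) = x - \mu$, then compares it to the analytic answer $2\mu$. The Gaussian and the quadratic "reward" are illustrative choices, not part of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 1.5            # parameter we differentiate with respect to
n_samples = 200_000

# Sample x ~ N(mu, 1) and evaluate the "reward" f(x) = x^2.
x = rng.normal(loc=mu, scale=1.0, size=n_samples)
f = x ** 2

# Log-derivative (score-function) estimator:
#   d/dmu E[f(x)] = E[ f(x) * d/dmu log p(x | mu) ],  with d/dmu log p = (x - mu).
score = x - mu
grad_estimate = np.mean(f * score)

print(f"score-function estimate: {grad_estimate:.3f}")
print(f"analytic gradient 2*mu : {2 * mu:.3f}")
```

The estimator never differentiates through the sampling process, which is exactly why the same idea lets us estimate $\nabla_\theta J(\theta)$ without differentiating through the environment.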
In the proof of this theorem, the constant of proportionality is the average length of an episode in the episodic case and $\frac{1}{1 - \gamma}$ in the infinite-horizon case. We can also arrive at an equality by "capturing" this constant in the expectation, calculating over full trajectories $\tau$,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],$$

where $R(\tau) = \sum_{t=0}^{T} r_t$ is the total reward of the trajectory.
Intuitively, this is a supervised learning update with the gradients scaled by the reward: if the reward is high, the policy becomes more likely to reproduce that trajectory. In a sense, this is simply "trial and error."
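To make the "gradients scaled by the reward" reading concrete, here is a minimal PyTorch sketch of that update. The two-action policy network, the stand-in random "environment", and the hyperparameters are all illustrative assumptions rather than anything specified above, but the surrogate loss $-R(\tau)\sum_t \log \pi_\theta(a_t \mid s_t)$ is the trajectory formula with its sign flipped so that a gradient descent step performs the policy gradient ascent step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy discrete policy: maps a 4-dim observation to logits over 2 actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def rollout(steps=20):
    """Collect one trajectory from a stand-in environment.

    The 'environment' is just random observations with a reward that favours
    action 1, so the script runs without external dependencies.
    """
    log_probs, rewards = [], []
    for _ in range(steps):
        obs = torch.randn(4)
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        rewards.append(1.0 if action.item() == 1 else 0.0)
    return torch.stack(log_probs), torch.tensor(rewards)

for step in range(200):
    log_probs, rewards = rollout()
    trajectory_return = rewards.sum()            # R(tau)
    # Surrogate loss: minus (return-weighted sum of log-probabilities),
    # so minimizing it ascends the policy gradient.
    loss = -(trajectory_return * log_probs.sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step:3d}  return {trajectory_return.item():.1f}")
```

Because the reward weights every log-probability in the trajectory, high-return trajectories push all of their actions toward higher probability, which is the "trial and error" behaviour described above.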

The trajectory formula above is the core equation behind many policy gradient algorithms; more generally, most can be expressed in the form

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],$$

where $\Psi_t$ can be one of multiple choices: