The Monte Carlo policy gradient, commonly called REINFORCE, estimates the gradient at every time step using samples from the trajectory.

We start with the gradient equation,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right],$$

which we approximate with $N$ sampled trajectories,

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right).$$

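As a concrete illustration, here is a minimal sketch of this estimator in PyTorch. The network, the two-action categorical policy, the tensor shapes, and the function names below are illustrative assumptions rather than anything from these notes.

```python
import torch
import torch.nn as nn

# A small, assumed policy network: 4-dimensional observations -> logits over 2 actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def reinforce_loss(obs, actions, rewards):
    """obs: (N, T, 4) float, actions: (N, T) long, rewards: (N, T) float.

    Surrogate loss whose gradient (via .backward()) is the negated Monte Carlo
    policy gradient: (1/N) sum_i [sum_t grad log pi(a_t|s_t)] * [sum_t r_t].
    """
    logits = policy(obs)                                    # (N, T, 2)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                      # (N, T)
    trajectory_return = rewards.sum(dim=1)                  # (N,), total reward of each trajectory
    return -(log_probs.sum(dim=1) * trajectory_return).mean()
```

Minimizing this loss with any gradient-based optimizer then ascends the estimated policy gradient.
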
One key observation is that our policy at time $t'$ doesn't affect the reward at time $t$ for $t < t'$; this is called causality. Thus, in our gradient computation, any reward from the past shouldn't be applied to an action in the future. We can reflect this by distributing the reward summation into the gradient summation and limiting the summation to only rewards in the future,

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right).$$

More succinctly, we can observe that the inner reward summation is exactly our return from time step $t$, the reward-to-go $\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$, so our gradient is

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}.$$

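Under the same assumed PyTorch setup as above, the causality-respecting estimator only needs the suffix sums of the rewards (the reward-to-go), which a reversed cumulative sum provides; `rewards_to_go` and `reinforce_loss_causal` are hypothetical helper names.

```python
import torch  # `policy` is the assumed network from the earlier sketch

def rewards_to_go(rewards):
    """rewards: (N, T) -> (N, T), where entry [i, t] = rewards[i, t:].sum()."""
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])

def reinforce_loss_causal(obs, actions, rewards):
    """Per-time-step surrogate loss weighting each log-probability by its reward-to-go."""
    logits = policy(obs)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)  # (N, T)
    q_hat = rewards_to_go(rewards)                                                # (N, T)
    return -(log_probs * q_hat).sum(dim=1).mean()
```
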
Baseline

The expectation above has high variance due to the reward term. One observation is that our rewards can be arbitrary; if they're all positive, the gradient for a bad trajectory would still increase its probability, even if only by a little. Intuitively, we want good trajectories to increase in probability and bad trajectories to decrease in probability, so we can introduce a baseline, such as the average reward across the sampled trajectories,

$$b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i).$$

Then, we would simply measure a trajectory's reward relative to the baseline,

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\left(r(\tau_i) - b\right).$$

Note that introducing this constant doesn't change the expectation (with $b$ held fixed, $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = b \int \nabla_\theta p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$) but decreases variance.
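
Continuing the same assumed setup, subtracting the batch-average return as the baseline $b$ is a one-line change; `reinforce_loss_baseline` is again a hypothetical name.

```python
import torch  # `policy` is the assumed network from the first sketch

def reinforce_loss_baseline(obs, actions, rewards):
    """Trajectory-level surrogate loss with the average return used as a constant baseline."""
    logits = policy(obs)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)  # (N, T)
    trajectory_return = rewards.sum(dim=1)                                        # (N,)
    b = trajectory_return.mean()      # average return; constant w.r.t. theta, since rewards carry no gradient
    advantage = trajectory_return - b
    return -(log_probs.sum(dim=1) * advantage).mean()
```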

Optimal Baseline

Note that the average of the rewards is actually not the baseline that minimizes variance. A formal derivation (setting the derivative of the variance with respect to $b$ to zero) shows that, for each component $k$ of the gradient,

$$b_k = \frac{\mathbb{E}\left[g_k(\tau)^2\, r(\tau)\right]}{\mathbb{E}\left[g_k(\tau)^2\right]}, \quad \text{where } g(\tau) = \nabla_\theta \log p_\theta(\tau).$$

Intuitively, this baseline is the expected reward weighted by squared gradient magnitudes; notably, it takes a different value for each parameter of the gradient, whereas the simple average uses the same value for all. In practice, however, we often use the average baseline just for simplicity.
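
For completeness, here is a sketch of how the per-parameter optimal baseline could be estimated, assuming the per-trajectory score gradients $\nabla_\theta \log p_\theta(\tau_i)$ have already been flattened and stacked into an $(N, P)$ matrix; that stacking step, the function name, and the small `eps` stabilizer are all assumptions for illustration.

```python
import torch

def optimal_baseline(per_traj_grads, returns, eps=1e-8):
    """per_traj_grads: (N, P), row i = grad_theta log p_theta(tau_i); returns: (N,), entry i = r(tau_i).

    Estimates b_k = E[g_k(tau)^2 r(tau)] / E[g_k(tau)^2] separately for every parameter k.
    """
    g_sq = per_traj_grads ** 2                            # (N, P)
    numerator = (g_sq * returns[:, None]).mean(dim=0)     # (P,)
    denominator = g_sq.mean(dim=0) + eps                  # (P,); eps avoids division by zero
    return numerator / denominator                        # one baseline value per parameter
```

As the text notes, the extra bookkeeping of materializing per-trajectory gradients is usually skipped in favor of the simple average-return baseline.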