The Monte Carlo policy gradient, commonly called REINFORCE, estimates the policy gradient at every time step using sampled trajectories.
We start with the gradient equation

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=1}^{T} r(s_{i,t'}, a_{i,t'})\right),$$

where we average over $N$ sampled trajectories.
One key observation is that our policy at time $t$ cannot affect rewards received at earlier times $t' < t$ (causality), so we only need to sum the rewards from $t$ onward:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right).$$
More succinctly, we can observe that the reward summation is exactly our return (the reward-to-go) $\hat{R}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$, so our gradient is

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,\hat{R}_{i,t}.$$
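As a rough sketch (not from the original notes), this estimator could be computed from sampled trajectories as follows, assuming a hypothetical `grad_log_prob(state, action)` callable that returns $\nabla_\theta \log \pi_\theta(a \mid s)$ as a NumPy array:

```python
import numpy as np

def reinforce_gradient(trajectories, grad_log_prob):
    """Monte Carlo policy gradient estimate with reward-to-go.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples.
    grad_log_prob: callable (state, action) -> np.ndarray, grad of log pi(a|s) w.r.t. theta.
    """
    grad_estimate = None
    N = len(trajectories)
    for traj in trajectories:
        rewards = np.array([r for (_, _, r) in traj])
        # Reward-to-go: sum of rewards from step t to the end (reversed cumulative sum).
        rewards_to_go = np.cumsum(rewards[::-1])[::-1]
        for t, (s, a, _) in enumerate(traj):
            g = grad_log_prob(s, a) * rewards_to_go[t]
            grad_estimate = g if grad_estimate is None else grad_estimate + g
    # Average over the N sampled trajectories.
    return grad_estimate / N
```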
Baseline
The estimator above has high variance, driven largely by the reward term. One observation is that our rewards can be arbitrary in scale; if they're all positive, the gradient for a bad trajectory would still increase its probability, even if only by a little. Intuitively, we want good trajectories to increase in probability and bad trajectories to decrease, so we can introduce a baseline $b$. A natural choice is the average return over our sampled trajectories,

$$b = \frac{1}{N}\sum_{i=1}^{N} R(\tau_i),$$

where $R(\tau) = \sum_t r(s_t, a_t)$ is the total reward of trajectory $\tau$.
Then, we would simply measure a trajectory's reward relative to the baseline; writing $\nabla_\theta \log \pi_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, the gradient becomes

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta \log \pi_\theta(\tau_i)\,\bigl(R(\tau_i) - b\bigr).$$
Note that introducing this constant doesn't change the expectation (a quick proof, using the fact that $\pi_\theta$ integrates to 1):

$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(\tau)\, b\right] = b \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, d\tau = b \int \nabla_\theta \pi_\theta(\tau)\, d\tau = b\, \nabla_\theta \int \pi_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0.$$

So the estimator stays unbiased; the baseline only affects its variance.
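A minimal sketch of the baselined estimator over whole trajectories, under the same assumption of a hypothetical `grad_log_prob` callable:

```python
import numpy as np

def reinforce_with_baseline(trajectories, grad_log_prob):
    """Policy gradient over whole trajectories with an average-return baseline.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples.
    grad_log_prob: callable (state, action) -> np.ndarray, grad of log pi(a|s) w.r.t. theta.
    """
    returns = np.array([sum(r for (_, _, r) in traj) for traj in trajectories])
    baseline = returns.mean()  # b = average return over the sampled trajectories
    grad_estimate = None
    for traj, R in zip(trajectories, returns):
        # grad log pi(tau) = sum over time steps of grad log pi(a_t | s_t)
        g_tau = sum(grad_log_prob(s, a) for (s, a, _) in traj)
        # Weight by the trajectory's reward relative to the baseline.
        g = g_tau * (R - baseline)
        grad_estimate = g if grad_estimate is None else grad_estimate + g
    return grad_estimate / len(trajectories)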
Optimal Baseline
Note that the average of the rewards is actually not the best baseline for minimal variance. A formal derivation (setting the derivative of the variance with respect to $b$ to zero) shows that the variance-minimizing choice, taken separately for each component $k$ of the gradient, is

$$b_k = \frac{\mathbb{E}\left[g_k(\tau)^2\, R(\tau)\right]}{\mathbb{E}\left[g_k(\tau)^2\right]}, \qquad g(\tau) = \nabla_\theta \log \pi_\theta(\tau).$$
Intuitively, this baseline is the expected reward weighted by gradient magnitudes; notably, this one has different values for each parameter of the gradient whereas the simple average uses the same value for all. In practice, however, we often use the average baseline just for simplicity.
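For illustration, the per-parameter optimal baseline could be estimated from samples as in the sketch below, assuming the per-trajectory gradients $g(\tau_i)$ have already been stacked into a matrix (the argument names here are made up for the example):

```python
import numpy as np

def optimal_baseline(grad_log_probs, returns, eps=1e-8):
    """Per-parameter variance-minimizing baseline.

    grad_log_probs: array of shape (N, D); row i is grad log pi(tau_i).
    returns: array of shape (N,); entry i is R(tau_i).
    Returns an array of shape (D,): b_k = E[g_k^2 R] / E[g_k^2].
    """
    g_sq = grad_log_probs ** 2                            # elementwise g_k(tau_i)^2
    numerator = (g_sq * returns[:, None]).mean(axis=0)    # estimate of E[g_k^2 R]
    denominator = g_sq.mean(axis=0) + eps                 # estimate of E[g_k^2]
    return numerator / denominator
```

Each component $b_k$ would then be subtracted from $R(\tau_i)$ when weighting the $k$-th component of $\nabla_\theta \log \pi_\theta(\tau_i)$.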