Many Policy Gradient algorithms are on-policy, meaning that they require samples collected by running the same policy we're optimizing. Once we update our policy, we need to collect new samples.
The goal of the off-policy policy gradient is to update our policy using samples that were not generated by the policy we are currently optimizing. Let $\pi_{\theta}$ denote the behavior policy that collected our samples and $\pi_{\theta'}$ the policy whose parameters we want to update.
We start with the original objective and apply Importance Sampling to rewrite the expectation in terms of the sample distribution:

$$J(\theta') = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[ \frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}\, r(\tau) \right], \qquad \frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)},$$

where the initial-state and transition probabilities cancel in the ratio, leaving only the per-step action probabilities.
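As a quick illustration of the importance-sampling step (a toy sketch, not part of the derivation: made-up Gaussian densities stand in for the trajectory distributions $\pi_{\theta'}(\tau)$ and $\pi_{\theta}(\tau)$, and a simple function stands in for $r(\tau)$):

```python
import numpy as np

# Illustrative sketch: estimate E_p[f(x)] using samples drawn from q,
# reweighted by p(x) / q(x). Here p and q are stand-ins for the trajectory
# distributions pi_theta'(tau) and pi_theta(tau); f stands in for r(tau).
rng = np.random.default_rng(0)

def p_pdf(x):  # "target" distribution: N(1, 1)
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):  # "sample" distribution: N(0, 1)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def f(x):      # stand-in for the reward
    return x ** 2

x = rng.normal(loc=0.0, scale=1.0, size=200_000)    # samples from q only
is_estimate = np.mean(p_pdf(x) / q_pdf(x) * f(x))   # importance-weighted average
print(is_estimate)  # close to E_p[x^2] = Var + mean^2 = 2
```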
Next, for our policy gradient, we can apply the Log Derivative Trick:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[ \frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}\, \nabla_{\theta'} \log \pi_{\theta'}(\tau)\, r(\tau) \right]$$
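A minimal autodiff sketch of this estimator, assuming a hypothetical single-state (bandit-style) softmax policy and fake trajectories so it stays self-contained; none of the names below come from the notes:

```python
import torch

# Estimate grad J(theta') = E_{tau ~ pi_theta}[ w(tau) grad log pi_theta'(tau) r(tau) ]
# from a batch of trajectories collected with the behavior policy pi_theta.
A = 4                                            # number of discrete actions (made up)
theta_old = torch.randn(A)                       # behavior-policy logits (collected data)
theta_new = torch.randn(A, requires_grad=True)   # logits of the policy being optimized

T, N = 5, 32                                     # horizon and batch size (made up)
actions = torch.randint(0, A, (N, T))            # actions sampled by the behavior policy
returns = torch.randn(N)                         # total reward r(tau) per trajectory (fake)

logp_old = torch.log_softmax(theta_old, dim=-1)[actions].sum(dim=1)  # log pi_theta(tau)
logp_new = torch.log_softmax(theta_new, dim=-1)[actions].sum(dim=1)  # log pi_theta'(tau)

# Importance weight w(tau) = pi_theta'(tau) / pi_theta(tau), treated as a constant
# (detached) so the gradient flows only through the grad-log term.
w = (logp_new - logp_old).exp().detach()

# Surrogate whose gradient is the importance-weighted score-function estimator.
# Differentiating (ratio * returns).mean() without the detach gives the same
# gradient value, since grad(ratio) = ratio * grad(log pi_theta').
surrogate = (w * logp_new * returns).mean()
surrogate.backward()
print(theta_new.grad)                            # estimate of grad_{theta'} J(theta')
```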
While this gradient can work on its own, there are a few adjustments we make for practicality. First, we can incorporate causality (the policy at time $t$ cannot affect rewards received at earlier timesteps), which lets us distribute the importance weights across the sum:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[ \sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \left( \prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_{\theta}(a_{t'} \mid s_{t'})} \right) \left( \sum_{t''=t}^{T} r(s_{t''}, a_{t''}) \prod_{t'''=t}^{t''} \frac{\pi_{\theta'}(a_{t'''} \mid s_{t'''})}{\pi_{\theta}(a_{t'''} \mid s_{t'''})} \right) \right]$$
The difficulty here is that both the first and the second product contain a number of factors that grows with the horizon, so they can become exponentially large (or small), and we must find a way to reduce the variance of our estimates.
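To see why these products are troublesome, here is a small illustration (a modeling assumption for demonstration only, not part of the derivation): if each per-step ratio is treated as an independent random variable with mean 1, the variance of their product grows exponentially with the horizon $T$.

```python
import numpy as np

# Model each per-step ratio pi_theta'(a|s) / pi_theta(a|s) as an i.i.d. lognormal
# variable with mean 1 (an assumption for illustration). The product of T such
# ratios has variance exp(T * s^2) - 1, i.e., exponential in the horizon T.
rng = np.random.default_rng(0)
s = 0.3  # per-step log-ratio standard deviation (made up)

def empirical_weight_variance(T, n=200_000):
    ratios = rng.lognormal(mean=-s**2 / 2, sigma=s, size=(n, T))
    return ratios.prod(axis=1).var()

for T in (1, 10, 50, 100):
    print(T, empirical_weight_variance(T))
# The empirical variance itself gets noisy for large T because the weights are
# heavy-tailed; this is exactly the estimation problem described above.
```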
- We can ignore the second product, which is an importance weight on our future rewards, to recover a ♻️ Policy Iteration algorithm. This will still improve our policy, but we no longer have the actual gradient; however, an approximation will do.
- As for the first product, we can consider using only the final importance weight, the single ratio at timestep $t$, as an approximation. The reasoning behind this choice is that if we look at the state-action marginal (occupancy measure) formulation of our objective, we get

$$J(\theta') \approx \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_{\theta}(s_t, a_t)}\left[ \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\, r(s_t, a_t) \right]$$

by assuming $p_{\theta'}(s_t) \approx p_{\theta}(s_t)$, which drops the state-marginal ratio from the exact importance weight $\frac{p_{\theta'}(s_t, a_t)}{p_{\theta}(s_t, a_t)} = \frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}$.

In general, this is not the correct policy gradient. However, if $\pi_{\theta'}$ stays close to $\pi_{\theta}$, the state marginals remain close as well, so the error of this approximation is bounded and improving the approximate objective still improves the true one.
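As a rough sketch of what this per-timestep reweighting looks like in code (again a hypothetical single-state softmax policy with fake rewards; only the single-step ratio is used, and the state-marginal ratio is taken to be 1):

```python
import torch

# Weight each (a_t, reward-to-go) pair by the single-step ratio
# pi_theta'(a_t|s_t) / pi_theta(a_t|s_t) instead of a product of ratios.
A = 4
theta_old = torch.randn(A)                        # behavior-policy logits (fixed)
theta_new = theta_old.clone().requires_grad_()    # start optimizing from theta

T, N = 5, 32
actions = torch.randint(0, A, (N, T))             # actions from the behavior policy
rewards = torch.randn(N, T)                       # fake per-step rewards
# Reward-to-go at t: total return minus the rewards strictly before t.
rtg = rewards.sum(dim=1, keepdim=True) - rewards.cumsum(dim=1) + rewards

logp_old = torch.log_softmax(theta_old, dim=-1)[actions]   # (N, T)
logp_new = torch.log_softmax(theta_new, dim=-1)[actions]   # (N, T)

# Per-timestep importance ratio only; the state-marginal ratio is assumed ~1.
ratio = (logp_new - logp_old).exp()

objective = (ratio * rtg).mean()   # approximate off-policy objective (maximize)
(-objective).backward()            # negate so a minimizer performs gradient ascent
print(theta_new.grad)
```

Because $\theta'$ starts at $\theta$ here, the ratios are initially all 1 and the first step reproduces the on-policy policy gradient; keeping $\theta'$ close to $\theta$ during optimization is what keeps the approximation above reasonable.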