Many 🚓 Policy Gradient algorithms are on-policy, meaning that they require samples collected by running the same policy we're optimizing. Once we update our policy, we need to collect new samples.

The goal of the off-policy policy gradient is to be able to update our policy using samples generated by a different policy. Let $\pi_{\theta'}$ be our desired target policy and $\pi_\theta$ be the policy that generated our samples.

We start with the original objective and apply 🪆 Importance Sampling to rewrite the expectation in terms of the sample distribution $\pi_\theta(\tau)$:

$$J(\theta') = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, r(\tau)\right], \qquad \frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)},$$

where the trajectory ratio reduces to a product of per-step policy ratios because the initial-state and transition probabilities cancel.
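For concreteness, here is a minimal Monte Carlo sketch of this importance-sampled objective, assuming the per-step action log-probabilities under both policies are already stored as arrays (the function name and array shapes are illustrative, not from the original):

```python
import numpy as np

def off_policy_objective_estimate(logp_target, logp_behavior, rewards):
    """Estimate J(theta') from N trajectories sampled under the behavior policy.

    logp_target:   (N, T) log pi_theta'(a_t | s_t) for the logged actions
    logp_behavior: (N, T) log pi_theta(a_t | s_t) under the sampling policy
    rewards:       (N, T) per-step rewards
    """
    # Trajectory importance weight: dynamics terms cancel, leaving a product
    # of per-step policy ratios; compute it in log space for stability.
    log_w = (logp_target - logp_behavior).sum(axis=1)   # (N,)
    returns = rewards.sum(axis=1)                       # (N,) total reward r(tau)
    return np.mean(np.exp(log_w) * returns)
```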
Next, for our policy gradient, we can apply the 🦄 Log Derivative Trick:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, \nabla_{\theta'} \log \pi_{\theta'}(\tau)\, r(\tau)\right]$$
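In an autodiff framework, one convenient way to realize this gradient is through a surrogate loss whose gradient matches it. A hedged PyTorch sketch (names and shapes are assumptions, not a reference implementation):

```python
import torch

def off_policy_pg_surrogate(logp_target, logp_behavior, rewards):
    """Surrogate loss whose gradient w.r.t. the target-policy parameters is
    the importance-sampled policy gradient above (negated, so we can minimize).

    logp_target:   (N, T) log pi_theta'(a_t | s_t), attached to the target
                   policy's computation graph
    logp_behavior: (N, T) log pi_theta(a_t | s_t), treated as constants
    rewards:       (N, T) per-step rewards
    """
    log_w = (logp_target - logp_behavior.detach()).sum(dim=1)  # log trajectory ratio
    returns = rewards.sum(dim=1)
    # Autograd on exp(log_w) yields ratio * grad log pi_theta'(tau), which is
    # exactly the log derivative trick applied to the importance weight.
    return -(torch.exp(log_w) * returns).mean()
```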
While this gradient can work on its own, there are a few adjustments we make for practicality. First, we can incorporate causality (the policy at time $t$ doesn't affect rewards in the past) to get

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}\right)\right]$$
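To see where the two products come from computationally, the sketch below builds both sets of coefficients for a single trajectory from stored log-probabilities (array names and shapes are illustrative):

```python
import numpy as np

def causal_is_coefficients(logp_target, logp_behavior):
    """Coefficients in the causal off-policy gradient for one trajectory.

    Returns:
      past_w[t]       = prod_{t' <= t} ratio_{t'}    (first product, multiplies the score at t)
      future_w[t, t'] = prod_{t''=t..t'} ratio_{t''} (second product, multiplies r at t' >= t)
    Both are products of many per-step ratios, so they can grow or shrink
    exponentially with the horizon.
    """
    log_ratio = logp_target - logp_behavior                  # (T,)
    past_w = np.exp(np.cumsum(log_ratio))                    # (T,) prefix products
    prefix = np.concatenate([[0.0], np.cumsum(log_ratio)])   # (T+1,) prefix sums in log space
    T = len(log_ratio)
    future_w = np.zeros((T, T))
    for t in range(T):
        for tp in range(t, T):
            future_w[t, tp] = np.exp(prefix[tp + 1] - prefix[t])
    return past_w, future_w
```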
The difficulty here is that both the first and second products are products of up to $T$ ratios, so they can be exponentially large (or small) in the horizon, and we must find a way to reduce the variance of our estimates.

  1. We can ignore the second product, which is an importance weight on our future rewards, to recover a ♻️ Policy Iteration algorithm. This will still improve our policy, but we no longer have the actual gradient; an approximation will do, however.
  2. As for the first product, the importance weight over the action history up to time $t$, we can consider using only the final (per-timestep) importance weight as an approximation. The reasoning behind this choice is that if we look at the state-action marginal (occupancy measure) formulation of our objective, we get

$$J(\theta') = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[\frac{p_{\theta'}(s_t)}{p_\theta(s_t)}\, \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, r(s_t, a_t)\right]$$
by assuming $p_\theta(s_t) \approx p_{\theta'}(s_t)$ and ignoring the gradient's impact on $p_{\theta'}(s_t)$. Our original trajectory-level estimate can then use only the importance weight for each state-action pair, giving us

$$\nabla_{\theta'} J(\theta') \approx \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\right)\right]$$
In general, this is not the correct policy gradient. However, if $\pi_\theta$ is close to $\pi_{\theta'}$, the error is bounded, and this method avoids the exponential blowup.
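As a sketch of the resulting estimator, the surrogate below keeps only the per-timestep ratio together with the reward-to-go, so its gradient matches the approximate expression above (PyTorch again, with assumed array shapes):

```python
import torch

def per_step_is_surrogate(logp_target, logp_behavior, rewards):
    """Surrogate whose gradient is the per-timestep importance-weighted
    policy gradient approximation (negated for minimization).

    logp_target:   (N, T) log pi_theta'(a_t | s_t), on the target policy's graph
    logp_behavior: (N, T) log pi_theta(a_t | s_t), treated as constants
    rewards:       (N, T) per-step rewards
    """
    # Reward-to-go: Q_hat_t = sum_{t' >= t} r_{t'}
    q_hat = torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])
    # A single per-step ratio replaces the product over the whole prefix.
    ratio = torch.exp(logp_target - logp_behavior.detach())   # (N, T)
    # Autograd gives ratio * grad log pi_theta'(a_t | s_t) * Q_hat_t per step.
    return -(ratio * q_hat).sum(dim=1).mean()
```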