Many 🚓 Policy Gradient algorithms are on-policy, meaning that they require samples collected by running the same policy we're optimizing. Once we update our policy, we need to collect new samples.

The goal of the off-policy policy gradient is to be able to update our policy using samples generated by a different policy. Let $\pi_{\theta'}$ be our desired target policy and $\pi_\theta$ be the policy that generated our samples.

We start with the original objective and apply 🪆 Importance Sampling to rewrite the expectation in terms of the sample distribution $\pi_\theta(\tau)$:

$$J(\theta') = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, r(\tau)\right], \qquad \frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)},$$

where the trajectory ratio reduces to a product of per-step policy ratios because the initial-state and transition probabilities cancel.
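For concreteness, here is a minimal Monte Carlo sketch of this importance-sampled objective, assuming the per-step action log-probabilities under both policies are already stored as arrays (the function name and array shapes are illustrative, not from the original):

```python
import numpy as np

def off_policy_objective_estimate(logp_target, logp_behavior, rewards):
    """Estimate J(theta') from N trajectories sampled under the behavior policy.

    logp_target:   (N, T) log pi_theta'(a_t | s_t) for the logged actions
    logp_behavior: (N, T) log pi_theta(a_t | s_t) under the sampling policy
    rewards:       (N, T) per-step rewards
    """
    # Trajectory importance weight: dynamics terms cancel, leaving a product
    # of per-step policy ratios; compute it in log space for stability.
    log_w = (logp_target - logp_behavior).sum(axis=1)   # (N,)
    returns = rewards.sum(axis=1)                       # (N,) total reward r(tau)
    return np.mean(np.exp(log_w) * returns)
```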
Next, for our policy gradient, we can apply the 🦄 Log Derivative Trick:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, \nabla_{\theta'} \log \pi_{\theta'}(\tau)\, r(\tau)\right]$$
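In an autodiff framework, one convenient way to realize this gradient is through a surrogate loss whose gradient matches it. A hedged PyTorch sketch (names and shapes are assumptions, not a reference implementation):

```python
import torch

def off_policy_pg_surrogate(logp_target, logp_behavior, rewards):
    """Surrogate loss whose gradient w.r.t. the target-policy parameters is
    the importance-sampled policy gradient above (negated, so we can minimize).

    logp_target:   (N, T) log pi_theta'(a_t | s_t), attached to the target
                   policy's computation graph
    logp_behavior: (N, T) log pi_theta(a_t | s_t), treated as constants
    rewards:       (N, T) per-step rewards
    """
    log_w = (logp_target - logp_behavior.detach()).sum(dim=1)  # log trajectory ratio
    returns = rewards.sum(dim=1)
    # Autograd on exp(log_w) yields ratio * grad log pi_theta'(tau), which is
    # exactly the log derivative trick applied to the importance weight.
    return -(torch.exp(log_w) * returns).mean()
```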
While this gradient can work on its own, there are a few adjustments we make for practicality. First, we can incorporate causality (the policy at time $t$ doesn't affect rewards in the past) to get

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}\right)\right]$$
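To see where the two products come from computationally, the sketch below builds both sets of coefficients for a single trajectory from stored log-probabilities (array names and shapes are illustrative):

```python
import numpy as np

def causal_is_coefficients(logp_target, logp_behavior):
    """Coefficients in the causal off-policy gradient for one trajectory.

    Returns:
      past_w[t]       = prod_{t' <= t} ratio_{t'}    (first product, multiplies the score at t)
      future_w[t, t'] = prod_{t''=t..t'} ratio_{t''} (second product, multiplies r at t' >= t)
    Both are products of many per-step ratios, so they can grow or shrink
    exponentially with the horizon.
    """
    log_ratio = logp_target - logp_behavior                  # (T,)
    past_w = np.exp(np.cumsum(log_ratio))                    # (T,) prefix products
    prefix = np.concatenate([[0.0], np.cumsum(log_ratio)])   # (T+1,) prefix sums in log space
    T = len(log_ratio)
    future_w = np.zeros((T, T))
    for t in range(T):
        for tp in range(t, T):
            future_w[t, tp] = np.exp(prefix[tp + 1] - prefix[t])
    return past_w, future_w
```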
The difficulty here is that both the first and second products are products of up to $T$ ratios, so they can be exponentially large (or small) in the horizon, and we must find a way to reduce the variance of our estimates.

  1. We can ignore the second product, which is an importance weight on our future rewards, to recover a ♻️ Policy Iteration algorithm. This will still improve our policy, but we no longer have the actual gradient; an approximation will do, however.
  2. As for the first product, the importance weight over the action history up to time $t$, we can consider using only the final (per-timestep) importance weight as an approximation. The reasoning behind this choice is that if we look at the state-action marginal (occupancy measure) formulation of our objective, we get

$$J(\theta') = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[\frac{p_{\theta'}(s_t)}{p_\theta(s_t)}\, \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, r(s_t, a_t)\right]$$
by assuming $p_\theta(s_t) \approx p_{\theta'}(s_t)$ and ignoring the gradient's impact on $p_{\theta'}(s_t)$. Our original trajectory-level estimate can then use only the importance weight for each state-action pair, giving us

$$\nabla_{\theta'} J(\theta') \approx \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\right)\right]$$
In general, this is not the correct policy gradient. However, if $\pi_\theta$ is close to $\pi_{\theta'}$, the error is bounded, and this method avoids the exponential blowup.
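As a sketch of the resulting estimator, the surrogate below keeps only the per-timestep ratio together with the reward-to-go, so its gradient matches the approximate expression above (PyTorch again, with assumed array shapes):

```python
import torch

def per_step_is_surrogate(logp_target, logp_behavior, rewards):
    """Surrogate whose gradient is the per-timestep importance-weighted
    policy gradient approximation (negated for minimization).

    logp_target:   (N, T) log pi_theta'(a_t | s_t), on the target policy's graph
    logp_behavior: (N, T) log pi_theta(a_t | s_t), treated as constants
    rewards:       (N, T) per-step rewards
    """
    # Reward-to-go: Q_hat_t = sum_{t' >= t} r_{t'}
    q_hat = torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])
    # A single per-step ratio replaces the product over the whole prefix.
    ratio = torch.exp(logp_target - logp_behavior.detach())   # (N, T)
    # Autograd gives ratio * grad log pi_theta'(a_t | s_t) * Q_hat_t per step.
    return -(ratio * q_hat).sum(dim=1).mean()
```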