Advantage-weighted regression (AWR) is an off-policy algorithm, similar to ♻️ Policy Iteration, that alternates between estimating the advantage and improving the policy using samples from a replay buffer.

To start, we use the approximate constrained policy search from 🚜 Natural Policy Gradients,

$$ \arg\max_\pi \; J(\pi) \quad \text{s.t.} \quad \mathbb{E}_{s \sim d_\mu(s)} \left[ D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s) \big) \right] \le \epsilon, $$

where $\mu$ is the sampling (behavior) policy and $d_\mu(s)$ is its state distribution.

Our objective can be written in terms of the advantage $A^\mu$ of the sampling policy,

$$ J(\pi) = J(\mu) + \mathbb{E}_{s \sim d_\pi(s)} \, \mathbb{E}_{a \sim \pi(a \mid s)} \left[ A^\mu(s, a) \right], $$

and we'll approximate it using old state samples from $d_\mu(s)$ instead of $d_\pi(s)$. Since $J(\mu)$ doesn't depend on $\pi$, this leaves us maximizing the expected advantage under the KL constraint; solving this approximated objective with the Lagrangian yields the optimal policy

$$ \pi^*(a \mid s) \propto \mu(a \mid s) \exp\left( \frac{1}{\beta} A^\mu(s, a) \right), $$

where $\beta > 0$ is a temperature determined by the Lagrange multiplier on the KL constraint.

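For completeness, a quick sketch of that Lagrangian step: form

$$ \mathcal{L}(\pi, \beta) = \mathbb{E}_{s \sim d_\mu(s)} \, \mathbb{E}_{a \sim \pi(a \mid s)} \left[ A^\mu(s, a) \right] + \beta \left( \epsilon - \mathbb{E}_{s \sim d_\mu(s)} \left[ D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s) \big) \right] \right), $$

take the derivative with respect to $\pi(a \mid s)$ for each state (with an extra per-state multiplier enforcing that $\pi(\cdot \mid s)$ normalizes), and set it to zero, which gives $\pi^*(a \mid s) = \frac{1}{Z(s)} \, \mu(a \mid s) \exp\big( \tfrac{1}{\beta} A^\mu(s, a) \big)$ with $Z(s)$ the per-state normalizer.
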
For a parameterization $\pi_\theta$ of the policy with parameters $\theta$, we can then optimize the policy toward this optimum by projecting onto it (a KL projection over states from $d_\mu$, dropping the per-state normalizer):

$$ \theta^* = \arg\max_\theta \; \mathbb{E}_{s \sim d_\mu(s)} \, \mathbb{E}_{a \sim \mu(a \mid s)} \left[ \log \pi_\theta(a \mid s) \exp\left( \frac{1}{\beta} A^\mu(s, a) \right) \right]. $$

This is just supervised regression of the policy onto the sampled actions, weighted by exponentiated advantages, which is where the name comes from.

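As a concrete sketch of that weighted-regression update, assuming a PyTorch Gaussian policy; the network names, hyperparameters, and the return/value inputs used for the advantage estimate are placeholders for whatever model and estimator you use:

```python
import torch
from torch.distributions import Normal

def awr_policy_loss(mean_net, log_std, states, actions, returns, values,
                    beta=0.05, weight_clip=20.0):
    """Exponentially weighted negative log-likelihood: the AWR policy objective."""
    # Advantage estimate: buffered return minus the fitted baseline V(s).
    adv = returns - values
    # Weights exp(A / beta), clipped so a few large advantages don't dominate,
    # and detached so no gradient flows through the advantage estimate.
    weights = torch.clamp(torch.exp(adv / beta), max=weight_clip).detach()
    # Log-likelihood of the buffered actions (batch, act_dim) under the policy.
    dist = Normal(mean_net(states), log_std.exp())
    log_prob = dist.log_prob(actions).sum(dim=-1)
    # Maximizing the weighted log-likelihood = minimizing this loss.
    return -(weights * log_prob).mean()
```

The loss would be minimized over minibatches sampled from the buffer for some number of gradient steps per iteration.
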
Though this derivation was in terms of a single $\mu$, it can be generalized to a mixture of past policies whose observations are stored in a replay buffer $\mathcal{D}$. The value function $V$ is then fit on all past policies by sampling from the buffer,

$$ V^* = \arg\min_V \; \mathbb{E}_{s, a \sim \mathcal{D}} \left[ \left\| \mathcal{R}^{\mathcal{D}}_{s,a} - V(s) \right\|^2 \right], $$

where $\mathcal{R}^{\mathcal{D}}_{s,a}$ is the return stored in the buffer for state $s$ and action $a$.

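A minimal sketch of that value fit, assuming plain discounted Monte Carlo returns and a linear value function (the paper uses TD($\lambda$) return estimates and a neural network, but the regression itself is the same idea):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted return R_t = sum_k gamma^k * r_{t+k} for one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def fit_linear_value(states, returns):
    """Least-squares fit of a linear baseline V(s) = w @ [s, 1] to buffered returns."""
    feats = np.hstack([states, np.ones((len(states), 1))])
    w, *_ = np.linalg.lstsq(feats, returns, rcond=None)
    return w
```
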
We thus have the two components needed to iteratively improve our policy. A complete AWR step is as follows:

  1. Sample trajectories with $\pi_\theta$ and add them to $\mathcal{D}$.
  2. Update $V$ by fitting it to the returns from $\mathcal{D}$ using the equation above.
  3. Update $\pi_\theta$ by fitting it with the exponential weight,
$$ \theta \leftarrow \arg\max_\theta \; \mathbb{E}_{s, a \sim \mathcal{D}} \left[ \log \pi_\theta(a \mid s) \exp\left( \frac{1}{\beta} \big( \mathcal{R}^{\mathcal{D}}_{s,a} - V(s) \big) \right) \right]. $$
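
Putting the three steps together, here's a minimal self-contained sketch of the loop on a toy 1-D task, with a linear value function and a linear-Gaussian policy so that both fits have closed forms (the environment and hyperparameters are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, beta, sigma, horizon = 0.99, 0.5, 0.3, 5
theta = np.zeros(2)   # policy mean:  a = theta @ [s, 1]
w_v = np.zeros(2)     # value:        V(s) = w_v @ [s, 1]
buf_s, buf_a, buf_R = [], [], []   # replay buffer D

def feats(s):
    return np.stack([s, np.ones_like(s)], axis=-1)

def rollout():
    """One trajectory from a toy task: s' = s + a, r = -(s')^2."""
    s = rng.uniform(-1.0, 1.0)
    states, actions, rewards = [], [], []
    for _ in range(horizon):
        a = theta @ np.array([s, 1.0]) + sigma * rng.standard_normal()
        states.append(s)
        actions.append(a)
        s = s + a
        rewards.append(-s ** 2)
    # discounted Monte Carlo returns R_t = sum_k gamma^k * r_{t+k}
    R, running = np.zeros(horizon), 0.0
    for t in reversed(range(horizon)):
        running = rewards[t] + gamma * running
        R[t] = running
    return np.array(states), np.array(actions), R

for it in range(100):
    # 1. sample trajectories with pi_theta and add them to D
    for _ in range(10):
        s, a, R = rollout()
        buf_s.append(s); buf_a.append(a); buf_R.append(R)
    S, A, Rets = map(np.concatenate, (buf_s, buf_a, buf_R))

    # 2. fit V to the buffered returns (least squares)
    w_v, *_ = np.linalg.lstsq(feats(S), Rets, rcond=None)

    # 3. exponentially weighted regression of actions onto states
    adv = Rets - feats(S) @ w_v
    w = np.minimum(np.exp(adv / beta), 20.0)   # clipped exp(A / beta)
    sw = np.sqrt(w)                            # sqrt-weights for weighted least squares
    theta, *_ = np.linalg.lstsq(feats(S) * sw[:, None], sw * A, rcond=None)

print("policy mean coefficients:", theta)  # should drift toward roughly [-1, 0]
```

With a neural-network policy and value function, steps 2 and 3 simply become a number of gradient steps on the same two losses per iteration.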