Advantage-weighted regression (AWR) is an off-policy algorithm, similar to ♻️ Policy Iteration, that alternates between estimating the advantage and improving the policy using samples from a replay buffer.

To start, we use the approximate constrained policy search from 🚜 Natural Policy Gradients,

$$ \arg\max_\pi \; J(\pi) \quad \text{s.t.} \quad \mathbb{E}_{s \sim d_\mu(s)} \left[ D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s) \big) \right] \le \epsilon, $$

where $\mu$ is the sampling (behavior) policy and $d_\mu(s)$ is its state distribution.

Our objective can be written in terms of the advantage $A^\mu$ of the sampling policy,

$$ J(\pi) = J(\mu) + \mathbb{E}_{s \sim d_\pi(s)} \, \mathbb{E}_{a \sim \pi(a \mid s)} \left[ A^\mu(s, a) \right], $$

and we'll approximate it using old state samples from $d_\mu(s)$ instead of $d_\pi(s)$. Since $J(\mu)$ doesn't depend on $\pi$, this leaves us maximizing the expected advantage under the KL constraint; solving this approximated objective with the Lagrangian yields the optimal policy

$$ \pi^*(a \mid s) \propto \mu(a \mid s) \exp\left( \frac{1}{\beta} A^\mu(s, a) \right), $$

where $\beta > 0$ is a temperature determined by the Lagrange multiplier on the KL constraint.

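For completeness, a quick sketch of that Lagrangian step: form

$$ \mathcal{L}(\pi, \beta) = \mathbb{E}_{s \sim d_\mu(s)} \, \mathbb{E}_{a \sim \pi(a \mid s)} \left[ A^\mu(s, a) \right] + \beta \left( \epsilon - \mathbb{E}_{s \sim d_\mu(s)} \left[ D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s) \big) \right] \right), $$

take the derivative with respect to $\pi(a \mid s)$ for each state (with an extra per-state multiplier enforcing that $\pi(\cdot \mid s)$ normalizes), and set it to zero, which gives $\pi^*(a \mid s) = \frac{1}{Z(s)} \, \mu(a \mid s) \exp\big( \tfrac{1}{\beta} A^\mu(s, a) \big)$ with $Z(s)$ the per-state normalizer.
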
For a parameterization $\pi_\theta$ of the policy with parameters $\theta$, we can then optimize the policy toward this optimum by projecting onto it (a KL projection over states from $d_\mu$, dropping the per-state normalizer):

$$ \theta^* = \arg\max_\theta \; \mathbb{E}_{s \sim d_\mu(s)} \, \mathbb{E}_{a \sim \mu(a \mid s)} \left[ \log \pi_\theta(a \mid s) \exp\left( \frac{1}{\beta} A^\mu(s, a) \right) \right]. $$

This is just supervised regression of the policy onto the sampled actions, weighted by exponentiated advantages, which is where the name comes from.

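As a concrete sketch of that weighted-regression update, assuming a PyTorch Gaussian policy; the network names, hyperparameters, and the return/value inputs used for the advantage estimate are placeholders for whatever model and estimator you use:

```python
import torch
from torch.distributions import Normal

def awr_policy_loss(mean_net, log_std, states, actions, returns, values,
                    beta=0.05, weight_clip=20.0):
    """Exponentially weighted negative log-likelihood: the AWR policy objective."""
    # Advantage estimate: buffered return minus the fitted baseline V(s).
    adv = returns - values
    # Weights exp(A / beta), clipped so a few large advantages don't dominate,
    # and detached so no gradient flows through the advantage estimate.
    weights = torch.clamp(torch.exp(adv / beta), max=weight_clip).detach()
    # Log-likelihood of the buffered actions (batch, act_dim) under the policy.
    dist = Normal(mean_net(states), log_std.exp())
    log_prob = dist.log_prob(actions).sum(dim=-1)
    # Maximizing the weighted log-likelihood = minimizing this loss.
    return -(weights * log_prob).mean()
```

The loss would be minimized over minibatches sampled from the buffer for some number of gradient steps per iteration.
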
Though this derivation was in terms of a single $\mu$, it can be generalized to a mixture of past policies whose observations are stored in a replay buffer $\mathcal{D}$. The value function $V$ is then fit on all past policies by sampling from the buffer,

$$ V^* = \arg\min_V \; \mathbb{E}_{s, a \sim \mathcal{D}} \left[ \left\| \mathcal{R}^{\mathcal{D}}_{s,a} - V(s) \right\|^2 \right], $$

where $\mathcal{R}^{\mathcal{D}}_{s,a}$ is the return stored in the buffer for state $s$ and action $a$.

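A minimal sketch of that value fit, assuming plain discounted Monte Carlo returns and a linear value function (the paper uses TD($\lambda$) return estimates and a neural network, but the regression itself is the same idea):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted return R_t = sum_k gamma^k * r_{t+k} for one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def fit_linear_value(states, returns):
    """Least-squares fit of a linear baseline V(s) = w @ [s, 1] to buffered returns."""
    feats = np.hstack([states, np.ones((len(states), 1))])
    w, *_ = np.linalg.lstsq(feats, returns, rcond=None)
    return w
```
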
We thus have the two components needed to iteratively improve our policy. A complete AWR step is as follows:

  1. Sample trajectories with $\pi_\theta$ and add them to $\mathcal{D}$.
  2. Update $V$ by fitting it to the returns from $\mathcal{D}$ using the equation above.
  3. Update $\pi_\theta$ by fitting it with the exponential weight,
$$ \theta \leftarrow \arg\max_\theta \; \mathbb{E}_{s, a \sim \mathcal{D}} \left[ \log \pi_\theta(a \mid s) \exp\left( \frac{1}{\beta} \big( \mathcal{R}^{\mathcal{D}}_{s,a} - V(s) \big) \right) \right]. $$
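
Putting the three steps together, here's a minimal self-contained sketch of the loop on a toy 1-D task, with a linear value function and a linear-Gaussian policy so that both fits have closed forms (the environment and hyperparameters are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, beta, sigma, horizon = 0.99, 0.5, 0.3, 5
theta = np.zeros(2)   # policy mean:  a = theta @ [s, 1]
w_v = np.zeros(2)     # value:        V(s) = w_v @ [s, 1]
buf_s, buf_a, buf_R = [], [], []   # replay buffer D

def feats(s):
    return np.stack([s, np.ones_like(s)], axis=-1)

def rollout():
    """One trajectory from a toy task: s' = s + a, r = -(s')^2."""
    s = rng.uniform(-1.0, 1.0)
    states, actions, rewards = [], [], []
    for _ in range(horizon):
        a = theta @ np.array([s, 1.0]) + sigma * rng.standard_normal()
        states.append(s)
        actions.append(a)
        s = s + a
        rewards.append(-s ** 2)
    # discounted Monte Carlo returns R_t = sum_k gamma^k * r_{t+k}
    R, running = np.zeros(horizon), 0.0
    for t in reversed(range(horizon)):
        running = rewards[t] + gamma * running
        R[t] = running
    return np.array(states), np.array(actions), R

for it in range(100):
    # 1. sample trajectories with pi_theta and add them to D
    for _ in range(10):
        s, a, R = rollout()
        buf_s.append(s); buf_a.append(a); buf_R.append(R)
    S, A, Rets = map(np.concatenate, (buf_s, buf_a, buf_R))

    # 2. fit V to the buffered returns (least squares)
    w_v, *_ = np.linalg.lstsq(feats(S), Rets, rcond=None)

    # 3. exponentially weighted regression of actions onto states
    adv = Rets - feats(S) @ w_v
    w = np.minimum(np.exp(adv / beta), 20.0)   # clipped exp(A / beta)
    sw = np.sqrt(w)                            # sqrt-weights for weighted least squares
    theta, *_ = np.linalg.lstsq(feats(S) * sw[:, None], sw * A, rcond=None)

print("policy mean coefficients:", theta)  # should drift toward roughly [-1, 0]
```

With a neural-network policy and value function, steps 2 and 3 simply become a number of gradient steps on the same two losses per iteration.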