Advantage-weighted regression (AWR) is an off-policy algorithm, similar to ♻️ Policy Iteration, that alternates between estimating the advantage and improving the policy using samples from a replay buffer.
To start, we use the approximate constrained policy search from Natural Policy Gradients: maximize expected return while a KL constraint keeps the new policy $\pi$ close to the sampling (behavior) policy $\mu$,

$$\max_\pi \ J(\pi) \quad \text{s.t.} \quad \mathbb{E}_{s \sim d_\mu(s)} \big[ D_{\mathrm{KL}} \big( \pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s) \big) \big] \le \epsilon.$$

Our objective can be written in terms of advantage,

$$\eta(\pi) = J(\pi) - J(\mu) = \mathbb{E}_{s \sim d_\pi(s)} \, \mathbb{E}_{a \sim \pi(a \mid s)} \big[ A^\mu(s, a) \big],$$

and we'll approximate it using old state samples, replacing $d_\pi(s)$ with $d_\mu(s)$,

$$\hat{\eta}(\pi) = \mathbb{E}_{s \sim d_\mu(s)} \, \mathbb{E}_{a \sim \pi(a \mid s)} \big[ A^\mu(s, a) \big].$$

Solving this constrained problem in closed form gives an optimal policy that reweights $\mu$ by exponentiated advantage,

$$\pi^*(a \mid s) \propto \mu(a \mid s) \, \exp\!\Big( \tfrac{1}{\beta} A^\mu(s, a) \Big).$$

For a parameterization of $\pi$ by $\theta$, we project $\pi^*$ onto the parametric family by minimizing their KL divergence, which reduces to an advantage-weighted maximum-likelihood (regression) update on the old samples,

$$\theta \leftarrow \arg\max_\theta \ \mathbb{E}_{s \sim d_\mu(s)} \, \mathbb{E}_{a \sim \mu(a \mid s)} \Big[ \log \pi_\theta(a \mid s) \, \exp\!\Big( \tfrac{1}{\beta} A^\mu(s, a) \Big) \Big].$$

Though this derivation was in terms of a single behavior policy $\mu$, in practice we draw samples from a replay buffer $\mathcal{D}$ filled by many past policies and estimate the advantage as $A(s, a) = \mathcal{R}_{s, a} - V(s)$, where $\mathcal{R}_{s, a}$ is the sampled return and $V$ is a value function fit by regression,

$$V \leftarrow \arg\min_V \ \mathbb{E}_{s, a \sim \mathcal{D}} \Big[ \big\| \mathcal{R}_{s, a} - V(s) \big\|^2 \Big].$$
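For completeness, a quick sketch of where the exponential weight comes from, treating the KL constraint with a Lagrange multiplier $\beta$ (the multiplier and the per-state constant below are introduced only for this sketch):

$$\mathcal{L}(\pi, \beta) = \mathbb{E}_{s \sim d_\mu(s)} \, \mathbb{E}_{a \sim \pi(a \mid s)} \big[ A^\mu(s, a) \big] + \beta \Big( \epsilon - \mathbb{E}_{s \sim d_\mu(s)} \big[ D_{\mathrm{KL}} \big( \pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s) \big) \big] \Big).$$

Setting $\partial \mathcal{L} / \partial \pi(a \mid s) = 0$ gives $\log \pi(a \mid s) = \log \mu(a \mid s) + \tfrac{1}{\beta} A^\mu(s, a) + \text{const}(s)$, which normalizes to the exponentially reweighted policy above.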
We thus have the two components needed to iteratively improve our policy. A complete AWR step is as follows:
- Sample trajectories with the current policy $\pi$ and add them to the replay buffer $\mathcal{D}$.
- Update $V$ by fitting it to the returns in $\mathcal{D}$ using the regression above.
- Update $\pi$ by fitting it with the exponential weight, $\exp\!\big( \tfrac{1}{\beta} (\mathcal{R}_{s,a} - V(s)) \big)$ (a code sketch of one full step follows).
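To make the loop concrete, here is a minimal sketch of one AWR step in PyTorch. The `GaussianPolicy` class, the `awr_step` signature, and the specific values for `beta`, the gradient-step counts, and the weight clip are illustrative assumptions rather than part of the derivation above; `value_fn` is assumed to be any module mapping a batch of states to scalar values.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Small MLP producing a diagonal Gaussian over continuous actions."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, states, actions):
        dist = torch.distributions.Normal(self.mean(states), self.log_std.exp())
        return dist.log_prob(actions).sum(dim=-1)

def awr_step(policy, value_fn, policy_opt, value_opt,
             states, actions, returns,
             beta=0.05, value_steps=200, policy_steps=500, weight_clip=20.0):
    """One AWR iteration on (state, action, return) samples from the buffer."""
    # 1) Regress V(s) onto the sampled returns.
    for _ in range(value_steps):
        value_opt.zero_grad()
        value_loss = ((value_fn(states).squeeze(-1) - returns) ** 2).mean()
        value_loss.backward()
        value_opt.step()

    # 2) Advantages and exponential weights, held fixed while fitting the policy.
    with torch.no_grad():
        advantages = returns - value_fn(states).squeeze(-1)
        weights = torch.clamp(torch.exp(advantages / beta), max=weight_clip)

    # 3) Advantage-weighted regression: maximize the weighted log-likelihood
    #    of the buffer actions under the new policy.
    for _ in range(policy_steps):
        policy_opt.zero_grad()
        policy_loss = -(weights * policy.log_prob(states, actions)).mean()
        policy_loss.backward()
        policy_opt.step()
```

Because the weights are computed under `torch.no_grad()` and held fixed, the policy update is plain supervised regression onto the buffer actions; clipping the weights is a common practical guard against a few large advantages dominating the fit.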