Trust region policy optimization (TRPO) is an improvement on the Natural Policy Gradient with some optimizations and new guarantees. It solves the same optimization problem, using the same initial steps as the natural gradient:
- Approximate the KL divergence (which defines our "trust region") by its second-order Taylor expansion, $D_{\mathrm{KL}}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta) \approx \tfrac{1}{2}(\theta - \theta_{\text{old}})^\top F\, (\theta - \theta_{\text{old}})$, where $F$ is the Fisher information matrix.
- Follow the natural gradient $F^{-1}\hat{g}$ (with $\hat{g}$ the policy gradient), scaling the step so the approximate KL constraint $\le \delta$ is met: $\theta = \theta_{\text{old}} + \sqrt{\tfrac{2\delta}{\hat{g}^\top F^{-1}\hat{g}}}\, F^{-1}\hat{g}$ (a toy numerical sketch follows this list).
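As a concrete toy illustration of the update above, here's the step computed for a hypothetical 2-parameter policy. The values of $F$ and $\hat{g}$ are made up for illustration; a real implementation would estimate both from sampled trajectories.

```python
import numpy as np

# Made-up toy values for a hypothetical 2-parameter policy.
F = np.array([[2.0, 0.3],
              [0.3, 1.0]])   # Fisher information matrix (assumed given here)
g = np.array([0.5, -1.0])    # policy gradient estimate
delta = 0.01                 # KL trust-region radius

# Natural gradient direction F^{-1} g, solved directly (fine for a tiny toy F).
nat_grad = np.linalg.solve(F, g)

# Step size chosen so the quadratic KL approximation equals delta.
step_size = np.sqrt(2 * delta / (g @ nat_grad))

theta_old = np.zeros(2)
theta_new = theta_old + step_size * nat_grad
print(theta_new)
```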
TRPO offers two main improvements over the standard natural policy gradient: a cheaper way to compute the natural gradient direction (avoiding an explicit inverse of the Fisher matrix) and an improvement check via line search.
First, observe that we don't actually need to directly invert the Fisher matrix $F$. Rather, what we really want to compute is the product $x = F^{-1}\hat{g}$, i.e. the solution to the linear system $F x = \hat{g}$ for some policy gradient estimate $\hat{g}$. The conjugate gradient algorithm solves this system using only Fisher-vector products $F v$, so $F$ never has to be formed or inverted explicitly.
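Here's a minimal sketch of that idea: a standard conjugate gradient solver that only ever calls a Fisher-vector product `fvp(v)`. The explicit matrix `F` at the bottom is just a toy check; in practice $Fv$ would be computed from samples without forming $F$.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g for x, using only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()              # residual r = g - F x (x starts at zero)
    p = r.copy()              # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:      # residual small enough; stop early
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check with an explicit Fisher matrix.
F = np.array([[2.0, 0.3],
              [0.3, 1.0]])
g = np.array([0.5, -1.0])
x = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ x, g))  # True: x is (approximately) F^{-1} g
```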
Second, since we're making several approximations, the natural gradient step may not actually improve our policy. TRPO double-checks the update by performing a backtracking line search: iteratively shrinking the update size (trying steps scaled by $\alpha^j$ for $j = 0, 1, 2, \dots$ with some $\alpha \in (0, 1)$) until the step both satisfies the KL constraint and actually improves the objective.
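A minimal sketch of that backtracking line search follows. The callables `surrogate_improvement` and `kl_divergence` are hypothetical stand-ins for quantities a real implementation would estimate from sampled trajectories, and the toy values at the bottom exist only to exercise the loop.

```python
import numpy as np

def line_search(theta_old, full_step, surrogate_improvement, kl_divergence,
                delta=0.01, backtrack_ratio=0.8, max_backtracks=10):
    """Shrink the proposed update until it satisfies the KL constraint and
    actually improves the objective; otherwise keep the old parameters."""
    for j in range(max_backtracks):
        theta_new = theta_old + (backtrack_ratio ** j) * full_step
        if (kl_divergence(theta_old, theta_new) <= delta
                and surrogate_improvement(theta_old, theta_new) > 0):
            return theta_new
        # Constraint violated or no improvement: shrink the step and retry.
    return theta_old   # no acceptable step found; don't update

# Hypothetical toy quantities standing in for sample-based estimates.
opt = np.array([1.0, -2.0])   # pretend optimum of the toy objective
surr = lambda old, new: np.sum((old - opt) ** 2) - np.sum((new - opt) ** 2)
kl = lambda old, new: 0.5 * np.sum((new - old) ** 2)
# delta=0.5 is a deliberately loose toy threshold so a shrunken step is accepted.
theta = line_search(np.zeros(2), np.array([0.6, -1.2]), surr, kl, delta=0.5)
print(theta)
```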
Putting these additions together, we have the full TRPO algorithm. We'll maintain a policy $\pi_{\theta_k}$ and a value function $V_{\phi_k}$, and at each iteration $k$:
- Collect a set of trajectories $\mathcal{D}_k$ by running $\pi_{\theta_k}$ in the environment and save the rewards $r_t$.
- Compute advantage estimates $\hat{A}_t$ based on our current value function $V_{\phi_k}$ (one common estimator is sketched after this list).
- Estimate the policy gradient $\hat{g}_k$.
- Use the conjugate gradient algorithm to find $\hat{x}_k \approx F_k^{-1}\hat{g}_k$, and also calculate the step size $\sqrt{2\delta / (\hat{x}_k^\top F_k \hat{x}_k)}$; then update the policy via line search, $\theta_{k+1} = \theta_k + \alpha^j \sqrt{\tfrac{2\delta}{\hat{x}_k^\top F_k \hat{x}_k}}\, \hat{x}_k$, where $j$ is the smallest non-negative integer that satisfies the KL constraint and improves the objective.
- Fit our value function on data from the new policy, obtaining $V_{\phi_{k+1}}$.
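The text above doesn't fix a particular advantage estimator; generalized advantage estimation (GAE) is one common choice in TRPO implementations. Here's a minimal sketch, assuming a single finished episode with rewards $r_t$ and value predictions $V(s_t)$ already in hand (the arrays at the bottom are made-up toy values).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one finished episode.

    rewards: shape (T,)   rewards r_0 .. r_{T-1}
    values:  shape (T+1,) value predictions V(s_0) .. V(s_T), with V(s_T)=0 if terminal
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy example with made-up rewards and value predictions.
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.6, 0.4, 0.0])   # includes V(s_T) = 0 for the terminal state
print(gae_advantages(rewards, values))
```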