Trust region policy optimization (TRPO) is an improvement on the 🚜 Natural Policy Gradient with some optimizations and new guarantees. It solves the same optimization problem,

$$\max_\theta \; \mathcal{L}_{\theta_{\text{old}}}(\theta) = \mathbb{E}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s, a)\right] \quad \text{subject to} \quad \bar{D}_{\text{KL}}(\theta_{\text{old}} \,\|\, \theta) \le \delta,$$

using the same initial steps as the natural gradient:

  1. Approximate the KL divergence (which defines our “trust region”) as the quadratic form $\bar{D}_{\text{KL}}(\theta_{\text{old}} \,\|\, \theta) \approx \frac{1}{2} (\theta - \theta_{\text{old}})^\top F (\theta - \theta_{\text{old}})$, where $F$ is the Fisher information matrix.
  2. Follow the natural gradient $\theta_{\text{new}} = \theta_{\text{old}} + \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\, F^{-1} g$, where $g$ is the policy gradient.
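For intuition, the step size in step 2 comes from linearizing the objective around $\theta_{\text{old}}$ as $g^\top \Delta\theta$ and replacing the KL constraint with its quadratic approximation,

$$\max_{\Delta\theta} \; g^\top \Delta\theta \quad \text{subject to} \quad \tfrac{1}{2}\, \Delta\theta^\top F\, \Delta\theta \le \delta.$$

The maximizer is proportional to $F^{-1} g$, and scaling it to sit exactly on the constraint boundary gives the $\sqrt{2\delta / (g^\top F^{-1} g)}$ factor above.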

TRPO offers two main improvements over the standard natural policy gradient: a cheaper way to handle the Fisher matrix (solving a linear system with conjugate gradients instead of inverting it) and an improvement check via line search.

First, observe that we don’t actually need to directly invert $F$ for our gradient descent step

$$\theta_{\text{new}} = \theta_{\text{old}} + \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\, F^{-1} g.$$

Rather, what we really want to compute is the product $F^{-1} g$. Reformulating this problem, we can solve

$$F x = g$$

for some $x$, which can be done via conjugate gradients. Note that we’ll still need to find $g^\top F^{-1} g = g^\top x$ for the step size calculation, but we can avoid the expensive inverse calculation.
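Here is a minimal sketch of that trick, assuming the parameters live in a flat NumPy vector and that `fvp(v)` is a callback returning the Fisher-vector product $Fv$ (typically implemented as a Hessian-vector product of the KL divergence); the iteration count and tolerance are illustrative defaults, not values prescribed by the text:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, given only a Fisher-vector product fvp(v) = F @ v."""
    x = np.zeros_like(g)          # start from x_0 = 0
    r = g.copy()                  # residual r = g - F x
    p = r.copy()                  # first search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)               # one Fisher-vector product per iteration; F is never formed
        alpha = r_dot / (p @ Fp)  # step length along p
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:       # residual is small enough: F x ≈ g
            break
        p = r + (new_r_dot / r_dot) * p   # next conjugate search direction
        r_dot = new_r_dot
    return x
```

Once $x$ is returned, the dot product $g^\top x$ approximates $g^\top F^{-1} g$, which is exactly what the step size calculation needs.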

Second, since we’re making some approximations, the natural gradient step may not actually improve our policy. TRPO double-checks the update by performing a backtracking line search: iteratively reducing the update size until $\bar{D}_{\text{KL}}(\theta_{\text{old}} \,\|\, \theta_{\text{new}}) \le \delta$ and $\mathcal{L}_{\theta_{\text{old}}}(\theta_{\text{new}}) \ge 0$, thus guaranteeing an improvement.
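A minimal sketch of this backtracking check, assuming `surrogate_improvement(theta)` evaluates $\mathcal{L}_{\theta_{\text{old}}}(\theta)$ and `mean_kl(theta)` evaluates $\bar{D}_{\text{KL}}(\theta_{\text{old}} \,\|\, \theta)$ on the collected data, with parameters as flat NumPy vectors; the backtracking coefficient of 0.8 and the cap of 10 tries are illustrative choices:

```python
def line_search(theta_old, full_step, surrogate_improvement, mean_kl, delta,
                backtrack_coef=0.8, max_backtracks=10):
    """Shrink the proposed natural-gradient step until it both improves the
    surrogate objective and stays inside the KL trust region."""
    for j in range(max_backtracks):
        theta_new = theta_old + (backtrack_coef ** j) * full_step
        if mean_kl(theta_new) <= delta and surrogate_improvement(theta_new) >= 0:
            return theta_new      # accepted: constraint satisfied and no regression
    return theta_old              # no acceptable step found; keep the old parameters
```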

Putting these additions together, we have the full TRPO algorithm. We’ll maintain a policy $\pi_\theta$ as well as a value function $V_\phi$ for advantage estimates in the policy gradient. A single improvement step is as follows:

  1. Collect trajectories $\mathcal{D}$ by running $\pi_\theta$ in the environment and save the rewards-to-go $\hat{R}_t$.
  2. Compute advantage estimates $\hat{A}_t$ based on our current value function $V_\phi$.
  3. Estimate the policy gradient
     $$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t.$$
  4. Use the conjugate gradient algorithm to find $\hat{x} \approx \hat{F}^{-1} \hat{g}$, and also calculate $\hat{x}^\top \hat{F} \hat{x}$ for the step size normally.
  5. Update the policy with line search: find the smallest $j \in \{0, 1, 2, \dots\}$ such that $\bar{D}_{\text{KL}}(\theta_{\text{old}} \,\|\, \theta_{\text{new}}) \le \delta$ and $\mathcal{L}_{\theta_{\text{old}}}(\theta_{\text{new}}) \ge 0$ for the new parameters
     $$\theta_{\text{new}} = \theta_{\text{old}} + \alpha^j \sqrt{\frac{2\delta}{\hat{x}^\top \hat{F} \hat{x}}}\, \hat{x}.$$
  6. Fit our value function on the new policy, $\phi \leftarrow \arg\min_\phi \sum_t \big(V_\phi(s_t) - \hat{R}_t\big)^2$.
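As a rough sketch of how these six steps might be wired together, reusing the `conjugate_gradient` and `line_search` sketches above: `collect_trajectories`, `estimate_advantages`, `policy_gradient`, `fisher_vector_product`, `surrogate_improvement`, `mean_kl`, and `fit_value_function` are assumed helper callables supplied by the surrounding training code, not part of the algorithm description itself.

```python
import numpy as np

def trpo_update(theta, delta, collect_trajectories, estimate_advantages,
                policy_gradient, fisher_vector_product, surrogate_improvement,
                mean_kl, fit_value_function):
    """One TRPO improvement step; everything after `delta` is an assumed callable."""
    data = collect_trajectories(theta)               # step 1: rollouts and rewards-to-go
    adv = estimate_advantages(data)                  # step 2: A-hat from the current value function
    g = policy_gradient(theta, data, adv)            # step 3: policy gradient estimate

    fvp = lambda v: fisher_vector_product(theta, data, v)
    x = conjugate_gradient(fvp, g)                   # step 4: x ≈ F^-1 g via conjugate gradients
    full_step = np.sqrt(2.0 * delta / (x @ fvp(x))) * x   # sqrt(2δ / x^T F x) · x

    theta_new = line_search(                         # step 5: backtracking line search
        theta, full_step,
        lambda th: surrogate_improvement(th, theta, data, adv),
        lambda th: mean_kl(th, theta, data),
        delta)

    fit_value_function(data)                         # step 6: refit V on the fresh rollouts
    return theta_new
```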