Proximal policy optimization (PPO) is a direct successor to Trust Region Policy Optimization that uses a different methodology to deal with the KL constraint. We still follow the natural gradient objective

$$\max_\theta \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_s\!\left[ D_{\text{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta,$$

but unlike the Natural Policy Gradient or TRPO, PPO introduces two simple ways of dealing with the constraint. First, an adaptive penalty can be used to estimate the Lagrange multiplier for this optimization; second, we can implicitly enforce the constraint by disincentivizing updates outside of it via clipping. The latter often performs better in practice and is usually the default version of PPO.

Adaptive Penalty

First, PPO with adaptive penalty directly incorporates the constraint into our objective by using a soft penalty; similar to the Lagrangian, we'll optimize

$$\max_\theta \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a) - \beta \, D_{\text{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right].$$

At every optimization step, if the measured KL divergence exceeds the target $\delta$ by too much, we increase $\beta$; if it falls well below $\delta$, we decrease $\beta$. This doesn't strictly enforce the KL constraint, but it maintains the same spirit: make small changes to $\pi_\theta$.
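As a concrete sketch of this rule (the function and variable names here are my own; the 1.5 margin and factor-of-2 scaling are the heuristic values suggested in the PPO paper), the penalty coefficient can be adjusted like this:

```python
import numpy as np

def update_beta(beta, measured_kl, kl_target, margin=1.5, scale=2.0):
    """Illustrative adaptive-penalty rule: grow beta when the measured KL
    overshoots the target, shrink it when the KL is well under the target.
    The margin/scale constants follow the heuristic from the PPO paper."""
    if measured_kl > margin * kl_target:
        beta *= scale       # constraint violated too much -> penalize harder
    elif measured_kl < kl_target / margin:
        beta /= scale       # constraint easily satisfied -> relax the penalty
    return beta

def penalized_objective(ratio, advantages, measured_kl, beta):
    """Soft-penalty surrogate: importance-weighted advantage minus beta * KL."""
    return np.mean(ratio * advantages) - beta * measured_kl
```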

Clipping

Alternatively, PPO with clipping (the more popular variant) simply restricts the gradient itself by clipping the importance weights. For our original natural gradient objective

$$\max_\theta \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ r(\theta) \, A^{\pi_{\theta_{\text{old}}}}(s, a) \right], \quad \text{where} \quad r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)},$$

we can constrain $r(\theta)$ to be within $1 - \epsilon$ and $1 + \epsilon$; note that $\epsilon$ is an implicit restriction for the KL constraint, not the same as the actual $\delta$ in the inequality.

If an importance weight is outside the clipping trust region, the gradient is zero since we use $1 - \epsilon$ or $1 + \epsilon$ instead; thus, the advantages achieved by going outside our trust region are effectively ignored, which keeps updates close to the old policy. An equivalent interpretation is that if we maximize the clipped objective, finding some $\theta$ with a huge importance weight on high advantages gives us the same value as finding an update that makes the importance weight exactly $1 + \epsilon$: after clipping, both weights are the same, so there is no incentive for larger updates.
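As a quick numeric check of this "no incentive" argument (a standalone sketch; the value $\epsilon = 0.2$ is just a common default, not taken from the text above):

```python
import numpy as np

def clipped_term(ratio, advantage, eps=0.2):
    # Surrogate contribution once the importance weight has been clipped.
    return np.clip(ratio, 1 - eps, 1 + eps) * advantage

advantage = 1.0                        # a "good" action with positive advantage
print(clipped_term(1.2, advantage))    # 1.2 -> update lands exactly on the clip boundary
print(clipped_term(5.0, advantage))    # still 1.2 -> no extra value, zero gradient w.r.t. the ratio
```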

More specifically, for the clipping, we'll take a pessimistic approach and treat "good" and "bad" cases differently. These rules are also visualized below.

  1. In a "good" case with positive advantage, we'll be cautious and clip the weight at $1 + \epsilon$.
  2. In a "bad" case with negative advantage, we'll allow large importance weights past the clipping boundary, which penalizes them fully and incentivizes updates that avoid the negative advantage. In other words, if $\pi_\theta$ is much more likely to do something with negative advantage, we want to account for that when maximizing and not clip its importance.

Putting the clipping rules together, we get the following definition:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \min\!\left( r(\theta) \, A^{\pi_{\theta_{\text{old}}}}(s, a), \;\; \text{clip}\!\left( r(\theta), 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) \right) \right],$$

where $r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$ is the importance weight. We can now simply perform gradient ascent on $L^{\text{CLIP}}(\theta)$ to optimize our policy.
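Below is a minimal NumPy sketch of this objective (the function and variable names are my own; in practice the same expression is written with an autodiff framework so that gradient ascent on it is automatic):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate L^CLIP: the pessimistic minimum of the unclipped and
    clipped importance-weighted advantages, averaged over sampled transitions."""
    ratio = np.exp(log_prob_new - log_prob_old)          # importance weight r(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # maximize this by gradient ascent

# Example: one "good" and one "bad" transition under an aggressive update (ratio = 5 for both)
log_prob_old = np.log(np.array([0.1, 0.1]))
log_prob_new = np.log(np.array([0.5, 0.5]))
advantages = np.array([1.0, -1.0])
print(ppo_clip_objective(log_prob_new, log_prob_old, advantages))
# (1.2 * 1.0 + 5.0 * -1.0) / 2 = -1.9: the "good" case is clipped, the "bad" case is not
```

The asymmetry in the example output is exactly the pessimism described above: gains from pushing a good action past the trust region are capped, while losses from pushing a bad action past it are counted in full.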