Proximal policy optimization (PPO) is a direct successor to Trust Region Policy Optimization (TRPO) that uses a different methodology to deal with the KL constraint. We still follow the natural gradient objective

$$\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\right)\right] \le \delta,$$
but unlike the Natural Policy Gradient or TRPO, PPO introduces two simple ways of dealing with the constraint. First, an adaptive penalty can be used to estimate the Lagrange multiplier for this optimization; second, we can implicitly enforce the constraint by disincentivizing updates outside of it via clipping. The latter often performs better in practice and is usually the default version of PPO.
Adaptive Penalty
First, PPO with adaptive penalty directly incorporates the constraint into our objective by using a soft penalty; similar to the Lagrangian, we'll optimize

$$\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a) \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\right)\right].$$
At every optimization step, if the constraint is violated by a wide margin (e.g. the measured KL exceeds $1.5\,\delta$), we increase the penalty coefficient $\beta$ so the next update is more conservative; if the measured KL falls well below the target (e.g. smaller than $\delta / 1.5$), we decrease $\beta$.
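As a concrete sketch of this heuristic, the snippet below adapts the penalty coefficient between updates. The function name, the 1.5 tolerance, and the doubling/halving factors are illustrative assumptions (they echo the heuristic from the PPO paper), not a fixed API:

```python
def update_kl_penalty(beta: float, measured_kl: float, kl_target: float) -> float:
    """Adapt the KL penalty coefficient beta after one policy update.

    beta:        current penalty coefficient
    measured_kl: average KL divergence between the old and new policy
    kl_target:   the KL budget (delta) each update should stay near
    """
    if measured_kl > 1.5 * kl_target:
        # Constraint violated by a wide margin: penalize KL more next update.
        beta *= 2.0
    elif measured_kl < kl_target / 1.5:
        # Update was overly conservative: relax the penalty.
        beta /= 2.0
    return beta

# Example: the measured KL overshot the target, so the penalty grows.
beta = update_kl_penalty(beta=1.0, measured_kl=0.03, kl_target=0.01)  # beta -> 2.0
```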
Clipping
Alternatively, PPO with clipping (the more popular variant) simply restricts the gradient itself by clipping the importance weights. For our original natural gradient objective

$$\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\,r(\theta)\, A^{\pi_{\theta_{\text{old}}}}(s,a)\right], \qquad r(\theta) := \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)},$$
we can constrain the importance weight to lie in a small interval around 1 by replacing it with the clipped value

$$\operatorname{clip}\!\left(r(\theta),\, 1-\epsilon,\, 1+\epsilon\right)$$

for some small $\epsilon > 0$.
If an importance weight is outside the clipping trust region $[1-\epsilon,\, 1+\epsilon]$, the gradient is zero, since we use the clipped value $\operatorname{clip}\!\left(r(\theta),\, 1-\epsilon,\, 1+\epsilon\right)$, which is constant with respect to $\theta$ in that region.
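A quick autograd check of that claim (a minimal sketch assuming PyTorch; the numbers are hypothetical, chosen so the ratio falls outside the clip interval):

```python
import torch

epsilon = 0.2
# A ratio of 1.4 lies outside the clipping trust region [0.8, 1.2].
ratio = torch.tensor(1.4, requires_grad=True)

clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
clipped.backward()

print(ratio.grad)  # tensor(0.) -- the clipped value is constant here, so no gradient flows
```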
More specifically, for the clipping, we'll take a pessimistic approach and treat "good" and "bad" cases differently. These rules are also visualized below.
- In a "good" case with positive advantage, we'll be cautious and clip the weight.
- In a "bad" case with negative advantage, we'll allow large importance weights past the clipping to incentivize large updates that avoid the negative advantage. In other words, if $\pi_\theta$ is much more likely to do something with negative advantage, we want to consider it when maximizing and not clip its importance.
Putting the clipping rules together, we get the following definition:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\min\!\left(r(\theta)\, A^{\pi_{\theta_{\text{old}}}}(s,a),\ \operatorname{clip}\!\left(r(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A^{\pi_{\theta_{\text{old}}}}(s,a)\right)\right],$$

where $r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$ is the importance weight and $\epsilon$ is a small clipping hyperparameter (a common choice is $\epsilon = 0.2$). Taking the elementwise minimum of the clipped and unclipped terms implements the pessimistic rule above: positive-advantage updates are capped, while negative-advantage terms are never hidden by the clip.
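To make the definition concrete, here is a minimal PyTorch-style sketch of the clipped surrogate loss (returned negated so it can be minimized); the tensor names and batching convention are assumptions for illustration rather than any particular library's API:

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective L^CLIP, averaged over a batch."""
    # Importance weights r(theta) = pi_theta(a|s) / pi_theta_old(a|s),
    # computed from log-probabilities; the old policy carries no gradient.
    ratios = torch.exp(log_probs_new - log_probs_old.detach())

    # Unclipped and clipped surrogate terms.
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # Pessimistic (elementwise) minimum of the two; negate so that
    # minimizing this loss maximizes L^CLIP.
    return -torch.min(unclipped, clipped).mean()
```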