Proximal policy optimization (PPO) is a direct successor to Trust Region Policy Optimization that uses a different methodology to deal with the KL constraint. We still follow the natural gradient objective

$$\max_\theta \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_s\!\left[ D_{\text{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta,$$

but unlike the Natural Policy Gradient or TRPO, PPO introduces two simple ways of dealing with the constraint. First, an adaptive penalty can be used to estimate the Lagrange multiplier for this optimization; second, we can implicitly enforce the constraint by disincentivizing updates outside of it via clipping. The latter often performs better in practice and is usually the default version of PPO.

Adaptive Penalty

First, PPO with adaptive penalty directly incorporates the constraint into our objective by using a soft penalty; similar to the Lagrangian, we'll optimize

$$\max_\theta \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a) - \beta \, D_{\text{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right].$$

At every optimization step, if the measured KL divergence exceeds the target $\delta$ by too much, we increase $\beta$; if it falls well below $\delta$, we decrease $\beta$. This doesn't strictly enforce the KL constraint, but it maintains the same spirit: make small changes to $\pi_\theta$.
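As a concrete sketch of this rule (the function and variable names here are my own; the 1.5 margin and factor-of-2 scaling are the heuristic values suggested in the PPO paper), the penalty coefficient can be adjusted like this:

```python
import numpy as np

def update_beta(beta, measured_kl, kl_target, margin=1.5, scale=2.0):
    """Illustrative adaptive-penalty rule: grow beta when the measured KL
    overshoots the target, shrink it when the KL is well under the target.
    The margin/scale constants follow the heuristic from the PPO paper."""
    if measured_kl > margin * kl_target:
        beta *= scale       # constraint violated too much -> penalize harder
    elif measured_kl < kl_target / margin:
        beta /= scale       # constraint easily satisfied -> relax the penalty
    return beta

def penalized_objective(ratio, advantages, measured_kl, beta):
    """Soft-penalty surrogate: importance-weighted advantage minus beta * KL."""
    return np.mean(ratio * advantages) - beta * measured_kl
```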

Clipping

Alternatively, PPO with clipping (the more popular variant) simply restricts the gradient itself by clipping the importance weights. For our original natural gradient objective

$$\max_\theta \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ r(\theta) \, A^{\pi_{\theta_{\text{old}}}}(s, a) \right], \quad \text{where} \quad r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)},$$

we can constrain $r(\theta)$ to be within $1 - \epsilon$ and $1 + \epsilon$; note that $\epsilon$ is an implicit restriction for the KL constraint, not the same as the actual $\delta$ in the inequality.

If an importance weight is outside the clipping trust region, the gradient is zero since we use $1 - \epsilon$ or $1 + \epsilon$ instead; thus, the advantages achieved by going outside our trust region are effectively ignored, which keeps updates close to the old policy. An equivalent interpretation is that if we maximize the clipped objective, finding some $\theta$ with a huge importance weight on high advantages gives us the same value as finding an update that makes the importance weight exactly $1 + \epsilon$: after clipping, both weights are the same, so there is no incentive for larger updates.
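As a quick numeric check of this "no incentive" argument (a standalone sketch; the value $\epsilon = 0.2$ is just a common default, not taken from the text above):

```python
import numpy as np

def clipped_term(ratio, advantage, eps=0.2):
    # Surrogate contribution once the importance weight has been clipped.
    return np.clip(ratio, 1 - eps, 1 + eps) * advantage

advantage = 1.0                        # a "good" action with positive advantage
print(clipped_term(1.2, advantage))    # 1.2 -> update lands exactly on the clip boundary
print(clipped_term(5.0, advantage))    # still 1.2 -> no extra value, zero gradient w.r.t. the ratio
```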

More specifically, for the clipping, we'll take a pessimistic approach and treat "good" and "bad" cases differently. These rules are also visualized below.

  1. In a "good" case with positive advantage, we'll be cautious and clip the weight at $1 + \epsilon$.
  2. In a "bad" case with negative advantage, we'll allow large importance weights past the clipping boundary, which penalizes them fully and incentivizes updates that avoid the negative advantage. In other words, if $\pi_\theta$ is much more likely to do something with negative advantage, we want to account for that when maximizing and not clip its importance.

Putting the clipping rules together, we get the following definition:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \min\!\left( r(\theta) \, A^{\pi_{\theta_{\text{old}}}}(s, a), \;\; \text{clip}\!\left( r(\theta), 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) \right) \right],$$

where $r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$ is the importance weight. We can now simply perform gradient ascent on $L^{\text{CLIP}}(\theta)$ to optimize our policy.
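Below is a minimal NumPy sketch of this objective (the function and variable names are my own; in practice the same expression is written with an autodiff framework so that gradient ascent on it is automatic):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate L^CLIP: the pessimistic minimum of the unclipped and
    clipped importance-weighted advantages, averaged over sampled transitions."""
    ratio = np.exp(log_prob_new - log_prob_old)          # importance weight r(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # maximize this by gradient ascent

# Example: one "good" and one "bad" transition under an aggressive update (ratio = 5 for both)
log_prob_old = np.log(np.array([0.1, 0.1]))
log_prob_new = np.log(np.array([0.5, 0.5]))
advantages = np.array([1.0, -1.0])
print(ppo_clip_objective(log_prob_new, log_prob_old, advantages))
# (1.2 * 1.0 + 5.0 * -1.0) / 2 = -1.9: the "good" case is clipped, the "bad" case is not
```

The asymmetry in the example output is exactly the pessimism described above: gains from pushing a good action past the trust region are capped, while losses from pushing a bad action past it are counted in full.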