N-step bootstrapping is an extension of ⌛️ Temporal Difference Learning that incorporates the idea behind 🪙 Monte Carlo Control. Rather than relying on one step to perform bootstrapped updates, the key idea is to use a Monte Carlo estimate for $n$ steps, then finish the rest with a bootstrapped estimate.

Formally, our $n$-step return can be written as

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}).$$

Our value-function update thus requires sampling for $n$ time steps, then computing

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]$$

for the state-value and

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q(S_t, A_t) \right]$$

for the action-value, where the return now bootstraps from $Q(S_{t+n}, A_{t+n})$ instead of $V(S_{t+n})$.
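
To make this concrete, here is a minimal sketch of the tabular $n$-step state-value update in Python; the function names and the toy value table are just illustrative assumptions, not a fixed API.

```python
import numpy as np

def n_step_return(rewards, bootstrap_value, gamma):
    """G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n})."""
    g = bootstrap_value
    for r in reversed(rewards):          # accumulate backwards through the n sampled rewards
        g = r + gamma * g
    return g

def n_step_td_update(V, state, rewards, next_state, gamma=0.99, alpha=0.1):
    """Tabular n-step TD update toward the n-step return."""
    G = n_step_return(rewards, V[next_state], gamma)
    V[state] += alpha * (G - V[state])   # V(S_t) <- V(S_t) + alpha * (G - V(S_t))
    return V

# toy usage: a 3-step update on a 5-state value table
V = np.zeros(5)
V = n_step_td_update(V, state=0, rewards=[1.0, 0.0, 1.0], next_state=3, gamma=0.9, alpha=0.5)
print(V[0])  # 0.5 * (1 + 0.9**2 * 1 + 0.9**3 * V[3]) = 0.905
```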

This action-value update can be directly integrated with 🧭 Sarsa. However, for off-policy methods like 🚀 Q-Learning, we need to correct the Monte Carlo estimate with importance sampling; our importance weight is $\rho_{t:t+n-1} = \prod_{k=t}^{t+n-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$, where $\pi$ is our target policy and $b$ is our behavioral policy. Our updates are then as follows:

$$V(S_t) \leftarrow V(S_t) + \alpha \, \rho_{t:t+n-1} \left[ G_{t:t+n} - V(S_t) \right],$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \, \rho_{t+1:t+n} \left[ G_{t:t+n} - Q(S_t, A_t) \right].$$
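
A minimal sketch of the importance-sampling correction, assuming we already have the per-step probabilities that the target and behavior policies assigned to the sampled actions (all names here are illustrative):

```python
import numpy as np

def importance_weight(target_probs, behavior_probs):
    """rho = prod_k pi(A_k|S_k) / b(A_k|S_k) over the sampled steps."""
    return np.prod(np.asarray(target_probs) / np.asarray(behavior_probs))

def off_policy_n_step_update(V, state, rewards, next_state,
                             target_probs, behavior_probs,
                             gamma=0.99, alpha=0.1):
    """Weight the whole n-step TD error by the importance ratio."""
    G = V[next_state]
    for r in reversed(rewards):                       # n-step return, as before
        G = r + gamma * G
    rho = importance_weight(target_probs, behavior_probs)
    V[state] += alpha * rho * (G - V[state])          # rho corrects for acting under b, not pi
    return V

# toy usage: the target policy was twice as likely as the behavior policy
# to pick the sampled actions, so the update is up-weighted
V = np.zeros(4)
V = off_policy_n_step_update(V, 0, [1.0, 1.0], 2,
                             target_probs=[0.8, 0.5], behavior_probs=[0.4, 0.5],
                             gamma=0.9, alpha=0.1)
print(V[0])  # rho = 2.0, G = 1 + 0.9*1 = 1.9, update = 0.1 * 2.0 * 1.9 = 0.38
```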

Advantage Estimation

In 🚓 Policy Gradient methods, we can use the same idea to estimate the advantage function $A^\pi(s_t, a_t)$. Like above, we can use the single-sample estimate for the next $n$ time-steps, then rely on the critic for everything else.

Formally, we have

$$\hat{A}^{(n)}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V^\pi(s_{t+n}) - V^\pi(s_t),$$

where the last term $V^\pi(s_t)$ is the baseline.

This method gives us a good bias-variance tradeoff in our estimate. Our single-sample estimate has extremely high variance far into the future, because the trajectory can deviate more the further we roll it out, whereas the critic contributes less variance the further in the future it is applied, since its estimate is discounted. Thus, by using the single-sample estimate for the near future and the critic for the far future, we mitigate the disadvantages of both.
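
A small sketch of the $n$-step advantage estimate, assuming a `critic` callable that maps states to scalar value estimates; the trajectory format and all names are assumptions for illustration.

```python
def n_step_advantage(rewards, states, critic, n, t, gamma=0.99):
    """A_hat^(n)_t = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}) - V(s_t)."""
    returns = sum(gamma**k * rewards[t + k] for k in range(n))  # single-sample (Monte Carlo) part
    bootstrap = gamma**n * critic(states[t + n])                # critic handles the far future
    baseline = critic(states[t])                                # subtract the baseline V(s_t)
    return returns + bootstrap - baseline

# toy usage with a hypothetical constant critic
critic = lambda s: 0.5
rewards = [1.0, 0.0, 1.0, 0.0]
states = [0, 1, 2, 3, 4]
print(n_step_advantage(rewards, states, critic, n=2, t=0, gamma=0.9))
# 1.0 + 0.9*0.0 + 0.9**2 * 0.5 - 0.5 = 0.905
```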

Generalized Advantage Estimation

GAE finds a weighted average of multiple $n$-step return estimators with distinct $n$,

$$\hat{A}^{\text{GAE}}_t = \sum_{n=1}^{\infty} w_n \hat{A}^{(n)}_t.$$

We usually prefer smaller $n$ (“cutting” earlier) to reduce the variance of our single-sample estimate, so we set weights

$$w_n = (1 - \lambda)\,\lambda^{n-1}, \qquad \lambda \in [0, 1),$$

where the $(1-\lambda)$ factor normalizes the weights to sum to one. $\lambda$ is extremely similar to the discount term, and if we choose our weights in this way, we can simplify our equation to

$$\hat{A}^{\text{GAE}(\gamma, \lambda)}_t = \sum_{k=0}^{\infty} (\gamma \lambda)^k \delta_{t+k}, \qquad \text{where } \delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t).$$
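
A minimal sketch of GAE computed with the standard backward recursion over a finite trajectory, assuming `values` holds the critic's estimate for each visited state plus one bootstrap value at the end; names are illustrative and episode-termination handling is omitted.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """A_hat^GAE_t = sum_k (gamma*lambda)^k * delta_{t+k},
    with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    `values` has one more entry than `rewards` (the bootstrap value).
    Episode-termination masking is omitted for brevity."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running                  # backward accumulation of (gamma*lambda)^k terms
        advantages[t] = running
    return advantages

# toy usage on a 3-step trajectory
print(gae(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.3, 0.2], gamma=0.9, lam=0.95))
```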