N-step bootstrapping is an extension of Temporal Difference Learning that incorporates the idea behind Monte Carlo Control. Rather than relying on a single step to perform bootstrapped updates, the key idea is to use a Monte Carlo estimate for the first $n$ steps of the return and bootstrap from our value estimate afterwards.
Formally, our $n$-step return is
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}).$$
Our value function update thus requires sampling for $n$ steps, giving
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]$$
for the state-value and
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q(S_t, A_t) \right]$$
for the action-value.
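As a minimal sketch of the state-value version (assuming a tabular value function `V` indexed by state, a window of `n` recorded rewards, and the state `s_tpn` reached `n` steps later; the function names are placeholders):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """G_{t:t+n}: Monte Carlo estimate over n rewards, bootstrapped with V(S_{t+n})."""
    g = 0.0
    for r in reversed(rewards):          # rewards R_{t+1}, ..., R_{t+n}
        g = r + gamma * g
    return g + gamma ** len(rewards) * bootstrap_value

def n_step_td_update(V, s_t, rewards, s_tpn, gamma=0.99, alpha=0.1):
    """Tabular n-step TD update for the state-value function V (e.g. a dict)."""
    G = n_step_return(rewards, V[s_tpn], gamma)
    V[s_t] += alpha * (G - V[s_t])
```

The action-value update is analogous, indexing a table `Q` by state-action pairs instead.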
This action-value update can be directly integrated with Sarsa. However, for off-policy methods like Q-Learning, we need to correct the Monte Carlo estimate with importance sampling; our importance weight over the sampled steps is
$$\rho = \prod_{k=t}^{t+n-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$
where $\pi$ is the target policy and $b$ is the behavior policy.
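A rough sketch of this correction, assuming `pi` and `b` are callables returning the probability of an action in a state (illustrative names, not from any particular library):

```python
def importance_weight(pi, b, states, actions):
    """rho = prod_k pi(A_k | S_k) / b(A_k | S_k) over the sampled n-step window."""
    rho = 1.0
    for s, a in zip(states, actions):
        rho *= pi(a, s) / b(a, s)
    return rho

# Off-policy variant of the update above: scale the TD error by rho,
# e.g. V[s_t] += alpha * rho * (G - V[s_t]).
```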
Advantage Estimation
In Policy Gradient methods, we can use the same idea to estimate the advantage function.
Formally, we have
$$\hat{A}^\pi(s_t, a_t) = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r_{t'} + \gamma^n V^\pi(s_{t+n}) - V^\pi(s_t),$$
where the last term is the baseline.
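A small sketch of this estimate, assuming a learned critic `value_fn` mapping a state to a scalar value and a recorded segment of `n` rewards with the corresponding states (names are placeholders):

```python
def n_step_advantage(rewards, states, value_fn, gamma=0.99):
    """n-step advantage: n discounted rewards, bootstrapped critic, minus baseline."""
    n = len(rewards)                                    # rewards r_t, ..., r_{t+n-1}
    ret = sum(gamma ** k * r for k, r in enumerate(rewards))
    ret += gamma ** n * value_fn(states[-1])            # critic estimate of V(s_{t+n})
    return ret - value_fn(states[0])                    # subtract the baseline V(s_t)
```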
This method gives us a good bias-variance tradeoff in our estimate. The single-sample estimate has extremely high variance far into the future because the trajectory can deviate more the further out we go, while the critic has low variance and its bias matters less in the far future since those terms are discounted by $\gamma^n$. Thus, by using the single sample for the near future and the critic for the far future, we mitigate the disadvantages of both.
Generalized Advantage Estimation
GAE finds a weighted average of multiple $n$-step advantage estimates,
$$\hat{A}^{\text{GAE}}_t = \sum_{n=1}^{\infty} w_n \hat{A}^{(n)}_t.$$
We usually prefer smaller $n$ since shorter horizons have lower variance, so GAE uses exponentially decaying weights $w_n \propto \lambda^{n-1}$; this simplifies to a discounted sum of TD residuals,
$$\hat{A}^{\text{GAE}}_t = \sum_{t'=t}^{\infty} (\gamma\lambda)^{t'-t} \delta_{t'}, \qquad \delta_{t'} = r_{t'} + \gamma V^\pi(s_{t'+1}) - V^\pi(s_{t'}).$$
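In practice this is computed with a single backward pass over the trajectory using the recursion above; a minimal sketch, assuming NumPy arrays of per-step rewards and critic values (array names are assumptions):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for a single trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (one extra entry to bootstrap the final step)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        running = delta + gamma * lam * running                  # exponentially weighted sum
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step TD advantage, while `lam=1` recovers the Monte Carlo estimate, so $\lambda$ directly controls the bias-variance tradeoff discussed above.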