Eligibility traces bridge 🪙 Monte Carlo Control and ⌛️ Temporal Difference Learning under function approximation by averaging 🪜 N-Step Bootstrapping returns. For the $n$-step return

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}),$$
the $\lambda$-return is defined as

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}.$$
This is an exponentially-weighted average with $(1 - \lambda)$ in front to normalize the sum. In the finite-horizon case, we let all post-termination returns be the conventional return $G_t$, so

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t.$$
Note that with $\lambda = 0$, we get the one-step TD return, and with $\lambda = 1$, we have a Monte Carlo estimate.
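To make this forward view concrete, here is a minimal sketch that computes the finite-horizon $\lambda$-return for $t = 0$ from a finished episode; the `lambda_return` function, its `rewards`/`values` inputs, and the assumption of a fully recorded episode are illustrative choices, not part of the text above.

```python
def lambda_return(rewards, values, gamma, lam):
    """Forward-view lambda-return G_0^lambda from a finished episode.
    rewards[k] holds R_{k+1}; values[k] holds v_hat(S_k, w) (illustrative inputs)."""
    T = len(rewards)  # episode terminates at time T
    # n-step returns G_{0:n} for n = 1..T; the last one is the full Monte Carlo return G_0
    n_step = []
    for n in range(1, T + 1):
        G = sum(gamma**k * rewards[k] for k in range(n))
        if n < T:                          # bootstrap only before termination
            G += gamma**n * values[n]
        n_step.append(G)
    # (1 - lam) * sum_{n=1}^{T-1} lam^(n-1) G_{0:n}  +  lam^(T-1) G_0
    G_lam = (1 - lam) * sum(lam**(n - 1) * n_step[n - 1] for n in range(1, T))
    return G_lam + lam**(T - 1) * n_step[-1]

# lam = 0 recovers the one-step TD return; lam = 1 recovers the Monte Carlo return G_0
rewards = [1.0, 0.0, 2.0]                  # R_1, R_2, R_3 (episode ends at T = 3)
values = [0.5, 0.4, 0.3]                   # v_hat(S_0), v_hat(S_1), v_hat(S_2)
print(lambda_return(rewards, values, gamma=0.9, lam=0.8))
```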

However, this formulation isn't commonly used because it requires complex computations for the $n$-step returns. We can instead use an eligibility trace $\mathbf{z}_t$ that has the same dimension as our weights $\mathbf{w}_t$. This vector represents the magnitude of the update performed to $\mathbf{w}_t$ at each time step, and it can be viewed as a "backward" way of updating the weights rather than the "forward" return method above.

Specifically, we initialize $\mathbf{z}_{-1} = \mathbf{0}$, and at each transition,

$$\mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t).$$
The eligibility trace is then used to update our weights,

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t,$$
where $\delta_t$ is some error measure. From this formulation, we can see that the eligibility trace essentially captures the "recency" of updates to our weights; similar to the $\lambda$-return, it places more emphasis on the most recent gradients.
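As a single-step illustration, the backward-view update might look like the sketch below, assuming a linear value function $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ so that the gradient is just the feature vector; `trace_step`, `features`, and `delta` are hypothetical names used only for this example.

```python
import numpy as np

def trace_step(z, w, features, delta, alpha, gamma, lam):
    """One backward-view update with a linear value function, where
    grad v_hat(S_t, w_t) is simply the feature vector x(S_t)."""
    z = gamma * lam * z + features   # z_t = gamma * lambda * z_{t-1} + grad v_hat(S_t, w_t)
    w = w + alpha * delta * z        # w_{t+1} = w_t + alpha * delta_t * z_t
    return z, w
```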

TD($\lambda$)

TD($\lambda$) is the most common implementation of eligibility traces. Our error is set as the TD error

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t),$$
and our weight update can be performed at every time step, similar to standard TD(0). Observe that again, with $\lambda = 0$, we (quite literally) get TD(0), and with $\lambda = 1$, we get TD(1), a Monte Carlo variant that discounts rewards.
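Putting the pieces together, one episode of semi-gradient TD($\lambda$) for linear value prediction could be sketched as below; the Gym-style `env.reset()`/`env.step()` interface, the feature map `phi`, and the random behavior policy are assumptions made for the example, not details from the text.

```python
import numpy as np

def td_lambda_episode(env, phi, w, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of semi-gradient TD(lambda) for linear value prediction.
    phi(s) maps a state to its feature vector (assumed helper)."""
    z = np.zeros_like(w)                  # eligibility trace, z_{-1} = 0
    s, done = env.reset(), False
    while not done:
        a = env.action_space.sample()     # stand-in for the policy being evaluated
        s_next, r, done, _ = env.step(a)

        # TD error: bootstrap from the next state unless the episode has terminated
        v_next = 0.0 if done else w @ phi(s_next)
        delta = r + gamma * v_next - w @ phi(s)

        # Backward-view updates
        z = gamma * lam * z + phi(s)      # accumulate the trace
        w = w + alpha * delta * z         # move the weights along the trace

        s = s_next
    return w
```

With `lam=0` the trace reduces to the current feature vector, so the loop is exactly semi-gradient TD(0), matching the observation above.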