This is an exponentially-weighted average with $(1-\lambda)$ in front to normalize the sum. In the finite horizon, we let post-termination returns be the conventional return $G_t$, so

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t.$$
Note that with $\lambda = 0$, we get the one-step TD return, and with $\lambda = 1$, we have a Monte Carlo estimate.
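To make the forward view concrete, here is a minimal sketch (not from the original text) of computing the finite-horizon $\lambda$-return for the first state of a recorded episode; the value estimator `v(s, w)` and the `rewards`/`states` arrays are assumed placeholders:

```python
def lambda_return(rewards, states, v, w, gamma, lam):
    """Forward-view lambda-return G_0^lambda for a finite episode.

    rewards[k] is R_{k+1}, states[k] is S_k, and v(s, w) is an assumed
    state-value estimator; the episode terminates after T = len(rewards) steps.
    """
    T = len(rewards)
    discounted = [gamma**k * r for k, r in enumerate(rewards)]

    # n-step returns G_{0:n} for n = 1..T; the last one is the full Monte Carlo return.
    n_step_returns = []
    for n in range(1, T + 1):
        G = sum(discounted[:n])
        if n < T:
            G += gamma**n * v(states[n], w)   # bootstrap from the current value estimate
        n_step_returns.append(G)

    # Weight each n-step return by (1 - lam) * lam^(n-1); the terminal (Monte Carlo)
    # return receives the leftover lam^(T-1) mass, matching the finite-horizon formula.
    G_lam = sum((1 - lam) * lam**(n - 1) * n_step_returns[n - 1] for n in range(1, T))
    G_lam += lam**(T - 1) * n_step_returns[-1]
    return G_lam
```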
However, this formulation isn't commonly used because it requires complex computations for the $n$-step returns. We can instead use an eligibility trace $\mathbf{z}_t$ that has the same dimension as our weights $\mathbf{w}$. This vector represents the magnitude of the update performed to $\mathbf{w}$ at each time step, and it can be viewed as a "backward" way of updating the weights rather than the "forward" return method above.
Specifically, we initialize $\mathbf{z}_{-1} = \mathbf{0}$, and at each transition,

$$\mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t).$$
The eligibility trace is then used to update our weights,

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t,$$
where $\delta_t$ is some error measure. From this formulation, we can see that the eligibility trace essentially captures the "recency" of updates to our weights; similar to the $\lambda$-return, it places more emphasis on the most recent values.
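As an illustrative sketch of the backward view (not taken from the text), a single update with an accumulating trace might look like the following, where `grad_v` stands for the gradient of the value estimate at the current state and `delta` for whatever error measure is being used:

```python
import numpy as np

def trace_update(w, z, grad_v, delta, alpha, gamma, lam):
    """One backward-view step with an accumulating eligibility trace.

    w      : weight vector of the value-function approximator
    z      : eligibility trace, same shape as w (start each episode at zeros)
    grad_v : gradient of v_hat(S_t, w) with respect to w
    delta  : scalar error for this transition (e.g., a TD error)
    """
    z = gamma * lam * z + grad_v     # decay old credit, add credit for the current state
    w = w + alpha * delta * z        # recently-visited features get the largest updates
    return w, z

# Example usage with 4 features (all vectors as NumPy arrays):
w, z = np.zeros(4), np.zeros(4)
w, z = trace_update(w, z, grad_v=np.array([1., 0., 0., 0.]), delta=0.5,
                    alpha=0.1, gamma=0.99, lam=0.9)
```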
TD($\lambda$) is the most common implementation of eligibility traces. Our error is set as the TD error

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t),$$
and our weight update can be performed at every time step, similar to standard TD(0). Observe that again, with $\lambda = 0$, we (quite literally) get TD(0), and with $\lambda = 1$, we get TD(1), a Monte Carlo variant that discounts rewards.
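Putting the pieces together, here is a minimal sketch of semi-gradient TD($\lambda$) for linear value estimation under a fixed policy; the Gymnasium-style `env`, the feature map `phi`, and the `policy` function are assumed placeholders rather than anything from the original text:

```python
import numpy as np

def semi_gradient_td_lambda(env, phi, n_features, policy,
                            alpha=0.01, gamma=0.99, lam=0.9, episodes=500):
    """Semi-gradient TD(lambda) with linear features: v_hat(s, w) = w @ phi(s).

    Assumes a Gymnasium-style env (reset()/step()), a feature map
    phi(s) -> np.ndarray of length n_features, and a policy(s) -> action.
    """
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, _ = env.reset()
        z = np.zeros(n_features)                     # trace reset at episode start
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(policy(s))
            done = terminated or truncated
            x = phi(s)
            v_next = 0.0 if terminated else w @ phi(s_next)
            delta = r + gamma * v_next - w @ x       # TD error
            z = gamma * lam * z + x                  # for linear v_hat, the gradient is phi(s)
            w = w + alpha * delta * z                # backward-view weight update
            s = s_next
    return w
```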