This is an exponentially-weighted average with $(1-\lambda)$ in front to normalize the sum. In the finite horizon, we let post-termination returns be the conventional return $G_t$, so

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t.$$
Note that with $\lambda = 0$, we get the one-step TD return, and with $\lambda = 1$, we have a Monte Carlo estimate.
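To make the forward view concrete, here is a minimal sketch (not from the original text) of computing the finite-horizon $\lambda$-return for the first state of a recorded episode; the value estimator `v(s, w)` and the `rewards`/`states` arrays are assumed placeholders:

```python
def lambda_return(rewards, states, v, w, gamma, lam):
    """Forward-view lambda-return G_0^lambda for a finite episode.

    rewards[k] is R_{k+1}, states[k] is S_k, and v(s, w) is an assumed
    state-value estimator; the episode terminates after T = len(rewards) steps.
    """
    T = len(rewards)
    discounted = [gamma**k * r for k, r in enumerate(rewards)]

    # n-step returns G_{0:n} for n = 1..T; the last one is the full Monte Carlo return.
    n_step_returns = []
    for n in range(1, T + 1):
        G = sum(discounted[:n])
        if n < T:
            G += gamma**n * v(states[n], w)   # bootstrap from the current value estimate
        n_step_returns.append(G)

    # Weight each n-step return by (1 - lam) * lam^(n-1); the terminal (Monte Carlo)
    # return receives the leftover lam^(T-1) mass, matching the finite-horizon formula.
    G_lam = sum((1 - lam) * lam**(n - 1) * n_step_returns[n - 1] for n in range(1, T))
    G_lam += lam**(T - 1) * n_step_returns[-1]
    return G_lam
```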
However, this formulation isn't commonly used because it requires complex computations for the $n$-step returns. We can instead use an eligibility trace $\mathbf{z}_t$ that has the same dimension as our weights $\mathbf{w}$. This vector represents the magnitude of the update performed to $\mathbf{w}$ at each time step, and it can be viewed as a "backward" way of updating the weights rather than the "forward" return method above.
Specifically, we initialize $\mathbf{z}_{-1} = \mathbf{0}$, and at each transition,

$$\mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t).$$
The eligibility trace is then used to update our weights,

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t,$$
where $\delta_t$ is some error measure. From this formulation, we can see that the eligibility trace essentially captures the "recency" of updates to our weights; similar to the $\lambda$-return, it places more emphasis on the most recent values.
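As an illustrative sketch of the backward view (not taken from the text), a single update with an accumulating trace might look like the following, where `grad_v` stands for the gradient of the value estimate at the current state and `delta` for whatever error measure is being used:

```python
import numpy as np

def trace_update(w, z, grad_v, delta, alpha, gamma, lam):
    """One backward-view step with an accumulating eligibility trace.

    w      : weight vector of the value-function approximator
    z      : eligibility trace, same shape as w (start each episode at zeros)
    grad_v : gradient of v_hat(S_t, w) with respect to w
    delta  : scalar error for this transition (e.g., a TD error)
    """
    z = gamma * lam * z + grad_v     # decay old credit, add credit for the current state
    w = w + alpha * delta * z        # recently-visited features get the largest updates
    return w, z

# Example usage with 4 features (all vectors as NumPy arrays):
w, z = np.zeros(4), np.zeros(4)
w, z = trace_update(w, z, grad_v=np.array([1., 0., 0., 0.]), delta=0.5,
                    alpha=0.1, gamma=0.99, lam=0.9)
```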
TD($\lambda$) is the most common implementation of eligibility traces. Our error is set as the TD error

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t),$$
and our weight update can be performed at every time step, similar to standard TD(0). Observe that again, with $\lambda = 0$, we (quite literally) get TD(0), and with $\lambda = 1$, we get TD(1), a Monte Carlo variant that discounts rewards.
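Putting the pieces together, here is a minimal sketch of semi-gradient TD($\lambda$) for linear value estimation under a fixed policy; the Gymnasium-style `env`, the feature map `phi`, and the `policy` function are assumed placeholders rather than anything from the original text:

```python
import numpy as np

def semi_gradient_td_lambda(env, phi, n_features, policy,
                            alpha=0.01, gamma=0.99, lam=0.9, episodes=500):
    """Semi-gradient TD(lambda) with linear features: v_hat(s, w) = w @ phi(s).

    Assumes a Gymnasium-style env (reset()/step()), a feature map
    phi(s) -> np.ndarray of length n_features, and a policy(s) -> action.
    """
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, _ = env.reset()
        z = np.zeros(n_features)                     # trace reset at episode start
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(policy(s))
            done = terminated or truncated
            x = phi(s)
            v_next = 0.0 if terminated else w @ phi(s_next)
            delta = r + gamma * v_next - w @ x       # TD error
            z = gamma * lam * z + x                  # for linear v_hat, the gradient is phi(s)
            w = w + alpha * delta * z                # backward-view weight update
            s = s_next
    return w
```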