Temporal difference learning is a class of reinforcement learning algorithms that estimate the value function via bootstrapping and sampling. In this sense, it combines ideas from Dynamic Programming (bootstrapping from current estimates) and Monte Carlo Control (learning from sampled experience).
Specifically, temporal difference methods learn after every time step by using the recursive nature of the value function. The most common example is TD(0).
TD(0)
TD(0) is a value estimation method that updates its value estimate after every action. Specifically, it performs

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

as a proxy for the Bellman Equation

$$v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right].$$

This update step uses a single sampled transition to estimate the expectation, and it uses the current estimate $V(S_{t+1})$ as an approximation for the true value $v_\pi(S_{t+1})$.
The second term is called the TD error,

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$
This quantity comes up often in other reinforcement learning methods.
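As a concrete illustration, here is a minimal sketch of tabular TD(0) prediction for a fixed policy. The environment interface (a Gym-style `reset()` / `step(action)`) and the `policy` function are assumptions made for the example, not part of the definition above.

```python
import collections

def td0_prediction(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) prediction: estimate V(s) for a fixed policy.

    Assumes a Gym-style env (reset() -> state, step(a) -> (state, reward,
    done, info)) and a policy(state) -> action function.
    """
    V = collections.defaultdict(float)  # value estimates, default 0

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)

            # TD target bootstraps from the current estimate V(next_state);
            # terminal states contribute no future value.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]   # the TD error, delta_t
            V[state] += alpha * td_error      # TD(0) update

            state = next_state
    return V
```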
Control
For control, we train action-values $Q(s, a)$ with the same update structure as the prediction rule above.
- Q-Learning is an off-policy algorithm: its TD target bootstraps from the greedy action, so it implicitly learns the action-values of the greedy policy even while following an exploratory behavior policy.
- Sarsa is an on-policy algorithm that performs TD updates with an $\epsilon$-greedy policy, bootstrapping from the action it actually takes next (see the sketch after this list).
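A rough sketch of how the two control updates differ, assuming the same Gym-style environment interface as in the prediction example; the only difference is which next action's value appears in the TD target.

```python
import random
import collections

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def td_control(env, n_actions, num_episodes=500, alpha=0.1, gamma=0.99,
               epsilon=0.1, method="q_learning"):
    """Tabular TD control: Q-learning (off-policy) or Sarsa (on-policy)."""
    Q = collections.defaultdict(float)

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            # Behavior policy is epsilon-greedy in both cases; this action is
            # unused for the update when the episode has terminated.
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)

            if done:
                bootstrap = 0.0
            elif method == "q_learning":
                # Off-policy: bootstrap from the greedy action's value,
                # regardless of what the behavior policy does next.
                bootstrap = max(Q[(next_state, a)] for a in range(n_actions))
            else:  # Sarsa
                # On-policy: bootstrap from the action actually taken next.
                bootstrap = Q[(next_state, next_action)]

            td_error = reward + gamma * bootstrap - Q[(state, action)]
            Q[(state, action)] += alpha * td_error
            state, action = next_state, next_action
    return Q
```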