Q-Learning is an off-policy temporal difference learning method. Our goal is to fit a parameterized $Q_\phi(s, a)$, which directly approximates the optimal Q-function $Q^*(s, a)$, using some policy that generates tuples $(s, a, s', r)$.
Note that the policy that generates these tuples must be soft, so we must use something like Epsilon-Greedy. This algorithm is off-policy in that our behavioral policy explores and generates the training samples, and the Q-function implicitly defines the optimal greedy policy via the maximization operation.
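In the tabular setting, this fit reduces to the classic Q-learning update applied to each observed transition. A minimal sketch, with an illustrative learning rate `alpha` and a Q-table stored as a numpy array:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update on an off-policy transition (s, a, r, s').

    Q is a |S| x |A| numpy array; alpha is the learning rate, gamma the discount.
    """
    target = r + gamma * Q[s_next].max()   # bootstrap with the greedy next-state value
    Q[s, a] += alpha * (target - Q[s, a])  # move Q(s, a) toward the target
    return Q
```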
Approximation
Beyond the tabular case, if we represent our Q-function as a neural network, our algorithm is as follows:
- Take action $a_i$ and observe $(s_i, a_i, s_i', r_i)$.
- Set $y_i = r_i + \gamma \max_{a'} Q_\phi(s_i', a')$.
- Update $\phi \leftarrow \phi - \alpha \frac{dQ_\phi}{d\phi}(s_i, a_i)\left(Q_\phi(s_i, a_i) - y_i\right)$.
Note that this update rule is a single gradient step from Fitted Q-Iteration. The loss we're optimizing is essentially
$$\mathcal{L}(\phi) = \frac{1}{2}\left(Q_\phi(s_i, a_i) - y_i\right)^2,$$
but unlike Fitted Q-Iteration, we don't optimize it to convergence, to avoid overfitting to this single transition.
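A rough PyTorch sketch of this single gradient step; `q_net`, `optimizer`, and the transition tensors (`s`, `a`, `r`, `s_next`) are assumed to exist and are illustrative:

```python
import torch
import torch.nn.functional as F

def online_q_update(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One gradient step of online Q-learning on a (batch of) transition(s)."""
    # Target y = r + gamma * max_a' Q(s', a'), treated as a constant (no backprop).
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=-1).values

    # Current estimate Q(s, a) for the action actually taken.
    q_sa = q_net(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)

    # Squared error between estimate and fixed target (same loss up to a constant factor).
    loss = F.mse_loss(q_sa, y)

    # A single gradient step -- we do NOT iterate to convergence as in Fitted Q-Iteration.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```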
Exploration
Our final policy will be deterministic:
$$\pi(a \mid s) = \mathbb{1}\left[a = \arg\max_{a} Q_\phi(s, a)\right].$$
However, during learning, our actions must include some randomness so that we explore the state space sufficiently.
One common method is Epsilon-Greedy, which takes the greedy action with probability $1 - \epsilon$ and a uniformly random action with probability $\epsilon$.
Another method is to assign probabilities according to the exponentiated Q-values (a softmax or Boltzmann policy),
$$\pi(a \mid s) \propto \exp\left(Q_\phi(s, a)\right).$$
This is a somewhat "softer" version of epsilon-greedy.
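Both exploration schemes can be sketched in a few lines of numpy; the function names and the temperature parameter below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, uniform random otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values) / temperature
    logits = logits - logits.max()                # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```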
Target Network
Another issue is that in our update step, we're trying to fit $Q_\phi$ to a target $y_i$ that itself depends on $Q_\phi$, so the target moves with every gradient step.
Instead, we can keep another network, called the target network $Q_{\phi'}$, and compute the targets with it,
$$y_i = r_i + \gamma \max_{a'} Q_{\phi'}(s_i', a'),$$
and our update step proceeds the same as above. We occasionally update $\phi' \leftarrow \phi$, say every $N$ gradient steps.
This simple update introduces an uneven amount of lag, since an update right after the copy uses a nearly current target while an update just before the next copy uses a target that is $N$ steps stale. An alternative is to instead nudge the target network at every step,
$$\phi' \leftarrow \tau \phi' + (1 - \tau)\phi,$$
with $\tau$ close to $1$, somewhat like Polyak averaging.
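A small sketch of both target-update styles in PyTorch, assuming `q_net` and `target_net` are two modules with identical architecture (the `tau` value is illustrative):

```python
import torch

def hard_update(target_net, q_net):
    """Copy phi' <- phi all at once (done every N gradient steps)."""
    target_net.load_state_dict(q_net.state_dict())

def polyak_update(target_net, q_net, tau=0.995):
    """Slowly track the online network: phi' <- tau * phi' + (1 - tau) * phi."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), q_net.parameters()):
            p_target.mul_(tau).add_(p, alpha=1.0 - tau)
```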
Info
Though usually it's not possible to linearly interpolate neural network weights, Polyak averaging offers some justification for why it works here, assuming that $\phi'$ is similar to $\phi$.
Applying this target network along with the replay buffer gives us Deep Q-Learning.
Multi-Step Returns
Finally, another improvement we can make to encourage faster convergence is to borrow the idea of N-Step Bootstrapping. Since our Q-function is hugely inaccurate at the start of training, we can do better by basing our target more on the actual rewards (our single-sample estimate). Thus, the improved target is
$$y_{i,t} = \sum_{t'=t}^{t+N-1} \gamma^{t'-t} r_{i,t'} + \gamma^{N} \max_{a'} Q_{\phi'}(s_{i,t+N}, a').$$
However, this theoretically requires our transitions to be on-policy. Practical implementations offer heuristics to mitigate this issue, and it largely works well in practice.
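A sketch of computing this target for one time step, assuming we already have the next $N$ rewards along the trajectory and the target network's maximum Q-value at the state $N$ steps ahead (names are illustrative):

```python
def n_step_target(rewards, q_next_max, gamma=0.99):
    """N-step target: sum_{k=0}^{N-1} gamma^k * r_{t+k} + gamma^N * max_a' Q_target(s_{t+N}, a').

    rewards     -- list [r_t, ..., r_{t+N-1}] collected along the trajectory
    q_next_max  -- max_a' Q_target(s_{t+N}, a'), a plain float
    """
    n = len(rewards)
    discounted_rewards = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted_rewards + (gamma ** n) * q_next_max
```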