Q-Learning is an off-policy temporal difference learning method. Our goal is to fit $Q(s, a)$, and instead of training on a dataset like Fitted Q-Iteration, we perform online updates. Traditionally, the tabular Q-learning algorithm performs

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$

which directly approximates the optimal Q-function $Q^*(s, a)$ using some policy that generates tuples $(s, a, r, s')$.

Note that the policy that generates these tuples must be soft, so we must use something like 💰 Epsilon-Greedy. This algorithm is off-policy in that our behavioral policy explores and generates the training samples, and the Q-function implicitly defines the optimal greedy policy via the maximization operation.
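
As a concrete sketch, assuming a Gymnasium-style environment with small discrete state and action spaces (the function and hyperparameter names here are illustrative, not from the original note):

```python
import numpy as np

def tabular_q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Soft behavior policy: explore with probability epsilon.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # Off-policy TD target: bootstrap with the greedy (max) action.
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```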

Approximation

Beyond the tabular case, if we represent our Q-function as a neural network $Q_\phi(s, a)$ with parameters $\phi$, our algorithm is as follows:

  1. Take action $a_i$ and observe $(s_i, a_i, r_i, s_i')$.
  2. Set $y_i = r_i + \gamma \max_{a'} Q_\phi(s_i', a')$.
  3. Update $\phi \leftarrow \phi - \alpha \frac{dQ_\phi}{d\phi}(s_i, a_i) \left( Q_\phi(s_i, a_i) - y_i \right)$.

Note that this update rule is just a single gradient step on the Fitted Q-Iteration objective. The loss we’re optimizing is essentially

$$\mathcal{L}(\phi) = \frac{1}{2} \left\| Q_\phi(s_i, a_i) - y_i \right\|^2,$$

but rather than optimizing it to convergence as in Fitted Q-Iteration, we take only a single gradient step, which avoids overfitting to this single transition.
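
As a rough sketch of one such update in PyTorch (assuming a hypothetical q_net that maps states to per-action values; the helper name and arguments are illustrative):

```python
import torch
import torch.nn.functional as F

def q_learning_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One online Q-learning gradient step (no target network yet)."""
    # TD target y = r + gamma * max_a' Q_phi(s', a'), held fixed (no gradient).
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=-1).values * (1.0 - done)

    # Current estimate Q_phi(s, a).
    q_sa = q_net(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)

    # Squared-error loss between the estimate and the fixed target.
    loss = 0.5 * F.mse_loss(q_sa, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # a single gradient step, not optimized to convergence
    return loss.item()
```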

Exploration

Our final policy will be deterministic:

$$\pi(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q_\phi(s, a') \\ 0 & \text{otherwise.} \end{cases}$$

However, during learning, our actions should not come from a deterministic policy, since if it were suboptimal, we would never explore any better options. Thus, our goal during learning is to encourage exploration.

One common method is 💰 Epsilon-Greedy, which takes the greedy action with probability $1 - \epsilon$ and a random action otherwise.
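
A minimal sketch of this action-selection rule (the function name is illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, uniform random action otherwise."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))
```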

Another method is to assign probabilities according to the $Q$ values,

$$\pi(a \mid s) \propto \exp\left( Q_\phi(s, a) \right).$$
This is a somewhat “softer” version of epsilon-greedy.
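
A sketch of this Boltzmann-style sampling (the temperature parameter is an extra knob added here for numerical control, not something stated in the text):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(probs), p=probs))
```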

Target Network

Another issue is that in our update step, we’re trying to fit $Q_\phi$ to a target that’s also based on $Q_\phi$; this is a “moving target” since any update to $\phi$ also changes the target that we were trying to match in the first place.

Instead, we can keep another network, called the target network $Q_{\phi'}$, that isn’t updated as frequently. Our target is thus

$$y_i = r_i + \gamma \max_{a'} Q_{\phi'}(s_i', a'),$$

and our update step proceeds the same as above. We occasionally update $\phi' \leftarrow \phi$ while training to keep the targets accurate, but with less frequent updates, we effectively avoid the moving target problem.
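
A sketch of how the target network might be maintained and used (PyTorch-style; the helper names and update scheme are illustrative):

```python
import copy
import torch

def make_target(q_net):
    """Create a frozen copy of the Q-network to serve as the target network Q_phi'."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def compute_target(target_net, r, s_next, done, gamma=0.99):
    """y = r + gamma * max_a' Q_phi'(s', a'), bootstrapping from the lagged parameters."""
    with torch.no_grad():
        return r + gamma * target_net(s_next).max(dim=-1).values * (1.0 - done)

def hard_update(target_net, q_net):
    """Occasionally copy phi into phi', e.g. every few thousand gradient steps."""
    target_net.load_state_dict(q_net.state_dict())
```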

This simple update introduces an uneven amount of lag: an update right after the $\phi' \leftarrow \phi$ operation would have a recent $\phi'$ for its target, but updates right before the next copy would be matching a much older (laggier) target. A popular alternative is to update the target at every step with

$$\phi' \leftarrow \tau \phi' + (1 - \tau) \phi$$

for $\tau$ close to $1$, somewhat like Polyak averaging.
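
A sketch of this soft update, reusing the hypothetical networks from the snippets above:

```python
import torch

@torch.no_grad()
def soft_update(target_net, q_net, tau=0.999):
    """phi' <- tau * phi' + (1 - tau) * phi, applied after every training step."""
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.mul_(tau).add_(p, alpha=1.0 - tau)
```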

Info

Though linearly interpolating neural network weights usually doesn’t produce meaningful results, Polyak averaging offers some justification for why it works here, assuming that $\phi'$ is similar to $\phi$.

Applying this target network along with the replay buffer gives us 👾 Deep Q-Learning.
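
Putting the pieces together, here’s a compressed sketch of what such a training loop could look like, with a simple deque standing in for the replay buffer (all names, hyperparameters, and the Gymnasium-style environment API are assumptions for illustration):

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F

def dqn_loop(env, q_net, target_net, optimizer, *, steps=100_000, batch_size=64,
             gamma=0.99, epsilon=0.1, target_period=1_000, buffer_size=50_000):
    """Minimal DQN-style loop: epsilon-greedy collection, replay buffer,
    minibatch Q-learning updates against a lagged target network."""
    buffer = deque(maxlen=buffer_size)
    s, _ = env.reset()

    for step in range(steps):
        # 1. Collect a transition with the epsilon-greedy behavior policy.
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        buffer.append((s, a, r, s_next, float(terminated)))
        s = env.reset()[0] if terminated or truncated else s_next

        if len(buffer) < batch_size:
            continue

        # 2. Sample a minibatch and take one gradient step toward the targets.
        batch = random.sample(buffer, batch_size)
        S, A, R, S2, D = (torch.as_tensor(np.array(x), dtype=torch.float32)
                          for x in zip(*batch))
        with torch.no_grad():
            y = R + gamma * target_net(S2).max(dim=-1).values * (1.0 - D)
        q_sa = q_net(S).gather(-1, A.long().unsqueeze(-1)).squeeze(-1)
        loss = F.mse_loss(q_sa, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # 3. Periodically refresh the target network (hard update).
        if step % target_period == 0:
            target_net.load_state_dict(q_net.state_dict())
```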

Multi-Step Returns

Finally, another improvement we can make to encourage faster convergence is to borrow the idea from 🪜 N-Step Bootstrapping. Since our Q-function is highly inaccurate at the start of training, we can do better by basing our target more on the actual rewards (our single-sample estimate). Thus, the improved target is

$$y_t = \sum_{t'=t}^{t+N-1} \gamma^{t'-t} r_{t'} + \gamma^N \max_{a'} Q_{\phi'}(s_{t+N}, a').$$

However, this target is only theoretically correct when the transitions are on-policy. Practical implementations rely on heuristics to mitigate this issue, and multi-step returns largely work well in practice.
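
A sketch of computing this N-step target from a stored trajectory segment (assuming the N rewards and the state N steps ahead are available; the names are illustrative):

```python
import torch

def n_step_target(rewards, s_after_n, done_after_n, target_net, gamma=0.99):
    """y_t = sum_{k=0..N-1} gamma^k * r_{t+k} + gamma^N * max_a' Q_phi'(s_{t+N}, a')."""
    n = len(rewards)
    # Discounted sum of the N actual rewards.
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # Bootstrap from the target network N steps ahead (unless the episode ended).
    with torch.no_grad():
        bootstrap = target_net(s_after_n).max(dim=-1).values * (1.0 - done_after_n)
    return ret + (gamma ** n) * bootstrap
```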