Q-Learning is an off-policy temporal difference learning method. Our goal is to fit $Q(s, a)$, and instead of training on a dataset like Fitted Q-Iteration, we perform online updates. Traditionally, the tabular Q-learning algorithm performs

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$

which directly approximates the optimal Q-function $Q^*(s, a)$ using some policy that generates tuples $(s, a, r, s')$.

Note that the policy that generates these tuples must be soft, so we must use something like 💰 Epsilon-Greedy. This algorithm is off-policy in that our behavioral policy explores and generates the training samples, and the Q-function implicitly defines the optimal greedy policy via the maximization operation.
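
As a concrete sketch, assuming a Gymnasium-style environment with small discrete state and action spaces (the function and hyperparameter names here are illustrative, not from the original note):

```python
import numpy as np

def tabular_q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Soft behavior policy: explore with probability epsilon.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # Off-policy TD target: bootstrap with the greedy (max) action.
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```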

Approximation

Beyond the tabular case, if we represent our Q-function as a neural network $Q_\phi(s, a)$ with parameters $\phi$, our algorithm is as follows:

  1. Take action $a_i$ and observe $(s_i, a_i, r_i, s_i')$.
  2. Set $y_i = r_i + \gamma \max_{a'} Q_\phi(s_i', a')$.
  3. Update $\phi \leftarrow \phi - \alpha \frac{dQ_\phi}{d\phi}(s_i, a_i) \left( Q_\phi(s_i, a_i) - y_i \right)$.

Note that this update rule is just a single gradient step on the Fitted Q-Iteration objective. The loss we’re optimizing is essentially

$$\mathcal{L}(\phi) = \frac{1}{2} \left\| Q_\phi(s_i, a_i) - y_i \right\|^2,$$

but rather than optimizing it to convergence as in Fitted Q-Iteration, we take only a single gradient step, which avoids overfitting to this single transition.
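
As a rough sketch of one such update in PyTorch (assuming a hypothetical q_net that maps states to per-action values; the helper name and arguments are illustrative):

```python
import torch
import torch.nn.functional as F

def q_learning_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One online Q-learning gradient step (no target network yet)."""
    # TD target y = r + gamma * max_a' Q_phi(s', a'), held fixed (no gradient).
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=-1).values * (1.0 - done)

    # Current estimate Q_phi(s, a).
    q_sa = q_net(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)

    # Squared-error loss between the estimate and the fixed target.
    loss = 0.5 * F.mse_loss(q_sa, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # a single gradient step, not optimized to convergence
    return loss.item()
```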

Exploration

Our final policy will be deterministic:

$$\pi(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q_\phi(s, a') \\ 0 & \text{otherwise.} \end{cases}$$

However, during learning, our actions should not come from a deterministic policy, since if it were suboptimal, we would never explore any better options. Thus, our goal during learning is to encourage exploration.

One common method is 💰 Epsilon-Greedy, which takes the greedy action with probability $1 - \epsilon$ and a random action otherwise.
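
A minimal sketch of this action-selection rule (the function name is illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, uniform random action otherwise."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))
```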

Another method is to assign probabilities according to the $Q$ values,

$$\pi(a \mid s) \propto \exp\left( Q_\phi(s, a) \right).$$
This is a somewhat “softer” version of epsilon-greedy.
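
A sketch of this Boltzmann-style sampling (the temperature parameter is an extra knob added here for numerical control, not something stated in the text):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(probs), p=probs))
```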

Target Network

Another issue is that in our update step, we’re trying to fit $Q_\phi$ to a target that’s also based on $Q_\phi$; this is a “moving target” since any update to $\phi$ also changes the target that we were trying to match in the first place.

Instead, we can keep another network, called the target network $Q_{\phi'}$, that isn’t updated as frequently. Our target is thus

$$y_i = r_i + \gamma \max_{a'} Q_{\phi'}(s_i', a'),$$

and our update step proceeds the same as above. We occasionally update $\phi' \leftarrow \phi$ while training to keep the targets accurate, but with less frequent updates, we effectively avoid the moving target problem.
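
A sketch of how the target network might be maintained and used (PyTorch-style; the helper names and update scheme are illustrative):

```python
import copy
import torch

def make_target(q_net):
    """Create a frozen copy of the Q-network to serve as the target network Q_phi'."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def compute_target(target_net, r, s_next, done, gamma=0.99):
    """y = r + gamma * max_a' Q_phi'(s', a'), bootstrapping from the lagged parameters."""
    with torch.no_grad():
        return r + gamma * target_net(s_next).max(dim=-1).values * (1.0 - done)

def hard_update(target_net, q_net):
    """Occasionally copy phi into phi', e.g. every few thousand gradient steps."""
    target_net.load_state_dict(q_net.state_dict())
```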

This simple update introduces an uneven amount of lag: an update right after the $\phi' \leftarrow \phi$ operation would have a recent $\phi'$ for its target, but updates right before the next copy would be matching a much older (laggier) target. A popular alternative is to update the target at every step with

$$\phi' \leftarrow \tau \phi' + (1 - \tau) \phi$$

for $\tau$ close to $1$, somewhat like Polyak averaging.
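
A sketch of this soft update, reusing the hypothetical networks from the snippets above:

```python
import torch

@torch.no_grad()
def soft_update(target_net, q_net, tau=0.999):
    """phi' <- tau * phi' + (1 - tau) * phi, applied after every training step."""
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.mul_(tau).add_(p, alpha=1.0 - tau)
```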

Info

Though linearly interpolating neural network weights usually doesn’t produce meaningful results, Polyak averaging offers some justification for why it works here, assuming that $\phi'$ is similar to $\phi$.

Applying this target network along with the replay buffer gives us 👾 Deep Q-Learning.
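
Putting the pieces together, here’s a compressed sketch of what such a training loop could look like, with a simple deque standing in for the replay buffer (all names, hyperparameters, and the Gymnasium-style environment API are assumptions for illustration):

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F

def dqn_loop(env, q_net, target_net, optimizer, *, steps=100_000, batch_size=64,
             gamma=0.99, epsilon=0.1, target_period=1_000, buffer_size=50_000):
    """Minimal DQN-style loop: epsilon-greedy collection, replay buffer,
    minibatch Q-learning updates against a lagged target network."""
    buffer = deque(maxlen=buffer_size)
    s, _ = env.reset()

    for step in range(steps):
        # 1. Collect a transition with the epsilon-greedy behavior policy.
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        buffer.append((s, a, r, s_next, float(terminated)))
        s = env.reset()[0] if terminated or truncated else s_next

        if len(buffer) < batch_size:
            continue

        # 2. Sample a minibatch and take one gradient step toward the targets.
        batch = random.sample(buffer, batch_size)
        S, A, R, S2, D = (torch.as_tensor(np.array(x), dtype=torch.float32)
                          for x in zip(*batch))
        with torch.no_grad():
            y = R + gamma * target_net(S2).max(dim=-1).values * (1.0 - D)
        q_sa = q_net(S).gather(-1, A.long().unsqueeze(-1)).squeeze(-1)
        loss = F.mse_loss(q_sa, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # 3. Periodically refresh the target network (hard update).
        if step % target_period == 0:
            target_net.load_state_dict(q_net.state_dict())
```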

Multi-Step Returns

Finally, another improvement we can make to encourage faster convergence is to borrow the idea from 🪜 N-Step Bootstrapping. Since our Q-function is highly inaccurate at the start of training, we can do better by basing our target more on the actual rewards (our single-sample estimate). Thus, the improved target is

$$y_t = \sum_{t'=t}^{t+N-1} \gamma^{t'-t} r_{t'} + \gamma^N \max_{a'} Q_{\phi'}(s_{t+N}, a').$$

However, this target is only theoretically correct when the transitions are on-policy. Practical implementations rely on heuristics to mitigate this issue, and multi-step returns largely work well in practice.
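
A sketch of computing this N-step target from a stored trajectory segment (assuming the N rewards and the state N steps ahead are available; the names are illustrative):

```python
import torch

def n_step_target(rewards, s_after_n, done_after_n, target_net, gamma=0.99):
    """y_t = sum_{k=0..N-1} gamma^k * r_{t+k} + gamma^N * max_a' Q_phi'(s_{t+N}, a')."""
    n = len(rewards)
    # Discounted sum of the N actual rewards.
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # Bootstrap from the target network N steps ahead (unless the episode ended).
    with torch.no_grad():
        bootstrap = target_net(s_after_n).max(dim=-1).values * (1.0 - done_after_n)
    return ret + (gamma ** n) * bootstrap
```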