Deep Q-Learning[1] is a landmark Reinforcement Learning algorithm that uses Q-Learning to train a Convolutional Neural Network, termed a Deep Q-Network (DQN). This takes advantage of the powerful function approximation abilities of CNNs and was originally used to achieve groundbreaking performance on Atari games.
On top of replacing the tabular Q-function with a neural network $Q(s, a; \theta)$, Deep Q-Learning introduces an experience replay buffer $\mathcal{D}$: transitions are stored as they are collected, and updates are computed on mini-batches sampled from it, which decorrelates consecutive samples and lets past experience be reused.
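As a concrete illustration, below is a minimal replay buffer sketch in Python; the `ReplayBuffer` class and its method names are illustrative choices, not code from the original DQN implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```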
The entire algorithm is as follows:
- Take some action $a_t$ with Epsilon-Greedy and observe the transition $(s_t, a_t, r_t, s_{t+1})$, add it to $\mathcal{D}$.
- Sample a mini-batch of transitions $(s_j, a_j, r_j, s_{j+1})$ from $\mathcal{D}$.
- Compute the targets $y_j = r_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta)$.
- Update parameters $\theta$ by gradient descent on the squared TD error $\left(y_j - Q(s_j, a_j; \theta)\right)^2$, averaged over the mini-batch (sketched in code below).
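The four steps might look roughly like the following in PyTorch, assuming a discrete action space, a Gym-style `env.step` that returns `(next_state, reward, done, info)`, and the `ReplayBuffer` sketched above; `q_net`, `optimizer`, and the hyperparameter values are placeholders rather than the reference implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def dqn_training_step(env, state, q_net, buffer, optimizer,
                      epsilon=0.1, gamma=0.99, batch_size=32):
    # 1. Take an action with epsilon-greedy and store the transition in the buffer.
    if np.random.rand() < epsilon:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
            action = int(q_values.argmax(dim=1).item())
    next_state, reward, done, _ = env.step(action)  # classic Gym API assumed
    buffer.add(state, action, reward, next_state, float(done))

    if len(buffer) >= batch_size:
        # 2. Sample a mini-batch of transitions.
        states, actions, rewards, next_states, dones = map(
            np.array, zip(*buffer.sample(batch_size)))
        states = torch.as_tensor(states, dtype=torch.float32)
        actions = torch.as_tensor(actions, dtype=torch.int64)
        rewards = torch.as_tensor(rewards, dtype=torch.float32)
        next_states = torch.as_tensor(next_states, dtype=torch.float32)
        dones = torch.as_tensor(dones, dtype=torch.float32)

        # 3. Compute targets y_j = r_j + gamma * max_a' Q(s_{j+1}, a');
        #    bootstrapping is cut off at terminal states via (1 - dones).
        with torch.no_grad():
            targets = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values

        # 4. Update parameters by gradient descent on the squared TD error.
        predictions = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(predictions, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return next_state, done
```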
Target Network
Unfortunately, Deep Q-Learning satisfies the function approximation, bootstrapping, and off-policy sampling conditions of the Deadly Triad, making it prone to unstable or divergent training. One way to mitigate this instability is to use a target network $Q(s, a; \theta^-)$, a copy of the Q-network whose parameters $\theta^-$ are held fixed and only periodically synchronized with $\theta$.
In the algorithm above, the bootstrap target then becomes $y_j = r_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta^-)$, with $\theta^-$ copied from $\theta$ only every fixed number of steps. Because the regression target no longer shifts with every update, training is considerably more stable.
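As a sketch of how the target network fits in, assuming a PyTorch Q-network like the one above (the helper function names here are hypothetical):

```python
import copy
import torch

def make_target_net(q_net):
    # theta^- starts as a copy of theta and is frozen between periodic syncs.
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def compute_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # Bootstrap targets y_j = r_j + gamma * max_a' Q(s_{j+1}, a'; theta^-).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1 - dones) * next_q

def sync_target(q_net, target_net):
    # Periodically copy theta -> theta^- (e.g. every few thousand environment steps).
    target_net.load_state_dict(q_net.state_dict())
```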
Dueling Architecture
Another improvement to the original DQN is a modification of the architecture. Whereas the original predicts Q-values for each action with a single output head, the dueling architecture splits the network into two streams: one estimating the state value $V(s)$ and one estimating the advantage $A(s, a)$ of each action, which are then combined into Q-value estimates.
While the definition of the value functions gives $Q(s, a) = V(s) + A(s, a)$, this decomposition is unidentifiable since a value for $V(s)$ cannot be recovered uniquely from $Q(s, a)$ alone: adding a constant to the value stream and subtracting it from every advantage leaves the Q-values unchanged. The dueling architecture therefore aggregates the streams as

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right).$$

With an optimal policy, the average should instead be a max operator (since the advantage of the greedy action is zero, so $V^*(s) = \max_{a} Q^*(s, a)$), but subtracting the mean makes optimization more stable in practice.
With this computation, the dueling architecture can be plugged into any algorithm that uses Q-value estimates and offers the benefit of generalizing across actions.
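As one possible realization, here is a minimal PyTorch sketch of a dueling head using the mean-subtracted aggregation; the layer sizes and names are illustrative, not those of the original paper's network.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q-network whose head splits into a state-value stream and an advantage stream."""

    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)                 # V(s)
        self.advantage_stream = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, obs):
        h = self.features(obs)
        value = self.value_stream(h)          # shape (batch, 1)
        advantage = self.advantage_stream(h)  # shape (batch, num_actions)
        # Subtract the mean advantage so V and A are identifiable;
        # replacing the mean with a max gives the alternative aggregation discussed above.
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```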