Deep Q-Learning[^1] is a landmark ♟️ Reinforcement Learning algorithm that uses 🚀 Q-Learning to train a 👁️ Convolutional Neural Network, termed a Deep Q-Network (DQN). This takes advantage of the powerful function approximation abilities of CNNs and was originally used to achieve groundbreaking performance on Atari games.

On top of replacing the tabular Q-table $Q(s, a)$ with a function approximator $Q(s, a; \theta)$, we also use 📺 Experience Replay: instead of performing a Bellman backup only on the transition we just experienced, we store every transition $(s, a, r, s')$ in a replay buffer $\mathcal{D}$ and randomly sample a mini-batch from it to update the weights $\theta$. This avoids strong correlations between consecutive transitions and also increases sample efficiency, since a transition can be used for multiple updates.
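As a concrete illustration, here is a minimal replay buffer sketch in Python; the class name, capacity, and batch size are illustrative choices, not part of the original DQN code:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        # Store one transition (s, a, r, s', done flag).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```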

The entire algorithm is as follows:

  1. Take some action $a$ in state $s$ with 💰 Epsilon-Greedy, observe $(r, s')$, and add the transition $(s, a, r, s')$ to the replay buffer $\mathcal{D}$.
  2. Sample a mini-batch of transitions from $\mathcal{D}$.
  3. For each sampled transition, compute the target $y = r + \gamma \max_{a'} Q(s', a'; \theta)$.
  4. Update the parameters $\theta$ with a gradient step on the loss $\big(y - Q(s, a; \theta)\big)^2$ (a minimal update step is sketched after this list).
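A minimal sketch of one such update step in PyTorch, assuming a value network `q_net`, its `optimizer`, and the `ReplayBuffer` sketched earlier; these names and the discount `gamma` are illustrative, not from the original paper:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One Deep Q-Learning update on a sampled mini-batch."""
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s, a; theta) for the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target y = r + gamma * max_a' Q(s', a'; theta), zeroed at terminal states.
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```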

Target Network

Unfortunately, Deep Q-Learning satisfies the function approximation, bootstrapping, and off-policy sampling conditions of the 💀 Deadly Triad, making it prone to unstable or divergent training. One way to stabilize the bootstrapped target is to use a target network[^2] $\hat{Q}$, which maintains weights $\theta^-$ separate from our value network's weights $\theta$, and to use it in our update instead:

$$y = r + \gamma \max_{a'} \hat{Q}(s', a'; \theta^-)$$

In the algorithm above, the target $y$ depends on $\theta$ and thus will change once $\theta$ is updated. By using a target network with weights $\theta^-$, our target won't change after an update to $\theta$. This network is a “lagging” version of the value network, updated every $C$ steps by setting $\theta^- \leftarrow \theta$, with $\theta^-$ remaining constant between these updates.
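A sketch of how the target network fits in, under the same PyTorch assumptions as the earlier sketches; `copy_period` plays the role of the copy interval $C$:

```python
import copy
import torch

def make_target_net(q_net):
    # The target network starts as an exact copy of the value network (theta_minus = theta).
    target_net = copy.deepcopy(q_net)
    target_net.eval()
    return target_net

def compute_target(target_net, rewards, next_states, dones, gamma=0.99):
    # y = r + gamma * max_a' Q_hat(s', a'; theta_minus), with theta_minus held fixed.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q

def maybe_sync_target(step, q_net, target_net, copy_period=10_000):
    # Every C steps, copy theta into theta_minus; between copies theta_minus stays constant.
    if step % copy_period == 0:
        target_net.load_state_dict(q_net.state_dict())
```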

Dueling Architecture

Another improvement to the original DQN is a modification of the architecture. Whereas the original predicts a Q-value $Q(s, a)$ for each state-action pair, the dueling architecture[^3] decouples state-value from action-value, instead predicting the state value $V(s)$ and advantage $A(s, a)$ (and then using them to compute $Q(s, a)$). Intuitively, this allows the architecture to generalize information across $(s, a)$ pairs for the same state $s$, and it can learn whether a state is valuable without considering the effect of potentially irrelevant actions.

While the definition of the value functions gives

$$Q(s, a) = V(s) + A(s, a),$$

this decomposition is unidentifiable: a given value of $Q(s, a)$ does not determine unique $V(s)$ and $A(s, a)$ (adding a constant to $V$ and subtracting it from $A$ leaves $Q$ unchanged). To address this, we compute

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right).$$

With an optimal policy, the average should instead be a max operator (since the greedy action $a^* = \arg\max_{a'} Q(s, a')$ satisfies $Q(s, a^*) = V(s)$, i.e. $\max_{a'} A(s, a') = 0$), so this prediction is off by a constant. However, the average is more stable empirically.
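A minimal sketch of a dueling head in PyTorch that implements this combination; the layer sizes and stream names are illustrative, not the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Maps a shared feature vector to Q-values via separate V and A streams."""

    def __init__(self, feature_dim, num_actions, hidden=128):
        super().__init__()
        self.value_stream = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.advantage_stream = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions)
        )

    def forward(self, features):
        v = self.value_stream(features)      # shape: (batch, 1)
        a = self.advantage_stream(features)  # shape: (batch, num_actions)
        # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')), making the decomposition identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```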

With this computation, the dueling architecture can be plugged into any algorithm that uses Q-value estimates and offers the benefit of generalizing across actions.

Footnotes

[^1]: Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013)

[^2]: Human-Level Control Through Deep Reinforcement Learning (Mnih et al., 2015)

[^3]: Dueling Network Architectures for Deep Reinforcement Learning (Wang et al., 2016)