In practice, one phenomenon we commonly observe in 🚀 Q-Learning is that the estimates of our Q function are systematically higher than the actual values they're supposed to represent: the expected discounted return. The reason for this is that in our target value calculation,

$$y = r + \gamma \max_{a'} Q(s', a'),$$

we're taking the max over our estimates, $\max_{a'} Q(s', a')$. Since $Q(s', a')$ is only an estimate of the actual Q-value, taking the max biases us toward selecting positive noise, thereby causing an overestimation.

Going a bit deeper, we can rewrite the max operation as

$$\max_{a'} Q(s', a') = Q\left(s', \arg\max_{a'} Q(s', a')\right).$$

In this form, it's more obvious that the fundamental problem is that we're selecting the best action via the argmax over the noisy estimate and then evaluating that action using the same noise. The immediate solution, then, is to avoid using the same noisy estimate for both the argmax (selection) and the evaluation.
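
To see this bias numerically, here's a small NumPy sketch (the action count, noise level, and variable names are illustrative assumptions, not from the original): each per-action estimate is unbiased, yet the max over one noisy set of estimates overestimates the true maximum, while evaluating the argmax with an independent set of estimates does not.

```python
import numpy as np

rng = np.random.default_rng(0)

true_q = np.zeros(5)   # true Q-values for 5 actions are all exactly 0
noise_std = 1.0        # each estimate has unbiased, zero-mean noise
n_trials = 100_000

sum_max, sum_decoupled = 0.0, 0.0
for _ in range(n_trials):
    q_a = true_q + rng.normal(0.0, noise_std, size=true_q.shape)  # noisy estimate A
    q_b = true_q + rng.normal(0.0, noise_std, size=true_q.shape)  # independent estimate B
    sum_max += q_a.max()                # select *and* evaluate with the same noise
    sum_decoupled += q_b[q_a.argmax()]  # select with A, evaluate with B

print("E[max_a Q_A(s,a)]            ≈", sum_max / n_trials)        # ≈ +1.16, overestimates 0
print("E[Q_B(s, argmax_a Q_A(s,a))] ≈", sum_decoupled / n_trials)  # ≈ 0, no bias
```

The second quantity is exactly the decoupling that Double Q-Learning exploits below.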

Double Q-Learning¹,² decorrelates the noise between action selection and evaluation by using separate networks, $Q_A$ and $Q_B$. Our update rule is slightly tweaked,

$$Q_A(s, a) \leftarrow Q_A(s, a) + \alpha \left[ r + \gamma\, Q_B\!\left(s', \arg\max_{a'} Q_A(s', a')\right) - Q_A(s, a) \right],$$

to use one noisy estimate for the argmax and another for the evaluation. Since the noise in $Q_A$, used to select the action, isn't also in $Q_B$, our evaluation won't give the selected action an overestimated value. Note that this update is performed for both $Q_A$ and $Q_B$ (just by swapping $A$ and $B$).
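
As a sketch, the tabular form of this update might look like the following (the function name, the step size `alpha`, and the coin flip choosing which estimator to update are assumptions layered on the rule above):

```python
import numpy as np

def double_q_update(Q_A, Q_B, s, a, r, s_next, done, alpha=0.1, gamma=0.99, rng=None):
    """One Double Q-learning step: pick which table to update at random,
    select the action with that table, but evaluate it with the other one."""
    rng = rng if rng is not None else np.random.default_rng()
    select, evaluate = (Q_A, Q_B) if rng.random() < 0.5 else (Q_B, Q_A)
    a_star = np.argmax(select[s_next])                            # selection with one estimate
    target = r if done else r + gamma * evaluate[s_next, a_star]  # evaluation with the other
    select[s, a] += alpha * (target - select[s, a])

# Toy usage: two Q-tables over 4 states and 2 actions.
Q_A = np.zeros((4, 2))
Q_B = np.zeros((4, 2))
double_q_update(Q_A, Q_B, s=0, a=1, r=1.0, s_next=2, done=False)
```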

In practice, we can use the "current" and "target" networks (from 👾 Deep Q-Learning) for this purpose: select the action with the current network $Q_\theta$ but evaluate it with the target network $Q_{\theta^-}$. Thus, our target becomes

$$y = r + \gamma\, Q_{\theta^-}\!\left(s', \arg\max_{a'} Q_\theta(s', a')\right).$$
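
A minimal sketch of this target computation, with NumPy arrays standing in for the batched network outputs $Q_\theta(s', \cdot)$ and $Q_{\theta^-}(s', \cdot)$ (the function and argument names are assumptions for illustration):

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_current_next, q_target_next, gamma=0.99):
    """Double DQN targets for a batch of transitions.

    rewards:        (B,)   immediate rewards r
    dones:          (B,)   1.0 where s' is terminal, else 0.0
    q_current_next: (B, A) Q_theta(s', .)   from the current network
    q_target_next:  (B, A) Q_theta^-(s', .) from the target network
    """
    best_actions = np.argmax(q_current_next, axis=1)                  # select with current network
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]  # evaluate with target network
    return rewards + gamma * (1.0 - dones) * evaluated
```

Everything else in the loss stays as in 👾 Deep Q-Learning; only the target changes.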

Footnotes

  1. Double Q-learning (Van Hasselt, 2010)

  2. Deep Reinforcement Learning with Double Q-learning (Van Hasselt, 2015)