In practice, one phenomenon we commonly observe in 🚀 Q-Learning is that the estimates of our Q function are systematically higher than the actual value it's supposed to represent: the expected discounted reward over time. The reason for this is that in our target value calculation,

$$y = r + \gamma \max_{a'} Q(s', a'),$$

we're taking the $\max_{a'} Q(s', a')$. Since our $Q(s', a')$ is an estimate of the actual Q-value, taking the max biases us toward selecting positive noise, thereby causing an overestimation.
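
To make this concrete, here is a small NumPy sketch (the toy setup and variable names are just for illustration): every action's true value is zero, so each individual estimate is unbiased, yet the max over the estimates is consistently positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 4, 100_000

# True Q-values are all zero; the "estimates" are the true values plus Gaussian noise.
noisy_q = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

print(noisy_q.mean())               # ~0.0: each estimate is unbiased on its own
print(noisy_q.max(axis=1).mean())   # ~1.0: the max over estimates is biased upward
```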

Going a bit deeper, we can rewrite the max operation as

$$\max_{a'} Q(s', a') = Q\left(s', \arg\max_{a'} Q(s', a')\right).$$

In this form, it's more obvious that the fundamental problem is that we're selecting the best action via the argmax over the noisy estimate and then evaluating that action with the same noise. The immediate solution, then, is to avoid using the same noisy estimate for both the argmax and the evaluation.
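
The same toy setup illustrates the fix (again, the names are illustrative): evaluating the selected action with the same noisy estimate inflates its value, while evaluating it with an independent estimate does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 4, 100_000

# Two independent noisy estimates of the same all-zero true Q-values.
q_est_a = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
q_est_b = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

best = q_est_a.argmax(axis=1)       # select the action with estimate A
rows = np.arange(n_trials)

print(q_est_a[rows, best].mean())   # ~1.0: same noise for selection and evaluation
print(q_est_b[rows, best].mean())   # ~0.0: independent noise removes the bias
```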

Double Q-Learning decorrelates the noise between action selection and evaluation by using two separate networks, $Q_A$ and $Q_B$. Our update rule is slightly tweaked,

$$y = r + \gamma Q_B\left(s', \arg\max_{a'} Q_A(s', a')\right),$$

to use one noisy estimate ($Q_A$) for the argmax and another ($Q_B$) for the evaluation. Since the noise in $Q_A$ used to select the action isn't also present in $Q_B$, our evaluation won't give that action an overestimated value. Note that this update is performed for both $Q_A$ and $Q_B$ (just by swapping the roles of $Q_A$ and $Q_B$).
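
Here is a minimal tabular sketch of this two-estimate update, assuming small discrete state and action spaces; `q_a`, `q_b`, `alpha`, and `gamma` are hypothetical names, and each step randomly picks which table to update (as in the original tabular algorithm).

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99
rng = np.random.default_rng(0)

q_a = np.zeros((n_states, n_actions))
q_b = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_next, done):
    """Randomly update one table, using the other table to evaluate the selected action."""
    if rng.random() < 0.5:
        best = q_a[s_next].argmax()                                 # select with Q_A ...
        target = r + gamma * (0.0 if done else q_b[s_next, best])   # ... evaluate with Q_B
        q_a[s, a] += alpha * (target - q_a[s, a])
    else:
        best = q_b[s_next].argmax()                                 # select with Q_B ...
        target = r + gamma * (0.0 if done else q_a[s_next, best])   # ... evaluate with Q_A
        q_b[s, a] += alpha * (target - q_b[s, a])
```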

In practice, we can use the "current" and "target" networks (from 👾 Deep Q-Learning) for this purpose: select the action with the current network $Q_\theta$ but evaluate it with the target network $Q_{\theta^-}$. Thus,

$$y = r + \gamma Q_{\theta^-}\left(s', \arg\max_{a'} Q_\theta(s', a')\right).$$
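
Below is a minimal PyTorch sketch of this target computation; `online_net`, `target_net`, and the batch tensors are assumed to exist (each network maps a batch of states to per-action Q-values), and the names are illustrative rather than prescribed.

```python
import torch

@torch.no_grad()
def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute r + gamma * Q_target(s', argmax_a Q_online(s', a)) for a batch of transitions."""
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # select with the current network
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluate with the target network
    return rewards + gamma * (1.0 - dones.float()) * next_q
```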