In practice, one phenomenon we commonly observe in Q-Learning is that the estimates of our Q function are systematically higher than the actual value it's supposed to represent: the expected discounted reward over time. The reason for this is that in our target value calculation,

$$y = r + \gamma \max_{a'} Q(s', a'),$$

we're taking the max over noisy estimates of the true action values. Because $\mathbb{E}\big[\max_{a'} Q(s', a')\big] \ge \max_{a'} \mathbb{E}\big[Q(s', a')\big]$, the noise doesn't average out: it pushes the target upward, and the bias is then propagated further through bootstrapping.
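To see this concretely, here is a minimal simulation (NumPy, with hypothetical numbers) comparing the max of the true action values against the average max of noisy estimates of those same values:

```python
import numpy as np

rng = np.random.default_rng(0)

# True action values for a single state (hypothetical numbers).
true_q = np.array([1.0, 1.0, 1.0, 1.0])

# Add zero-mean noise to mimic estimation error, then take the max.
noisy_maxes = [
    np.max(true_q + rng.normal(scale=0.5, size=true_q.shape))
    for _ in range(10_000)
]

print("max of true values:      ", np.max(true_q))        # 1.0
print("average max of estimates:", np.mean(noisy_maxes))  # noticeably above 1.0
```

Even though the noise is zero-mean, the max consistently lands above the true best value, because the max picks out whichever estimate happened to be perturbed upward.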
Going a bit deeper, we rewrite the max operation as

$$\max_{a'} Q(s', a') = Q\Big(s',\ \arg\max_{a'} Q(s', a')\Big).$$
In this form, it's more obvious that the fundamental problem is that we're selecting the best action via the argmax over the noisy estimate, and then we're evaluating that action using the same noise. The immediate solution, then, is to stop using the same noisy estimate for both the selection (argmax) and the evaluation of the selected action.
Double Q-Learning decorrelates the noise between action selection and evaluation by using two separate networks: one noisy estimate picks the argmax, and the other evaluates the chosen action,

$$y = r + \gamma\, Q_B\Big(s',\ \arg\max_{a'} Q_A(s', a')\Big).$$

Since the noise in the two estimates is independent, an action that happens to look best under one estimate's noise is not systematically overvalued by the other, and the upward bias largely cancels.
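As a rough sketch (tabular form, NumPy, function and argument names are my own), the target for updating estimate A selects with A and evaluates with B:

```python
import numpy as np

def double_q_target(reward, next_q_a, next_q_b, gamma=0.99, done=False):
    """Target for updating estimate A: select with A, evaluate with B."""
    if done:
        return reward
    best_action = np.argmax(next_q_a)               # selection uses estimate A
    return reward + gamma * next_q_b[best_action]   # evaluation uses estimate B
```

In the full algorithm the roles of the two estimates are swapped (e.g., at random) on each update, so both keep learning.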
In practice, we can use the "current" and "target" networks we already maintain for Deep Q-Learning for this purpose: select the action with the current network and evaluate it with the target network,

$$y = r + \gamma\, Q_{\theta^-}\Big(s',\ \arg\max_{a'} Q_{\theta}(s', a')\Big),$$

where $\theta$ are the current (online) parameters and $\theta^-$ are the target parameters.
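Following that recipe, here is a minimal PyTorch-style sketch (function and argument names are assumptions of mine, not a fixed API) of the target computation for a batch of transitions:

```python
import torch

@torch.no_grad()
def double_dqn_target(rewards, next_states, dones, current_net, target_net, gamma=0.99):
    """Double-style target: current net selects the action, target net evaluates it."""
    # Action selection with the current (online) network.
    best_actions = current_net(next_states).argmax(dim=1, keepdim=True)
    # Action evaluation with the target network.
    next_values = target_net(next_states).gather(1, best_actions).squeeze(1)
    # Zero out the bootstrap term for terminal transitions.
    return rewards + gamma * (1.0 - dones.float()) * next_values
```

The rest of the training loop is unchanged; only the target computation swaps which network does the selection and which does the evaluation.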