Twin Delayed DDPG (TD3) addresses training instability in DDPG by tackling its key failure mode: the deterministic policy exploiting inaccurate overestimations in the Q-function.
To this end, we'll introduce three tricks:
- Fit two (twin) Q-functions and use the smaller of the two, in the hope that an overestimation in one function won't also occur in the other.
- Delay updates to the policy, essentially training it less frequently than the Q-functions to allow for more accurate action-value estimates.
- Add clipped noise to the selected action, effectively smoothing out action-values and making it harder for the policy to exploit overestimations.
Formally, we'll first define our noised policy. We first add clipped noise $\epsilon \sim \mathcal{N}(0, \sigma)$ to the target policy's action and then clip the result to the valid action range:

$$a'(s') = \operatorname{clip}\!\left(\mu_{\theta_{\text{targ}}}(s') + \operatorname{clip}(\epsilon, -c, c),\; a_{\text{low}},\; a_{\text{high}}\right), \qquad \epsilon \sim \mathcal{N}(0, \sigma).$$
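A minimal PyTorch sketch of this smoothing step, assuming a hypothetical `actor_target` module that maps next observations to actions and an action range of $[-1, 1]$ by default (all names and default values here are illustrative, not taken from the text):

```python
import torch

def smoothed_target_action(actor_target, next_obs, noise_std=0.2, noise_clip=0.5,
                           act_low=-1.0, act_high=1.0):
    """Target policy smoothing: add clipped Gaussian noise to the target
    policy's action, then clip back into the valid action range."""
    with torch.no_grad():
        action = actor_target(next_obs)                   # mu_theta_targ(s')
        noise = torch.randn_like(action) * noise_std      # eps ~ N(0, sigma)
        noise = noise.clamp(-noise_clip, noise_clip)      # clip(eps, -c, c)
        return (action + noise).clamp(act_low, act_high)  # clip to [a_low, a_high]
```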
Next, for training our Q-functions, we use the same target

$$y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{\text{targ}, i}}\!\left(s', a'(s')\right)$$

and regress with samples from the replay buffer $\mathcal{D}$

$$L(\phi_i, \mathcal{D}) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\!\left[\left(Q_{\phi_i}(s, a) - y(r, s', d)\right)^2\right]$$

for $i = 1, 2$.
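As a rough PyTorch sketch of this critic update, assuming hypothetical `critic_1`, `critic_2` modules and target copies that map (observation, action) pairs to Q-values, and a replay batch of tensors with rewards and done flags shaped `(batch_size, 1)`:

```python
import torch
import torch.nn.functional as F

def twin_critic_loss(critic_1, critic_2, critic_target_1, critic_target_2,
                     batch, next_action, gamma=0.99):
    """Regress both Q-functions onto a shared target built from the smaller
    of the two target Q-values (the 'twin' trick)."""
    obs, act, rew, next_obs, done = batch  # (s, a, r, s', d) from the replay buffer
    with torch.no_grad():
        q1_targ = critic_target_1(next_obs, next_action)
        q2_targ = critic_target_2(next_obs, next_action)
        # y(r, s', d) = r + gamma * (1 - d) * min_i Q_targ_i(s', a'(s'))
        y = rew + gamma * (1.0 - done) * torch.min(q1_targ, q2_targ)
    # MSE regression of each Q-function onto the same target y
    return F.mse_loss(critic_1(obs, act), y) + F.mse_loss(critic_2(obs, act), y)
```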
After every two updates to the Q-functions, we update the policy using the first Q-function

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}}\!\left[Q_{\phi_1}\!\left(s, \mu_\theta(s)\right)\right],$$

which is exactly the same objective as in DDPG.
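A short sketch of the delayed policy step, again with illustrative names (`actor`, `critic_1`, `actor_optimizer`, and `policy_delay` are assumptions, not an API from the text); gradient ascent on $\mathbb{E}[Q_{\phi_1}(s, \mu_\theta(s))]$ is implemented as descent on the negated mean:

```python
def delayed_policy_update(actor, critic_1, actor_optimizer, obs,
                          update_step, policy_delay=2):
    """Update the actor only every `policy_delay` critic updates,
    maximizing Q_phi_1(s, mu_theta(s)) exactly as in DDPG."""
    if update_step % policy_delay != 0:
        return None
    actor_loss = -critic_1(obs, actor(obs)).mean()  # maximize E[Q_1(s, mu(s))]
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```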