Twin Delayed DDPG (TD3) addresses training instability in 🧨 DDPG by tackling its main failure mode: the deterministic policy exploiting overestimation errors in the Q-function.

To this end, we’ll introduce three tricks (a minimal code sketch follows the list):

  1. Fit two (twin) Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, and use the smaller of the two, in the hope that an overestimation in one function won’t also occur in the other.
  2. Delay updates to the policy, essentially training it less frequently than the Q-functions to allow for more accurate action-value estimates.
  3. Add clipped noise to the selected action, effectively smoothing out action-values and making it harder for the policy to exploit overestimations.
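Before formalizing these, here is a minimal PyTorch setup sketch of the extra pieces TD3 keeps around relative to DDPG. The network sizes, names, and hyperparameter values below are placeholder assumptions, not taken from this text; the later snippets reuse these names.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # placeholder dimensions

def mlp(in_dim, out_dim):
    # arbitrary small architecture; anything DDPG-like works here
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

class Critic(nn.Module):
    """Q-function Q(s, a): maps an observation-action pair to a scalar value."""
    def __init__(self):
        super().__init__()
        self.net = mlp(obs_dim + act_dim, 1)

    def forward(self, obs, action):
        return self.net(torch.cat([obs, action], dim=-1))

actor = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())  # deterministic policy mu_theta
critic_1, critic_2 = Critic(), Critic()                  # trick 1: twin Q-functions

# target copies of the actor and both critics, as in DDPG
actor_target = copy.deepcopy(actor)
critic_1_target = copy.deepcopy(critic_1)
critic_2_target = copy.deepcopy(critic_2)

policy_delay = 2                  # trick 2: one policy update per two critic updates
noise_std, noise_clip = 0.2, 0.5  # trick 3: scale and clip range of the target-action noise
```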

Formally, we’ll first define our noised policy: we add noise $\epsilon \sim \mathcal{N}(0, \sigma)$, clipped to the range $[-c, c]$, to the action selected by the target policy, then clip the result again to be within the valid action range $[a_{\text{low}}, a_{\text{high}}]$:

$$a'(s') = \operatorname{clip}\!\big(\mu_{\theta'}(s') + \operatorname{clip}(\epsilon, -c, c),\ a_{\text{low}},\ a_{\text{high}}\big), \qquad \epsilon \sim \mathcal{N}(0, \sigma).$$
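Building on the setup sketch above, the noised target action could look like this (the action bounds `act_low`/`act_high` are assumed stand-ins for the environment’s limits):

```python
def noised_target_action(next_obs, act_low=-1.0, act_high=1.0):
    # mu_theta'(s'): the action selected by the target policy
    action = actor_target(next_obs)
    # epsilon ~ N(0, sigma), clipped to [-c, c]
    noise = (torch.randn_like(action) * noise_std).clamp(-noise_clip, noise_clip)
    # add the clipped noise, then clip back into the valid action range
    return (action + noise).clamp(act_low, act_high)
```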

Next, for training our Q-functions, we use the same target for both,

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i'}\big(s', a'(s')\big),$$

and regress with samples $(s, a, r, s')$ from the replay buffer $\mathcal{D}$,

$$\min_{\phi_i}\ \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[\big(Q_{\phi_i}(s, a) - y\big)^2\Big]$$

for $i = 1, 2$.
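A sketch of the corresponding critic update, again reusing the names from the setup sketch (the batch layout and reward shape are assumptions):

```python
import torch.nn.functional as F

def critic_loss(batch, gamma=0.99):
    # batch holds (s, a, r, s') samples from the replay buffer D; reward has shape (N, 1)
    obs, action, reward, next_obs = batch
    with torch.no_grad():
        next_action = noised_target_action(next_obs)
        # shared target y: reward plus the discounted smaller of the two target Q-values
        y = reward + gamma * torch.min(
            critic_1_target(next_obs, next_action),
            critic_2_target(next_obs, next_action),
        )
    # regress each critic toward the same target y
    return F.mse_loss(critic_1(obs, action), y) + F.mse_loss(critic_2(obs, action), y)
```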

After every two updates to the Q-functions, we update the policy using the first Q-function,

$$\max_{\theta}\ \mathbb{E}_{s \sim \mathcal{D}}\Big[Q_{\phi_1}\big(s, \mu_{\theta}(s)\big)\Big],$$

which is exactly the same objective as in DDPG.
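A sketch of the delayed policy update under the same assumed setup (the optimizer choice and the `update_step` counter are assumptions):

```python
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

def maybe_update_policy(obs, update_step):
    # only update the policy once every `policy_delay` critic updates
    if update_step % policy_delay != 0:
        return
    # same objective as DDPG: maximize Q_phi_1(s, mu_theta(s)), i.e. minimize its negative
    policy_loss = -critic_1(obs, actor(obs)).mean()
    actor_optimizer.zero_grad()
    policy_loss.backward()
    actor_optimizer.step()
```

In full TD3, the target networks are also refreshed (by Polyak averaging, as in DDPG) on this same delayed schedule.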