Twin Delayed DDPG (TD3) addresses training instability in DDPG by tackling its key failure mode: the deterministic policy exploiting inaccurate overestimations in the Q-function.
To this end, we'll introduce three tricks:
- Fit two (twin) Q-functions and use the smaller of the two, in the hope that an overestimation in one function won't also occur in the other.
- Delay updates to the policy, essentially training it less frequently than the Q-functions to allow for more accurate action-value estimates.
- Add clipped noise to the selected action, effectively smoothing out action-values and making it harder for the policy to exploit overestimations.
Formally, we'll first define our noised policy. We first add clipped noise $\epsilon \sim \mathcal{N}(0, \sigma)$ to the target policy's action and then clip the result to the valid action range:

$$a'(s') = \operatorname{clip}\!\left(\mu_{\theta_{\text{targ}}}(s') + \operatorname{clip}(\epsilon, -c, c),\; a_{\text{low}},\; a_{\text{high}}\right), \qquad \epsilon \sim \mathcal{N}(0, \sigma).$$
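A minimal PyTorch sketch of this smoothing step, assuming a hypothetical `actor_target` module that maps next observations to actions and an action range of $[-1, 1]$ by default (all names and default values here are illustrative, not taken from the text):

```python
import torch

def smoothed_target_action(actor_target, next_obs, noise_std=0.2, noise_clip=0.5,
                           act_low=-1.0, act_high=1.0):
    """Target policy smoothing: add clipped Gaussian noise to the target
    policy's action, then clip back into the valid action range."""
    with torch.no_grad():
        action = actor_target(next_obs)                   # mu_theta_targ(s')
        noise = torch.randn_like(action) * noise_std      # eps ~ N(0, sigma)
        noise = noise.clamp(-noise_clip, noise_clip)      # clip(eps, -c, c)
        return (action + noise).clamp(act_low, act_high)  # clip to [a_low, a_high]
```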
Next, for training our Q-functions, we use the same target

$$y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{\text{targ}, i}}\!\left(s', a'(s')\right)$$

and regress with samples from the replay buffer $\mathcal{D}$

$$L(\phi_i, \mathcal{D}) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\!\left[\left(Q_{\phi_i}(s, a) - y(r, s', d)\right)^2\right]$$

for $i = 1, 2$.
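As a rough PyTorch sketch of this critic update, assuming hypothetical `critic_1`, `critic_2` modules and target copies that map (observation, action) pairs to Q-values, and a replay batch of tensors with rewards and done flags shaped `(batch_size, 1)`:

```python
import torch
import torch.nn.functional as F

def twin_critic_loss(critic_1, critic_2, critic_target_1, critic_target_2,
                     batch, next_action, gamma=0.99):
    """Regress both Q-functions onto a shared target built from the smaller
    of the two target Q-values (the 'twin' trick)."""
    obs, act, rew, next_obs, done = batch  # (s, a, r, s', d) from the replay buffer
    with torch.no_grad():
        q1_targ = critic_target_1(next_obs, next_action)
        q2_targ = critic_target_2(next_obs, next_action)
        # y(r, s', d) = r + gamma * (1 - d) * min_i Q_targ_i(s', a'(s'))
        y = rew + gamma * (1.0 - done) * torch.min(q1_targ, q2_targ)
    # MSE regression of each Q-function onto the same target y
    return F.mse_loss(critic_1(obs, act), y) + F.mse_loss(critic_2(obs, act), y)
```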
After every two updates to the Q-functions, we update the policy using the first Q-function

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}}\!\left[Q_{\phi_1}\!\left(s, \mu_\theta(s)\right)\right],$$

which is exactly the same objective as in DDPG.
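A short sketch of the delayed policy step, again with illustrative names (`actor`, `critic_1`, `actor_optimizer`, and `policy_delay` are assumptions, not an API from the text); gradient ascent on $\mathbb{E}[Q_{\phi_1}(s, \mu_\theta(s))]$ is implemented as descent on the negated mean:

```python
def delayed_policy_update(actor, critic_1, actor_optimizer, obs,
                          update_step, policy_delay=2):
    """Update the actor only every `policy_delay` critic updates,
    maximizing Q_phi_1(s, mu_theta(s)) exactly as in DDPG."""
    if update_step % policy_delay != 0:
        return None
    actor_loss = -critic_1(obs, actor(obs)).mean()  # maximize E[Q_1(s, mu(s))]
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```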