Deep deterministic policy gradient (DDPG) combines ⚔️ Deterministic Policy Gradient with 👾 Deep Q-Learning. While DQN works in discrete action spaces, we can adapt it to continuous actions by using a deterministic actor-critic, which removes the need for the intractable max over actions. DDPG borrows the replay buffer from DQN, but deviates from DQN's target-network updates: instead of periodic hard copies, the targets are nudged toward the live networks by an extremely small amount at every step.
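As a concrete illustration of the piece borrowed from DQN, here is a minimal replay-buffer sketch in Python; the capacity and the exact tuple contents are illustrative assumptions, not taken from the text above.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO store of past (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the front

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a minibatch, breaking temporal correlations between samples.
        return random.sample(self.buffer, batch_size)
```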

Formally, our setup is below:

  1. We'll have the critic $Q_\phi(s, a)$ and actor $\mu_\theta(s)$ along with targets $Q_{\phi'}$ and $\mu_{\theta'}$.
  2. Our exploration policy is a noisy version of our deterministic one, $\mu'(s_t) = \mu_\theta(s_t) + \mathcal{N}_t$, where $\mathcal{N}_t$ is exploration noise (see the sketch after this list).
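A minimal sketch of this setup in PyTorch. The layer widths, state/action dimensions, action bound, and Gaussian exploration noise are illustrative assumptions rather than choices prescribed by the text.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu_theta(s): maps a state to a single continuous action."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Action-value function Q_phi(s, a)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(state_dim=3, action_dim=1), Critic(state_dim=3, action_dim=1)
# Targets start as exact copies and are only ever nudged toward the live networks.
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

def explore(state, noise_std=0.1):
    """Noisy exploration policy: the deterministic action plus Gaussian noise."""
    with torch.no_grad():
        action = actor(state)
    return (action + noise_std * torch.randn_like(action)).clamp(-1.0, 1.0)
```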

A single time step of the algorithm is as follows:

  1. For state $s_t$, select action $a_t = \mu_\theta(s_t) + \mathcal{N}_t$. Execute $a_t$ and store $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $\mathcal{D}$.
  2. Sample a minibatch of $N$ tuples from $\mathcal{D}$. Each tuple's target will be $y_i = r_i + \gamma\, Q_{\phi'}\big(s_{i+1}, \mu_{\theta'}(s_{i+1})\big)$.
  3. Update the critic by minimizing $L(\phi) = \frac{1}{N} \sum_i \big(y_i - Q_\phi(s_i, a_i)\big)^2$.
  4. Update the actor with the DPG gradient $\nabla_\theta J \approx \frac{1}{N} \sum_i \nabla_a Q_\phi(s_i, a)\big|_{a = \mu_\theta(s_i)} \, \nabla_\theta \mu_\theta(s_i)$.
  5. Update the targets with an extremely small $\tau$: $\phi' \leftarrow \tau \phi + (1 - \tau)\phi'$ and $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$ (steps 2 through 5 are sketched in code after this list).
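A sketch of one update step (steps 2 through 5) under the definitions above, reusing the actor, critic, targets, and buffer from the earlier sketches. The discount $\gamma$, $\tau$, batch size, and learning rates are illustrative assumptions (the text only says $\tau$ is extremely small), and transitions are assumed to be stored as tensors for states/actions and floats for rewards.

```python
import torch
import torch.nn.functional as F

gamma, tau, batch_size = 0.99, 0.005, 128   # illustrative hyperparameters
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def ddpg_update(buffer):
    # Step 2: sample a minibatch of stored (s, a, r, s') transitions.
    states, actions, rewards, next_states = zip(*buffer.sample(batch_size))
    states, actions, next_states = map(torch.stack, (states, actions, next_states))
    rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(-1)

    # Each tuple's target: y = r + gamma * Q_phi'(s', mu_theta'(s')).
    # (Terminal-state masking is omitted here for brevity.)
    with torch.no_grad():
        y = rewards + gamma * critic_target(next_states, actor_target(next_states))

    # Step 3: update the critic by minimizing the mean squared error against y.
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 4: update the actor with the DPG gradient; ascending Q(s, mu(s))
    # is implemented as descending its negative mean.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 5: soft-update both targets with the extremely small tau.
    for net, target in ((critic, critic_target), (actor, actor_target)):
        for p, p_targ in zip(net.parameters(), target.parameters()):
            p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```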