Deep deterministic policy gradient (DDPG) combines Deterministic Policy Gradient with Deep Q-Learning. While DQN works in a discrete action space, we can modify it for continuous actions by using a deterministic actor-critic that removes the need for the intractable max operation. DDPG also borrows the replay buffer and target networks from DQN to stabilize training.
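Concretely, the deterministic actor (written $\mu_\theta$, defined below) stands in for the argmax in the Bellman target:

$$\max_a Q(s, a) \approx Q\big(s, \mu_\theta(s)\big),$$

which costs only a single forward pass through the actor, even when $a$ is continuous.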
Formally, our setup is below:
- We'll have the critic $Q_\phi(s, a)$ and actor $\mu_\theta(s)$, along with targets $Q_{\phi'}$ and $\mu_{\theta'}$.
- Our exploration policy $\mu'$ is a noisy version of our deterministic one, $\mu'(s) = \mu_\theta(s) + \mathcal{N}$ (sketched in code below).
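A minimal sketch of this setup, assuming PyTorch; the network sizes and the Pendulum-like dimensions `obs_dim = 3, act_dim = 1` are illustrative choices, not from the original:

```python
import copy

import torch
import torch.nn as nn

obs_dim, act_dim = 3, 1  # illustrative, e.g. a Pendulum-like task

# Critic Q_phi(s, a): takes concatenated (state, action), returns a scalar value
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# Actor mu_theta(s): maps a state to an action; tanh bounds actions to [-1, 1]
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())

# Targets Q_phi' and mu_theta' start as exact copies of the online networks
critic_target = copy.deepcopy(critic)
actor_target = copy.deepcopy(actor)

def explore(state, noise_std=0.1):
    """Exploration policy mu': the deterministic action plus Gaussian noise."""
    with torch.no_grad():
        return actor(state) + noise_std * torch.randn(act_dim)
```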
A single time step of the algorithm is as follows:
- For state $s_t$, select action $a_t = \mu_\theta(s_t) + \mathcal{N}_t$. Execute $a_t$ and store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$.
- Sample a minibatch $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^N$ from $\mathcal{D}$. Each tuple's target will be $y_i = r_i + \gamma\, Q_{\phi'}\big(s_{i+1}, \mu_{\theta'}(s_{i+1})\big)$.
- Update the critic by minimizing $L(\phi) = \frac{1}{N} \sum_i \big( y_i - Q_\phi(s_i, a_i) \big)^2$.
- Update the actor with the DPG gradient $\nabla_\theta J \approx \frac{1}{N} \sum_i \nabla_a Q_\phi(s_i, a)\big|_{a = \mu_\theta(s_i)}\, \nabla_\theta \mu_\theta(s_i)$.
- Update the targets with an extremely small $\tau$: $\phi' \leftarrow \tau \phi + (1 - \tau)\, \phi'$ and $\theta' \leftarrow \tau \theta + (1 - \tau)\, \theta'$ (see the sketch after this list).
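Putting the step together, here is a minimal sketch of one update, assuming the networks from the setup sketch above and a replay buffer that yields batched tensors `(s, a, r, s_next)` with `r` shaped `(batch, 1)`; the `gamma`, `tau`, and learning-rate values are illustrative, and terminal-state masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

gamma, tau = 0.99, 0.005  # illustrative discount and Polyak factor
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def update(batch):
    s, a, r, s_next = batch  # minibatch tensors sampled from the replay buffer D

    # Targets y_i = r_i + gamma * Q_phi'(s_{i+1}, mu_theta'(s_{i+1}))
    with torch.no_grad():
        q_next = critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
        y = r + gamma * q_next

    # Critic update: minimize the mean squared error against the targets
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend the DPG gradient by minimizing -Q_phi(s, mu_theta(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates with the small tau
    with torch.no_grad():
        for net, target in ((critic, critic_target), (actor, actor_target)):
            for p, p_targ in zip(net.parameters(), target.parameters()):
                p_targ.mul_(1 - tau).add_(tau * p)
```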