Deterministic policy gradient (DPG) uses a deterministic policy

$$a = \mu_\theta(s),$$
which has (by the chain rule) the gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right].$$
In theory, optimizing a deterministic policy requires fewer samples, because the expectation is taken only over the state space rather than over both the state and action spaces. This gradient can be used directly in most policy gradient frameworks with the update

$$\theta_{t+1} = \theta_t + \alpha \, \nabla_\theta \mu_\theta(s_t) \, \nabla_a Q^\mu(s_t, a_t) \big|_{a = \mu_\theta(s_t)}.$$
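As a concrete illustration, here is a minimal PyTorch sketch of one actor update. The network sizes and the names `mu`, `q`, and `actor_opt` are assumptions for the example, not part of the text above; the point is that ascending $Q(s, \mu_\theta(s))$ with autograd realizes exactly the chain-rule product $\nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s, a)$.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and networks (assumed for this sketch).
state_dim, action_dim = 3, 1
mu = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim))    # actor mu_theta(s)
q = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1))  # critic Q(s, a)
actor_opt = torch.optim.Adam(mu.parameters(), lr=1e-3)

states = torch.randn(64, state_dim)   # a batch of sampled states
actions = mu(states)                  # a = mu_theta(s)
# Minimizing -Q(s, mu_theta(s)) ascends Q; autograd applies the chain rule,
# yielding grad_theta mu_theta(s) * grad_a Q(s, a) evaluated at a = mu_theta(s).
actor_loss = -q(torch.cat([states, actions], dim=1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```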
However, a deterministic policy does not explore on its own, so we instead collect samples with a separate behavior policy $\beta$. The off-policy gradient is then

$$\nabla_\theta J_\beta(\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\!\left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right].$$
Note that we do not need importance sampling here: because the target policy is deterministic, the gradient contains no expectation over actions that would have to be re-weighted.
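A common choice of behavior policy is simply the deterministic action plus exploration noise. The sketch below assumes the actor `mu` from the example above; the Gaussian noise scale and clipping bounds are illustrative, not prescribed by the text.

```python
import torch

def behavior_action(mu, state, noise_std=0.1, low=-1.0, high=1.0):
    """Sample an exploratory action: the deterministic action plus Gaussian noise."""
    with torch.no_grad():
        action = mu(state)               # a = mu_theta(s), no gradient needed for sampling
    # noise_std, low, and high are assumed values for illustration.
    action = action + noise_std * torch.randn_like(action)
    return action.clamp(low, high)

# Usage with the actor `mu` and state_dim from the sketch above:
# a = behavior_action(mu, torch.randn(1, state_dim))
```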