Deterministic policy gradient (DPG) uses a deterministic policy $a = \mu_\theta(s)$, which has (by the chain rule) the gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right]$$
Theoretically, optimizing this policy should require fewer samples, since the expectation is taken only over the state distribution rather than over both states and actions. This gradient can be used directly in most policy gradient frameworks with the update

$$\theta_{t+1} = \theta_t + \alpha \, \nabla_\theta \mu_\theta(s_t) \, \nabla_a Q^\mu(s_t, a) \big|_{a = \mu_\theta(s_t)}$$
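As a concrete illustration, here is a minimal sketch of this update using automatic differentiation; the actor, critic, and batch of states are hypothetical stand-ins, and backpropagating $-Q$ through the actor applies exactly the chain rule above.

```python
import torch
import torch.nn as nn

# Hypothetical deterministic actor mu_theta(s) and critic Q(s, a).
state_dim, action_dim = 4, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)                 # placeholder batch of states

actions = actor(states)                             # a = mu_theta(s)
q_values = critic(torch.cat([states, actions], dim=-1))

# Ascend the deterministic policy gradient: autograd chains
# grad_theta mu_theta(s) with grad_a Q(s, a) evaluated at a = mu_theta(s).
actor_opt.zero_grad()
(-q_values.mean()).backward()                       # maximize Q by minimizing -Q
actor_opt.step()
```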
However, a deterministic policy does not explore on its own, so we can instead collect samples with a separate stochastic behavior policy $\beta(a \mid s)$ and learn off-policy.
Note that we do not need importance sampling, since the update only ever evaluates $Q$ at the deterministic action $\mu_\theta(s)$; there is no action distribution to correct for.
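For completeness, a small sketch of one common choice of behavior policy, the deterministic action plus Gaussian exploration noise (the choice used in DDPG); the actor here is again a hypothetical stand-in.

```python
import torch
import torch.nn as nn

# Hypothetical deterministic actor; beta adds Gaussian noise to its output.
state_dim, action_dim = 4, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))

state = torch.randn(state_dim)                      # placeholder observation
with torch.no_grad():
    greedy_action = actor(state)                    # mu_theta(s)
behavior_action = greedy_action + 0.1 * torch.randn_like(greedy_action)

# The transition collected with behavior_action can be replayed for the
# off-policy DPG update above, with no importance weights over actions.
```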