The off-policy actor-critic introduces a replay buffer of transitions $(s_t, a_t, r_t, s_{t+1})$ collected by past policies.
In order to work with samples from a past policy, we need to modify our on-policy Actor-Critic algorithm in two places.
- First, the value estimate $\hat{V}^\pi_\phi(s_t)$ is incorrect, since this estimate would be measuring the value of $s_t$ based on the action taken by the old policy, not the current one.
- Second, our gradient step's $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}^\pi(s_t, a_t)$ term needs to be estimated as an expectation over the current policy, which requires some correction like in Off-Policy Policy Gradient.
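To make the buffer itself concrete, here is a minimal sketch of the kind of replay buffer assumed above; the transition layout, capacity, and uniform sampling are illustrative choices rather than a prescribed implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so later updates can reuse them off-policy."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a batch of transitions collected by past policies.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```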
Value Estimation
To address the first problem, we introduce the action-value $\hat{Q}^\pi_\phi(s_t, a_t)$, which has no requirement that $a_t$ come from the current policy, since the action is an explicit input rather than something implied by the policy. Note that the bootstrapped target $y_t = r_t + \gamma\,\hat{Q}^\pi_\phi(s_{t+1}, a'_{t+1})$ samples the next action $a'_{t+1} \sim \pi_\theta(\cdot \mid s_{t+1})$ from the current policy rather than reusing the one stored in the replay buffer.
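A rough sketch of this value update, assuming PyTorch and hypothetical `q_net`, `policy`, and `q_optim` objects (where `policy(s)` returns a `torch.distributions.Distribution`), might look like this:

```python
import torch
import torch.nn.functional as F

def critic_update(q_net, q_optim, policy, batch, gamma=0.99):
    """One fitted Q step on a replay batch; (s, a) may come from an old policy."""
    s, a, r, s_next, done = batch  # tensors from the replay buffer, r/done of shape [B]

    with torch.no_grad():
        # The next action is sampled from the *current* policy, not the buffer,
        # so the bootstrapped target reflects the policy we are actually improving.
        a_next = policy(s_next).sample()
        q_next = q_net(s_next, a_next).squeeze(-1)  # assumes q_net outputs shape [B, 1]
        target = r + gamma * (1.0 - done) * q_next

    # Regress Q(s, a) toward the bootstrapped target.
    loss = F.mse_loss(q_net(s, a).squeeze(-1), target)
    q_optim.zero_grad()
    loss.backward()
    q_optim.step()
    return loss.item()
```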
Action Sampling
As for the second issue, we'll sample a fresh action $a_t^\pi \sim \pi_\theta(\cdot \mid s_t)$ from the current policy when taking the gradient step. This makes a distinction between the action used to update our action-value estimate and the action used to update our policy; the former can use any action from the replay buffer, while the latter must be drawn from the current policy.
In practice, since computing the advantage requires some estimate of the state-value, which we no longer maintain, we use the action-value $\hat{Q}^\pi_\phi(s_t, a_t^\pi)$ directly instead. Though this would increase our variance by getting rid of the baseline, we can make up for this by simply sampling multiple actions $a_t^\pi$ per state, which is cheap since it requires no additional interaction with the environment.
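Putting both points of this section together, a sketch of the policy update could look like the following, reusing the same hypothetical `policy` and `q_net` as above; `num_action_samples` controls how many fresh actions are averaged per state to offset the missing baseline.

```python
def actor_update(policy, policy_optim, q_net, states, num_action_samples=4):
    """Policy gradient step using actions sampled from the current policy.

    The buffer actions are ignored here; only the buffer *states* are reused.
    Q(s, a) stands in for the advantage, and averaging over several sampled
    actions per state compensates for the variance of dropping the baseline.
    """
    dist = policy(states)  # assumed to return one log-prob per state via log_prob()
    total = 0.0
    for _ in range(num_action_samples):
        a_pi = dist.sample()                                 # fresh action from the current policy
        log_prob = dist.log_prob(a_pi)
        q_value = q_net(states, a_pi).squeeze(-1).detach()   # no gradient through the critic
        total = total + (log_prob * q_value).mean()

    loss = -total / num_action_samples  # negate to ascend the estimated objective
    policy_optim.zero_grad()
    loss.backward()
    policy_optim.step()
    return loss.item()
```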