The off-policy actor-critic introduces a replay buffer that keeps track of all past tuples $(s_i, a_i, r_i, s_i')$ and uses this buffer to train its value function estimate, rather than relying on samples from the current policy's trajectory.

In order to work with samples from a past policy, we need to modify our on-policy 🎭 Actor-Critic algorithm in two places.

  1. First, the value estimate $\hat{V}^\pi_\phi(s_i)$ is incorrect, since this estimate would measure the value of $s_i$ based on the action taken by the old policy, not the current one.
  2. Second, our gradient step needs to be estimated as an expectation over the current policy, which requires some correction like in 🚑 Off-Policy Policy Gradient.
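As a rough illustration of this setup, here is a minimal replay buffer sketch in Python (the fixed capacity, uniform sampling, and tuple layout are my own assumptions, not details from the original text):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample past transitions, which may come from old policies
        return random.sample(self.buffer, batch_size)
```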

Value Estimation

To address the first problem, we introduce the action-value

$$\hat{Q}^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right],$$

which has no requirement that $a_t$ comes from our current policy. Thus, we'll train a network $\hat{Q}^\pi_\phi$ to predict the Q-function instead, and our objective is

$$\mathcal{L}(\phi) = \frac{1}{N} \sum_i \left\| \hat{Q}^\pi_\phi(s_i, a_i) - \left( r_i + \gamma \hat{Q}^\pi_\phi(s_i', a_i') \right) \right\|^2, \qquad a_i' \sim \pi_\theta(\cdot \mid s_i').$$

Note that $a_i'$ is sampled from our current policy. This is essentially a single-sample estimate for

$$r_i + \gamma \, \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s_i')}\left[ \hat{Q}^\pi_\phi(s_i', a') \right].$$
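To make the objective concrete, here is a hedged PyTorch sketch of the critic update (the `q_net` and `policy` objects, their `q_net(states, actions)` and `policy.sample(states)` interfaces, and the discount factor are hypothetical assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def critic_update(q_net, policy, batch, optimizer, gamma=0.99):
    """One regression step on Q_phi using replay-buffer transitions."""
    states, actions, rewards, next_states = batch  # tensors drawn from the replay buffer

    # The next action is sampled from the *current* policy, not the one that collected the data
    with torch.no_grad():
        next_actions = policy.sample(next_states)                     # a_i' ~ pi_theta(. | s_i')
        targets = rewards + gamma * q_net(next_states, next_actions)  # single-sample target y_i

    q_values = q_net(states, actions)      # (s_i, a_i) may come from an old policy
    loss = F.mse_loss(q_values, targets)   # || Q_phi(s_i, a_i) - y_i ||^2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```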

Action Sampling

As for the second issue, we'll sample $a_i^\pi \sim \pi_\theta(\cdot \mid s_i)$ from our current policy and calculate the gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta(a_i^\pi \mid s_i) \, \hat{A}^\pi(s_i, a_i^\pi).$$

This makes a distinction between the action used to update our action-value estimate and the action used to update our policy; the former can use any action from the replay buffer while the latter must come from the current policy.

In practice, since computing the advantage requires some estimate of the state-value, we use the action-value directly instead,

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta(a_i^\pi \mid s_i) \, \hat{Q}^\pi_\phi(s_i, a_i^\pi).$$

Though this increases our variance by getting rid of the baseline, we can make up for it by simply sampling multiple actions from our policy and running this update for each one. The key idea here is that $a_i^\pi$ can come directly from our policy and has no reliance on the actual environment.
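A matching sketch of the actor update, under the same hypothetical `q_net` and `policy` interfaces, samples several actions from the current policy for each state and averages the resulting gradient estimates:

```python
import torch

def actor_update(q_net, policy, states, optimizer, num_action_samples=4):
    """Policy gradient step using actions freshly sampled from the current policy."""
    loss = 0.0
    for _ in range(num_action_samples):
        # a_i^pi comes from the current policy; no environment interaction is needed
        actions = policy.sample(states)
        log_probs = policy.log_prob(states, actions)   # log pi_theta(a_i^pi | s_i)
        with torch.no_grad():
            q_values = q_net(states, actions)          # Q_phi(s_i, a_i^pi) replaces the advantage
        # Gradient ascent on J(theta) = descent on -J(theta); averaging over samples lowers variance
        loss = loss - (log_probs * q_values).mean() / num_action_samples

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```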