The off-policy actor-critic introduces a replay buffer of transitions $(s_t, a_t, r_t, s_{t+1})$ collected by past policies.
In order to work with samples from a past policy, we need to modify our on-policy Actor-Critic algorithm in two places.
- First, the value estimate $\hat{V}^\pi_\phi(s_t)$ is incorrect, since this estimate would be measuring the value of $s_t$ based on the action taken by the old policy, not the current one.
- Second, our gradient step's $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}^\pi(s_t, a_t)$ term needs to be estimated as an expectation over the current policy, which requires some correction like in Off-Policy Policy Gradient.
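To make the buffer itself concrete, here is a minimal sketch of the kind of replay buffer assumed above; the transition layout, capacity, and uniform sampling are illustrative choices rather than a prescribed implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so later updates can reuse them off-policy."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a batch of transitions collected by past policies.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```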
Value Estimation
To address the first problem, we introduce the action-value $\hat{Q}^\pi_\phi(s_t, a_t)$, which has no requirement that $a_t$ come from the current policy, since the action is an explicit input rather than something implied by the policy. Note that the bootstrapped target $y_t = r_t + \gamma\,\hat{Q}^\pi_\phi(s_{t+1}, a'_{t+1})$ samples the next action $a'_{t+1} \sim \pi_\theta(\cdot \mid s_{t+1})$ from the current policy rather than reusing the one stored in the replay buffer.
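A rough sketch of this value update, assuming PyTorch and hypothetical `q_net`, `policy`, and `q_optim` objects (where `policy(s)` returns a `torch.distributions.Distribution`), might look like this:

```python
import torch
import torch.nn.functional as F

def critic_update(q_net, q_optim, policy, batch, gamma=0.99):
    """One fitted Q step on a replay batch; (s, a) may come from an old policy."""
    s, a, r, s_next, done = batch  # tensors from the replay buffer, r/done of shape [B]

    with torch.no_grad():
        # The next action is sampled from the *current* policy, not the buffer,
        # so the bootstrapped target reflects the policy we are actually improving.
        a_next = policy(s_next).sample()
        q_next = q_net(s_next, a_next).squeeze(-1)  # assumes q_net outputs shape [B, 1]
        target = r + gamma * (1.0 - done) * q_next

    # Regress Q(s, a) toward the bootstrapped target.
    loss = F.mse_loss(q_net(s, a).squeeze(-1), target)
    q_optim.zero_grad()
    loss.backward()
    q_optim.step()
    return loss.item()
```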
Action Sampling
As for the second issue, we'll sample a fresh action $a_t^\pi \sim \pi_\theta(\cdot \mid s_t)$ from the current policy when taking the gradient step. This makes a distinction between the action used to update our action-value estimate and the action used to update our policy; the former can use any action from the replay buffer, while the latter must be drawn from the current policy.
In practice, since computing the advantage requires some estimate of the state-value, which we no longer maintain, we use the action-value $\hat{Q}^\pi_\phi(s_t, a_t^\pi)$ directly instead. Though this would increase our variance by getting rid of the baseline, we can make up for this by simply sampling multiple actions $a_t^\pi$ per state, which is cheap since it requires no additional interaction with the environment.
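Putting both points of this section together, a sketch of the policy update could look like the following, reusing the same hypothetical `policy` and `q_net` as above; `num_action_samples` controls how many fresh actions are averaged per state to offset the missing baseline.

```python
def actor_update(policy, policy_optim, q_net, states, num_action_samples=4):
    """Policy gradient step using actions sampled from the current policy.

    The buffer actions are ignored here; only the buffer *states* are reused.
    Q(s, a) stands in for the advantage, and averaging over several sampled
    actions per state compensates for the variance of dropping the baseline.
    """
    dist = policy(states)  # assumed to return one log-prob per state via log_prob()
    total = 0.0
    for _ in range(num_action_samples):
        a_pi = dist.sample()                                 # fresh action from the current policy
        log_prob = dist.log_prob(a_pi)
        q_value = q_net(states, a_pi).squeeze(-1).detach()   # no gradient through the critic
        total = total + (log_prob * q_value).mean()

    loss = -total / num_action_samples  # negate to ascend the estimated objective
    policy_optim.zero_grad()
    loss.backward()
    policy_optim.step()
    return loss.item()
```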