Actor-Critic is a 🚓 Policy Gradient algorithm that augments the standard gradient calculation by also estimating value functions. The “actor” represents our policy gradient updates, and the “critic” is our value function updates.

Actor

To motivate the formulation of our gradient, note that the reward summation in the standard policy gradient, after applying causality, can be better estimated with the Q-function. Thus, our gradient is

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, Q(s_{i,t}, a_{i,t}).$$

To reduce variance, we can incorporate a baseline: the average of $Q(s_t, a_t)$ over the actions at state $s_t$, which is exactly the definition of our value function

$$V(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q(s_t, a_t) \right].$$

Thus, our gradient with the baseline gives us the actor-critic gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A(s_{i,t}, a_{i,t}).$$

$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ is the advantage function, which is the difference between the expected reward of an action $a_t$ and the average reward of actions at state $s_t$. Intuitively, this gradient increases probabilities for actions that are above average and decreases those that are below average.
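As a rough sketch of the actor update (not from the original note), the following uses PyTorch with assumed placeholders `policy_net`, `optimizer`, `states`, `actions`, and `advantages`; it performs gradient ascent on the objective above by minimizing its negation.

```python
import torch

def actor_update(policy_net, optimizer, states, actions, advantages):
    """One policy-gradient (actor) step using precomputed advantages.

    Assumes policy_net(states) returns action logits for a discrete
    action space; advantages are treated as constants.
    """
    logits = policy_net(states)                       # (batch, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)     # log pi_theta(a | s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Gradient ascent on E[log pi * A]  ==  gradient descent on its negative.
    loss = -(chosen * advantages.detach()).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```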

Note

Note that this formulation is extremely similar to ♻️ Policy Iteration; the main difference is that we’re performing a gradient ascent step along this gradient, whereas policy iteration directly redefines the policy to take the best action under the current value estimate.

Critic

To find the advantage function, we note that

$$Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[ V(s_{t+1}) \right] \approx r(s_t, a_t) + V(s_{t+1}).$$

Thus, our Q-function and advantage function can both be determined by the value function,

$$A(s_t, a_t) \approx r(s_t, a_t) + V(s_{t+1}) - V(s_t),$$

or the following if we incorporate discount factor $\gamma$,

$$A(s_t, a_t) \approx r(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t).$$
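Below is a minimal sketch of this one-step advantage estimate, assuming a `value_net` that maps states to scalar value predictions and a `dones` mask for terminal transitions (the terminal handling is an added assumption, not part of the note above):

```python
import torch

def one_step_advantage(value_net, states, rewards, next_states, dones, gamma=0.99):
    """A(s_t, a_t) ~= r(s_t, a_t) + gamma * V(s_{t+1}) - V(s_t)."""
    with torch.no_grad():
        v_s = value_net(states).squeeze(-1)
        v_next = value_net(next_states).squeeze(-1)
        # Zero out the bootstrap term on terminal transitions.
        targets = rewards + gamma * (1.0 - dones) * v_next
    return targets - v_s
```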

Finding the value function estimate can be done via Bootstrap Estimate, either by training in batches or with online updates after every step. Introducing this neural network to our policy gradient reduces its variance since the network predicts similar advantages for similar states, thereby generalizing the simple one-sample estimate used in Monte Carlo policy gradient.
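A similarly hedged sketch of a bootstrapped critic update, reusing the assumed placeholders from the sketches above: the target $r + \gamma V(s_{t+1})$ is held fixed (no gradient flows through it) while $V(s_t)$ is regressed toward it.

```python
import torch
import torch.nn.functional as F

def critic_update(value_net, optimizer, states, rewards, next_states, dones, gamma=0.99):
    """One bootstrapped regression step for the value network."""
    with torch.no_grad():
        # Bootstrap target: r + gamma * V(s'), with no gradient through V(s').
        targets = rewards + gamma * (1.0 - dones) * value_net(next_states).squeeze(-1)

    values = value_net(states).squeeze(-1)
    loss = F.mse_loss(values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```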

Parallel Actor-Critic

To stabilize training the value function, we can run multiple actor-critics at the same time with the same policy but different random seeds. Each instance makes online updates to the same neural network for the value estimate $\hat{V}^\pi$.

Synchronized parallel actor-critic makes these updates together in a batch, whereas asynchronous parallel actor-critic updates whenever an individual instance finishes its step. One slight theoretical drawback of the asynchronous version is that some instances might train the network on samples drawn using an old policy; however, the practical performance benefits usually outweigh this drawback.
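As a rough illustration of the synchronous variant (assumptions: a list of independently seeded environment instances `envs` with a simplified `env.step` API returning state tensors, shared `policy_net` and `value_net`, and the helper functions sketched earlier), each call collects one transition per instance and applies the actor and critic updates as a single batch:

```python
import torch

def synchronous_step(envs, policy_net, value_net, actor_opt, critic_opt, states, gamma=0.99):
    """Collect one transition from each parallel instance, then update once."""
    with torch.no_grad():
        logits = policy_net(states)
        actions = torch.distributions.Categorical(logits=logits).sample()

    next_states, rewards, dones = [], [], []
    for env, a in zip(envs, actions.tolist()):
        s_next, r, done = env.step(a)          # assumed simplified env API
        next_states.append(s_next)
        rewards.append(r)
        dones.append(float(done))

    next_states = torch.stack(next_states)
    rewards = torch.tensor(rewards)
    dones = torch.tensor(dones)

    # Batched critic and actor updates on the shared networks.
    critic_update(value_net, critic_opt, states, rewards, next_states, dones, gamma)
    advantages = one_step_advantage(value_net, states, rewards, next_states, dones, gamma)
    actor_update(policy_net, actor_opt, states, actions, advantages)
    return next_states
```

The asynchronous version would instead let each instance apply its own update to the shared networks as soon as its sample is ready, at the cost of occasionally updating with slightly stale policies.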