Actor-Critic is a Policy Gradient algorithm that augments the standard gradient calculation by also estimating value functions. The “actor” represents our policy gradient updates, and the “critic” represents our value function updates.
Actor
To motivate the formulation of our gradient, note that the reward summation in the standard policy gradient, after applying causality, can be better estimated with the Q-function. Thus, our gradient is

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, Q^\pi(s_{i,t}, a_{i,t}).$$
To reduce variance, we can incorporate a baseline: the average of the Q-values at each state, which is exactly the value function,

$$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)} \left[ Q^\pi(s_t, a_t) \right].$$
Thus, our gradient with the baseline gives us the actor-critic gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t}),$$

where $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ is the advantage function.
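As a concrete illustration (not from the original notes), here is a minimal sketch of the actor's gradient step in PyTorch, assuming a hypothetical `policy_net` over a 4-dimensional state space with 2 discrete actions and precomputed advantage estimates:

```python
import torch
import torch.nn as nn

# Hypothetical small policy network: states -> action logits (dimensions are arbitrary).
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=3e-4)

def actor_update(states, actions, advantages):
    """One gradient ascent step on the actor-critic objective.

    states:     (N, 4) float tensor of visited states
    actions:    (N,)   long tensor of actions taken
    advantages: (N,)   float tensor of advantage estimates A^pi(s, a)
    """
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Negative sign: minimizing this loss performs gradient *ascent* on J(theta).
    loss = -(log_probs * advantages.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```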
Note
Note that this formulation is extremely similar to ♻️ Policy Iteration; the main difference is that we're taking a gradient ascent step on the policy, whereas policy iteration directly redefines the policy as the greedy (argmax) policy with respect to the value estimates.
Critic
To find the advantage function, we note that

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)} \left[ V^\pi(s_{t+1}) \right] \approx r(s_t, a_t) + V^\pi(s_{t+1}).$$
Thus, our Q-function and advantage function can both be determined by the value function,

$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t),$$
or the following if we incorporate the discount factor $\gamma$:

$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t).$$
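A minimal sketch of this advantage estimate, assuming a hypothetical `value_net` that maps states to scalar value estimates:

```python
import torch

def estimate_advantages(value_net, states, next_states, rewards, dones, gamma=0.99):
    """A^pi(s_t, a_t) ~= r(s_t, a_t) + gamma * V^pi(s_{t+1}) - V^pi(s_t)."""
    with torch.no_grad():
        v_s = value_net(states).squeeze(-1)          # V^pi(s_t)
        v_next = value_net(next_states).squeeze(-1)  # V^pi(s_{t+1})
    # Zero out the bootstrap term at terminal transitions.
    return rewards + gamma * (1.0 - dones) * v_next - v_s
```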
Finding the value function estimate $\hat{V}^\pi_\phi$ is then a policy evaluation problem: we fit a neural network by regressing onto targets $y_{i,t}$, either Monte Carlo returns or the bootstrapped estimate $y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})$, minimizing

$$\mathcal{L}(\phi) = \frac{1}{2} \sum_i \left\lVert \hat{V}^\pi_\phi(s_i) - y_i \right\rVert^2.$$
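A sketch of this critic regression step under the same assumptions (hypothetical `value_net`, bootstrapped targets, arbitrary dimensions):

```python
import torch
import torch.nn as nn

# Hypothetical value network: states -> scalar value estimate.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
value_optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def critic_update(states, rewards, next_states, dones, gamma=0.99):
    """Fit V^pi_phi by regression onto bootstrapped targets y = r + gamma * V(s')."""
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * value_net(next_states).squeeze(-1)
    values = value_net(states).squeeze(-1)
    loss = 0.5 * (values - targets).pow(2).mean()
    value_optimizer.zero_grad()
    loss.backward()
    value_optimizer.step()
```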
Parallel Actor-Critic
To stabilize training of the value function, we can run multiple actor-critic instances at the same time with the same policy, each generating its own samples and updates.
Synchronized parallel actor-critic makes these updates together in a batch, whereas asynchronous parallel actor-critic updates whenever an instance finishes. One slight theoretical drawback of the asynchronous version is that some instances might train the network on samples drawn from an old policy; in practice, however, the performance benefits usually outweigh this drawback.
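For intuition, a sketch of one synchronized step, assuming hypothetical `envs` (classic Gym-style environments with obs-only `reset` and 4-tuple `step`, which is an assumption) and the `policy_net`, `value_net`, `actor_update`, `critic_update`, and `estimate_advantages` sketches above:

```python
import torch

def synchronized_step(envs, obs, gamma=0.99):
    """One synchronized parallel actor-critic step.

    Each worker environment contributes one transition under the *shared* policy;
    all transitions are then batched into a single critic and actor update.
    """
    states, actions, rewards, next_states, dones = [], [], [], [], []
    for i, env in enumerate(envs):
        s = torch.as_tensor(obs[i], dtype=torch.float32)
        a = torch.distributions.Categorical(logits=policy_net(s)).sample()
        next_obs, r, done, _ = env.step(a.item())  # classic Gym step signature (assumption)
        states.append(s)
        actions.append(a)
        rewards.append(torch.tensor(float(r)))
        next_states.append(torch.as_tensor(next_obs, dtype=torch.float32))
        dones.append(torch.tensor(float(done)))
        obs[i] = env.reset() if done else next_obs
    states, next_states = torch.stack(states), torch.stack(next_states)
    actions, rewards, dones = torch.stack(actions), torch.stack(rewards), torch.stack(dones)
    advantages = estimate_advantages(value_net, states, next_states, rewards, dones, gamma)
    critic_update(states, rewards, next_states, dones, gamma)
    actor_update(states, actions, advantages)
    return obs
```

The asynchronous version instead lets each worker push its gradient to the shared parameters as soon as it finishes, at the cost of occasionally using slightly stale policies.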