Soft actor-critic (SAC) is a stochastic off-policy algorithm that employs 🎲 Entropy Regularization with 🎭 Actor-Critic. We'll learn a policy $\pi_\theta$, action-value $Q_\phi$, and state-value $V_\psi$. Theoretically, we could estimate the state value from the action value, but learning it separately stabilizes training in practice.
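As a concrete picture of this setup, here's a minimal sketch of the three function approximators in PyTorch; the layer sizes, dimensions, and names are illustrative assumptions, not part of SAC itself.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Small two-hidden-layer MLP used for every function approximator below.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim = 17, 6                 # hypothetical continuous-control task

policy_net = mlp(obs_dim, 2 * act_dim)   # pi_theta: outputs a Gaussian mean and log-std per action dim
q_net = mlp(obs_dim + act_dim, 1)        # Q_phi(s, a): takes the concatenated state-action pair
v_net = mlp(obs_dim, 1)                  # V_psi(s)
```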

Following the entropy-regularized definitions for the value functions, we use samples from a replay buffer $\mathcal{D}$ to optimize the following for a given policy:

$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \tfrac{1}{2} \Big( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\theta} \big[ Q_\phi(s_t, a_t) - \alpha \log \pi_\theta(a_t \mid s_t) \big] \Big)^2 \right]$$

$$J_Q(\phi) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \tfrac{1}{2} \Big( Q_\phi(s_t, a_t) - r(s_t, a_t) - \gamma \, \mathbb{E}_{s_{t+1}} \big[ V_{\bar{\psi}}(s_{t+1}) \big] \Big)^2 \right]$$

where $\alpha$ is the entropy coefficient and $\bar{\psi}$ denotes a slowly-updated target network.
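As a rough sketch of these two regressions in PyTorch (assuming a batch dict of tensors, callables `q_net`, `v_net`, `target_v_net` whose outputs share a shape, and a `policy` object with a reparameterized `sample` method returning actions and matching log-probs; all of these names and signatures are assumptions):

```python
import torch
import torch.nn.functional as F

def value_losses(batch, policy, q_net, v_net, target_v_net, alpha=0.2, gamma=0.99):
    """Compute J_V and J_Q for one replay batch."""
    obs, act, rew, next_obs, done = (
        batch["obs"], batch["act"], batch["rew"], batch["next_obs"], batch["done"]
    )

    # J_V: regress V(s) toward a one-sample estimate of E_{a~pi}[Q(s, a) - alpha * log pi(a|s)].
    new_act, logp = policy.sample(obs)          # fresh action from the current policy
    v_target = q_net(obs, new_act) - alpha * logp
    v_loss = 0.5 * F.mse_loss(v_net(obs), v_target.detach())

    # J_Q: regress Q(s, a) toward the bootstrapped target r + gamma * V(s'),
    # masking the bootstrap at terminal states.
    q_target = rew + gamma * (1.0 - done) * target_v_net(next_obs)
    q_loss = 0.5 * F.mse_loss(q_net(obs, act), q_target.detach())

    return v_loss, q_loss
```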
To update the policy from our values, we move it toward an exponential function of the action-value by measuring the "difference" via ✂️ KL Divergence,

$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid s_t) \;\Bigg\|\; \frac{\exp\!\big( \tfrac{1}{\alpha} Q_\phi(s_t, \cdot) \big)}{Z_\phi(s_t)} \right) \right].$$
Since $\pi_\theta$ is implemented as a differentiable neural network, we can optimize this with the 🪄 Reparameterization Trick; that is, we'll parameterize the policy as

$$a_t = f_\theta(\epsilon_t; s_t), \qquad \epsilon_t \sim \mathcal{N}(0, I),$$
which allows us to rewrite

$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}} \big[ \alpha \log \pi_\theta\!\left( f_\theta(\epsilon_t; s_t) \mid s_t \right) - Q_\phi\!\left( s_t, f_\theta(\epsilon_t; s_t) \right) \big],$$
which we can directly differentiate and minimize. Note that the partition term $Z_\phi(s_t)$ is dropped since it doesn't depend on $\theta$ and therefore doesn't contribute to the gradient.
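Here's a sketch of that reparameterized policy loss using a tanh-squashed Gaussian, which is one common parameterization of $f_\theta$; the squashing, its log-prob correction, and the `policy_net`/`q_net` interfaces are assumptions:

```python
import math
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def policy_loss(obs, policy_net, q_net, alpha=0.2):
    """alpha * log pi(a|s) - Q(s, a), with the action drawn via the reparameterization trick."""
    mean, log_std = policy_net(obs).chunk(2, dim=-1)   # hypothetical head: Gaussian mean and log-std
    std = log_std.clamp(-20, 2).exp()

    dist = Normal(mean, std)
    pre_tanh = dist.rsample()              # reparameterized draw: mean + std * eps, eps ~ N(0, I)
    action = torch.tanh(pre_tanh)          # squash into the action bounds [-1, 1]

    # Log-prob under the squashed distribution (numerically stable tanh change-of-variables term).
    logp = dist.log_prob(pre_tanh).sum(-1)
    logp = logp - (2 * (math.log(2.0) - pre_tanh - F.softplus(-2 * pre_tanh))).sum(-1)

    q_val = q_net(obs, action).squeeze(-1)             # hypothetical Q(s, a) callable
    return (alpha * logp - q_val).mean()               # gradients reach theta through `action` and `logp`
```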

An alternate interpretation of the above objective is that we're directly maximizing our state-value: since

$$V^{\pi_\theta}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta} \big[ Q_\phi(s_t, a_t) - \alpha \log \pi_\theta(a_t \mid s_t) \big],$$

maximizing this expectation over states is exactly the same as minimizing our objective above.
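Concretely, writing this expectation with the same reparameterized samples as above,

$$\arg\max_\theta \, \mathbb{E}_{s_t \sim \mathcal{D}} \big[ V^{\pi_\theta}(s_t) \big] = \arg\min_\theta \, \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}} \big[ \alpha \log \pi_\theta(f_\theta(\epsilon_t; s_t) \mid s_t) - Q_\phi(s_t, f_\theta(\epsilon_t; s_t)) \big] = \arg\min_\theta J_\pi(\theta).$$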

This update toward the exponential can be theoretically shown to improve our action-values in soft policy iteration, which motivates its use in SAC.

Twin-Q

In practice, to improve stability, we counteract overestimation of the action-value by training two Q-functions $Q_{\phi_1}, Q_{\phi_2}$ with the same objective (similar to ✌️ TD3). We then use the minimum of the two action-value estimates, $\min_{i=1,2} Q_{\phi_i}$, for both the value gradient $\nabla_\psi J_V(\psi)$ and the policy gradient $\nabla_\theta J_\pi(\theta)$.
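In code, this amounts to a single elementwise minimum over the two Q heads before either gradient is taken (a sketch with hypothetical `q_net_1`/`q_net_2` callables):

```python
import torch

def pessimistic_q(q_net_1, q_net_2, obs, act):
    # Evaluate both Q-functions and take the elementwise minimum (clipped double-Q).
    return torch.min(q_net_1(obs, act), q_net_2(obs, act))

# This minimum then replaces the single Q estimate in the losses above, e.g.
#   v_target = pessimistic_q(q1, q2, obs, new_act) - alpha * logp
#   pi_loss  = (alpha * logp - pessimistic_q(q1, q2, obs, new_act)).mean()
```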

More recently, another variant of SAC ditches the state value entirely, instead optimizing the twin Q-functions against target Q-networks obtained by Polyak averaging past parameters. Our action-value objective is

$$J_Q(\phi_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \tfrac{1}{2} \big( Q_{\phi_i}(s_t, a_t) - y(r_t, s_{t+1}) \big)^2 \right],$$

where the target

$$y(r_t, s_{t+1}) = r_t + \gamma \Big( \min_{i=1,2} Q_{\bar{\phi}_i}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\theta(a_{t+1} \mid s_{t+1}) \Big), \qquad a_{t+1} \sim \pi_\theta(\cdot \mid s_{t+1}),$$

and $Q_{\bar{\phi}_i}$ denotes the target Q-functions.
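A sketch of this variant, with both Q losses computed against the shared target $y$ and the target networks refreshed by Polyak averaging after each step (the batch layout, `policy.sample` interface, and coefficients are assumptions):

```python
import torch
import torch.nn.functional as F

def twin_q_losses(batch, policy, q_nets, q_targets, alpha=0.2, gamma=0.99):
    """J_Q for both Q-functions against the bootstrapped target y(r, s')."""
    obs, act, rew, next_obs, done = (
        batch["obs"], batch["act"], batch["rew"], batch["next_obs"], batch["done"]
    )
    with torch.no_grad():
        next_act, next_logp = policy.sample(next_obs)            # a' ~ pi(.|s'), plus log pi(a'|s')
        q_next = torch.min(q_targets[0](next_obs, next_act),
                           q_targets[1](next_obs, next_act))     # min over target Q-functions
        y = rew + gamma * (1.0 - done) * (q_next - alpha * next_logp)
    return [0.5 * F.mse_loss(q(obs, act), y) for q in q_nets]

def polyak_update(net, target_net, tau=0.005):
    # target <- (1 - tau) * target + tau * online, applied after each gradient step.
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```

The target networks would typically start as copies of the online Q-networks and be updated only through `polyak_update`.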

Automatic Temperature

The entropy coefficient $\alpha$ is referred to as the "temperature," and while we can set it manually as a hyperparameter, it's often better to define a desired minimum entropy $\mathcal{H}$ instead and compute $\alpha$ automatically. In this setting, our policy's objective is

$$\max_{\pi_{0:T}} \; \mathbb{E}_{\rho_\pi} \left[ \sum_{t=0}^{T} r(s_t, a_t) \right] \quad \text{s.t.} \quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ -\log \pi_t(a_t \mid s_t) \big] \geq \mathcal{H} \;\; \forall t.$$
We can consider each time step of this objective as a 👠 Constrained Optimization problem. For the final time step, we have the dual

$$\min_{\alpha_T \geq 0} \; \max_{\pi_T} \; \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} \big[ r(s_T, a_T) - \alpha_T \log \pi_T(a_T \mid s_T) \big] - \alpha_T \mathcal{H},$$
where $\alpha_T$ is now a Lagrange multiplier. We can first find the optimal policy $\pi_T^*$ in terms of $\alpha_T$, then find the optimal $\alpha_T^*$. Then, going backwards one step, we observe that once again, the optimal $\alpha_{T-1}$ can be expressed as a similar minimization,

$$\alpha_{T-1}^{*} = \arg\min_{\alpha_{T-1} \geq 0} \; \mathbb{E}_{a_{T-1} \sim \pi_{T-1}^{*}} \big[ -\alpha_{T-1} \log \pi_{T-1}^{*}(a_{T-1} \mid s_{T-1}) - \alpha_{T-1} \mathcal{H} \big].$$
Repeating this minimization going backwards in time is equivalent to finding a global temperature parameter $\alpha$ that minimizes

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_\theta} \big[ -\alpha \log \pi_\theta(a_t \mid s_t) - \alpha \mathcal{H} \big],$$
which is the objective we take a gradient step on at each iteration to automatically adjust our temperature.
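A sketch of that gradient step, parameterizing $\alpha$ through its log so it stays positive (a common implementation choice rather than part of the derivation; the optimizer and learning rate are assumptions):

```python
import torch

target_entropy = -6.0                                  # e.g. -act_dim is a common heuristic for H
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_step(logp):
    """One gradient step on J(alpha), given log pi(a_t|s_t) for fresh policy samples."""
    alpha = log_alpha.exp()
    # J(alpha) = E[-alpha * log pi(a|s) - alpha * H]; log-probs are detached so only alpha updates.
    alpha_loss = -(alpha * (logp.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return alpha.detach()
```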