Soft actor-critic (SAC) is a stochastic off-policy algorithm that combines Entropy Regularization with Actor-Critic. We'll learn a policy $\pi_\phi(a \mid s)$ that maximizes the entropy-regularized return

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big].$$
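To make this concrete, here is a minimal sketch of the kind of stochastic policy this objective is typically optimized over in practice: a tanh-squashed diagonal Gaussian that returns a sampled action together with $\log \pi_\phi(a \mid s)$, which the entropy term needs. This is an illustrative PyTorch sketch; the class name, layer sizes, and clamping bounds are assumptions, not details from the text.

```python
import torch
import torch.nn as nn

class SquashedGaussianPolicy(nn.Module):
    """Tanh-squashed diagonal Gaussian policy (illustrative sketch)."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                        # reparameterized sample
        a = torch.tanh(u)                         # squash actions into [-1, 1]
        # log pi(a|s): Gaussian log-density plus the tanh change-of-variables term
        log_prob = dist.log_prob(u).sum(-1)
        log_prob = log_prob - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, log_prob
```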
Following the entropy-regularized definitions for the value functions, we use samples from a replay buffer $\mathcal{D}$ to minimize the squared residuals

$$J_V(\psi) = \mathbb{E}_{s \sim \mathcal{D}} \Big[ \tfrac{1}{2} \big( V_\psi(s) - \mathbb{E}_{a \sim \pi_\phi}\big[ Q_\theta(s, a) - \alpha \log \pi_\phi(a \mid s) \big] \big)^2 \Big],$$

$$J_Q(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \tfrac{1}{2} \big( Q_\theta(s, a) - \big( r + \gamma V_{\bar{\psi}}(s') \big) \big)^2 \Big],$$

where $V_{\bar{\psi}}$ is a slowly-updated target value network.
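A rough sketch of these two losses on a replay batch, continuing the PyTorch sketch above; `q_net`, `v_net`, and `v_targ` are assumed network handles with the obvious call signatures, not names from the text.

```python
import torch
import torch.nn.functional as F

def value_losses(batch, policy, q_net, v_net, v_targ, alpha, gamma=0.99):
    """Soft V and Q regression losses on a replay-buffer batch (sketch)."""
    s, a, r, s2, done = batch                     # tensors sampled from the buffer

    # V target: one-sample estimate of E_{a~pi}[ Q(s,a) - alpha * log pi(a|s) ]
    with torch.no_grad():
        a_pi, log_pi = policy(s)
        v_target = q_net(s, a_pi) - alpha * log_pi
    v_loss = F.mse_loss(v_net(s), v_target)

    # Q target: r + gamma * V_targ(s'), cut off at terminal states
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * v_targ(s2)
    q_loss = F.mse_loss(q_net(s, a), q_target)
    return v_loss, q_loss
```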
To update a policy from our values, we move it toward an exponential function defined by the action-value, measuring the "difference" via KL Divergence,

$$\pi_{\text{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s) \;\Big\|\; \frac{\exp\!\big(\tfrac{1}{\alpha} Q^{\pi_{\text{old}}}(s, \cdot)\big)}{Z^{\pi_{\text{old}}}(s)} \right).$$

Since

$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{a \sim p}\big[ \log p(a) - \log q(a) \big],$$

we can rewrite this objective (scaled by $\alpha$) as

$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi} \big[ \alpha \log \pi_\phi(a \mid s) - Q_\theta(s, a) + \alpha \log Z^{\pi_{\text{old}}}(s) \big],$$

which we can directly differentiate and minimize. Note that the partition term $Z^{\pi_{\text{old}}}(s)$ does not depend on $\phi$, so it contributes nothing to the gradient and can be dropped.

An alternate interpretation of the above objective is to directly maximize our state-value: then, since

$$V(s) = \mathbb{E}_{a \sim \pi_\phi}\big[ Q_\theta(s, a) - \alpha \log \pi_\phi(a \mid s) \big],$$

maximizing this is exactly the same as minimizing our objective $J_\pi(\phi)$.
This update toward the exponential can be theoretically shown to improve our action-values in soft policy iteration, which motivates its use in SAC.
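In a reparameterized implementation this policy update is a one-liner; here is a hedged sketch continuing the earlier PyTorch interfaces (the action returned by `policy` must be differentiable, e.g. via `rsample`, so the Q term passes gradients back to $\phi$):

```python
def policy_loss(states, policy, q_net, alpha):
    """Minimize E[ alpha * log pi(a|s) - Q(s,a) ] over replay states (sketch)."""
    a_pi, log_pi = policy(states)              # reparameterized action sample
    q_pi = q_net(states, a_pi)
    return (alpha * log_pi - q_pi).mean()      # partition term already dropped
```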
Twin-Q
In practice, to improve stability, we counteract action-value overestimation by training two Q-functions on the same objective with independent parameters, and taking the minimum of the two, $\min\big(Q_{\theta_1}(s, a),\, Q_{\theta_2}(s, a)\big)$, in the value and policy gradients.
More recently, another variant of SAC ditches the state value entirely, instead optimizing the twin Q-functions via target Q-networks obtained by Polyak averaging past parameters. Our action-value objective is

$$J_Q(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( Q_{\theta_i}(s, a) - y(r, s') \big)^2 \Big],$$

where the target

$$y(r, s') = r + \gamma \Big( \min_{j = 1, 2} Q_{\bar{\theta}_j}(s', a') - \alpha \log \pi_\phi(a' \mid s') \Big),$$

and $a' \sim \pi_\phi(\cdot \mid s')$ is sampled fresh from the current policy rather than taken from the replay buffer.
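A hedged sketch of this critic update, assuming PyTorch, the policy sketch above, and assumed network handles `q1`, `q2`, `q1_targ`, `q2_targ`; the Polyak coefficient `tau` is an illustrative value, not one given in the text.

```python
import torch
import torch.nn.functional as F

def q_losses(batch, policy, q1, q2, q1_targ, q2_targ, alpha, gamma=0.99):
    """Twin-Q losses against the shared target y(r, s') (sketch)."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        a2, log_pi2 = policy(s2)                  # a' ~ pi(.|s'), freshly sampled
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1 - done) * (q_min - alpha * log_pi2)
    return F.mse_loss(q1(s, a), y), F.mse_loss(q2(s, a), y)

@torch.no_grad()
def polyak_update(net, targ, tau=0.005):
    """theta_targ <- (1 - tau) * theta_targ + tau * theta, applied after each step."""
    for p, p_targ in zip(net.parameters(), targ.parameters()):
        p_targ.mul_(1 - tau).add_(tau * p)
```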
Automatic Temperature
The entropy coefficient $\alpha$ trades off reward against entropy, and a fixed value is hard to choose, so we'd like to learn it automatically. To do so, we recast maximum-entropy RL as maximizing expected return subject to a minimum expected-entropy constraint:

$$\max_{\pi}\; \mathbb{E}_{\rho_\pi}\Big[ \sum_t r(s_t, a_t) \Big] \quad \text{s.t.} \quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ -\log \pi_t(a_t \mid s_t) \big] \geq \bar{\mathcal{H}} \;\; \text{for all } t.$$

We can consider each time step of this objective as a Constrained Optimization problem. For the final time step, we have the dual

$$\min_{\alpha_T \geq 0}\, \max_{\pi_T}\; \mathbb{E}_{(s_T, a_T)}\big[ r(s_T, a_T) - \alpha_T \log \pi_T(a_T \mid s_T) \big] - \alpha_T \bar{\mathcal{H}},$$

where $\alpha_T$ is the dual variable (the temperature) enforcing the final-step entropy constraint. Repeating this minimization going backwards in time is equivalent to finding a global temperature parameter that minimizes

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\big[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \big],$$

which is the objective we take a gradient step on every iteration to automatically adjust our temperature.
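In code this adds one more scalar loss per update step. A small sketch under the same PyTorch assumptions; learning $\log \alpha$ (so the temperature stays positive) and setting $\bar{\mathcal{H}} = -|\mathcal{A}|$ are common implementation choices, not details given in the text:

```python
import torch

act_dim = 6                                        # action dimension (placeholder)
target_entropy = -float(act_dim)                   # common heuristic for H_bar
log_alpha = torch.zeros(1, requires_grad=True)     # learn log(alpha), keep alpha > 0
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_loss(states, policy):
    """J(alpha) estimated as E[ -alpha * (log pi(a|s) + H_bar) ] (sketch)."""
    with torch.no_grad():
        _, log_pi = policy(states)                 # entropy estimate from policy samples
    return -(log_alpha.exp() * (log_pi + target_entropy)).mean()
```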