Soft actor-critic (SAC) is a stochastic off-policy algorithm that combines Entropy Regularization with Actor-Critic. We'll learn a policy $\pi_\phi(a \mid s)$ that maximizes the entropy-regularized return

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big].$$
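To make this concrete, here is a minimal sketch of the kind of stochastic policy this objective is typically optimized over in practice: a tanh-squashed diagonal Gaussian that returns a sampled action together with $\log \pi_\phi(a \mid s)$, which the entropy term needs. This is an illustrative PyTorch sketch; the class name, layer sizes, and clamping bounds are assumptions, not details from the text.

```python
import torch
import torch.nn as nn

class SquashedGaussianPolicy(nn.Module):
    """Tanh-squashed diagonal Gaussian policy (illustrative sketch)."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                        # reparameterized sample
        a = torch.tanh(u)                         # squash actions into [-1, 1]
        # log pi(a|s): Gaussian log-density plus the tanh change-of-variables term
        log_prob = dist.log_prob(u).sum(-1)
        log_prob = log_prob - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, log_prob
```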
Following the entropy-regularized definitions for the value functions, we use samples from a replay buffer $\mathcal{D}$ to minimize the squared residuals

$$J_V(\psi) = \mathbb{E}_{s \sim \mathcal{D}} \Big[ \tfrac{1}{2} \big( V_\psi(s) - \mathbb{E}_{a \sim \pi_\phi}\big[ Q_\theta(s, a) - \alpha \log \pi_\phi(a \mid s) \big] \big)^2 \Big],$$

$$J_Q(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \tfrac{1}{2} \big( Q_\theta(s, a) - \big( r + \gamma V_{\bar{\psi}}(s') \big) \big)^2 \Big],$$

where $V_{\bar{\psi}}$ is a slowly-updated target value network.
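A rough sketch of these two losses on a replay batch, continuing the PyTorch sketch above; `q_net`, `v_net`, and `v_targ` are assumed network handles with the obvious call signatures, not names from the text.

```python
import torch
import torch.nn.functional as F

def value_losses(batch, policy, q_net, v_net, v_targ, alpha, gamma=0.99):
    """Soft V and Q regression losses on a replay-buffer batch (sketch)."""
    s, a, r, s2, done = batch                     # tensors sampled from the buffer

    # V target: one-sample estimate of E_{a~pi}[ Q(s,a) - alpha * log pi(a|s) ]
    with torch.no_grad():
        a_pi, log_pi = policy(s)
        v_target = q_net(s, a_pi) - alpha * log_pi
    v_loss = F.mse_loss(v_net(s), v_target)

    # Q target: r + gamma * V_targ(s'), cut off at terminal states
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * v_targ(s2)
    q_loss = F.mse_loss(q_net(s, a), q_target)
    return v_loss, q_loss
```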
To update a policy from our values, we move it toward an exponential function defined by the action-value, measuring the "difference" via KL Divergence,

$$\pi_{\text{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s) \;\Big\|\; \frac{\exp\!\big(\tfrac{1}{\alpha} Q^{\pi_{\text{old}}}(s, \cdot)\big)}{Z^{\pi_{\text{old}}}(s)} \right).$$

Since

$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{a \sim p}\big[ \log p(a) - \log q(a) \big],$$

we can rewrite this objective (scaled by $\alpha$) as

$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi} \big[ \alpha \log \pi_\phi(a \mid s) - Q_\theta(s, a) + \alpha \log Z^{\pi_{\text{old}}}(s) \big],$$

which we can directly differentiate and minimize. Note that the partition term $Z^{\pi_{\text{old}}}(s)$ does not depend on $\phi$, so it contributes nothing to the gradient and can be dropped.

An alternate interpretation of the above objective is to directly maximize our state-value: then, since

$$V(s) = \mathbb{E}_{a \sim \pi_\phi}\big[ Q_\theta(s, a) - \alpha \log \pi_\phi(a \mid s) \big],$$

maximizing this is exactly the same as minimizing our objective $J_\pi(\phi)$.
This update toward the exponential can be theoretically shown to improve our action-values in soft policy iteration, which motivates its use in SAC.
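In a reparameterized implementation this policy update is a one-liner; here is a hedged sketch continuing the earlier PyTorch interfaces (the action returned by `policy` must be differentiable, e.g. via `rsample`, so the Q term passes gradients back to $\phi$):

```python
def policy_loss(states, policy, q_net, alpha):
    """Minimize E[ alpha * log pi(a|s) - Q(s,a) ] over replay states (sketch)."""
    a_pi, log_pi = policy(states)              # reparameterized action sample
    q_pi = q_net(states, a_pi)
    return (alpha * log_pi - q_pi).mean()      # partition term already dropped
```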
Twin-Q
In practice, to improve stability, we counteract action-value overestimation by training two Q-functions on the same objective with independent parameters, and taking the minimum of the two, $\min\big(Q_{\theta_1}(s, a),\, Q_{\theta_2}(s, a)\big)$, in the value and policy gradients.
More recently, another variant of SAC ditches the state value entirely, instead optimizing the twin Q-functions via target Q-networks obtained by Polyak averaging past parameters. Our action-value objective is

$$J_Q(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( Q_{\theta_i}(s, a) - y(r, s') \big)^2 \Big],$$

where the target

$$y(r, s') = r + \gamma \Big( \min_{j = 1, 2} Q_{\bar{\theta}_j}(s', a') - \alpha \log \pi_\phi(a' \mid s') \Big),$$

and $a' \sim \pi_\phi(\cdot \mid s')$ is sampled fresh from the current policy rather than taken from the replay buffer.
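A hedged sketch of this critic update, assuming PyTorch, the policy sketch above, and assumed network handles `q1`, `q2`, `q1_targ`, `q2_targ`; the Polyak coefficient `tau` is an illustrative value, not one given in the text.

```python
import torch
import torch.nn.functional as F

def q_losses(batch, policy, q1, q2, q1_targ, q2_targ, alpha, gamma=0.99):
    """Twin-Q losses against the shared target y(r, s') (sketch)."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        a2, log_pi2 = policy(s2)                  # a' ~ pi(.|s'), freshly sampled
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1 - done) * (q_min - alpha * log_pi2)
    return F.mse_loss(q1(s, a), y), F.mse_loss(q2(s, a), y)

@torch.no_grad()
def polyak_update(net, targ, tau=0.005):
    """theta_targ <- (1 - tau) * theta_targ + tau * theta, applied after each step."""
    for p, p_targ in zip(net.parameters(), targ.parameters()):
        p_targ.mul_(1 - tau).add_(tau * p)
```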
Automatic Temperature
The entropy coefficient $\alpha$ trades off reward against entropy, and a fixed value is hard to choose, so we'd like to learn it automatically. To do so, we recast maximum-entropy RL as maximizing expected return subject to a minimum expected-entropy constraint:

$$\max_{\pi}\; \mathbb{E}_{\rho_\pi}\Big[ \sum_t r(s_t, a_t) \Big] \quad \text{s.t.} \quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ -\log \pi_t(a_t \mid s_t) \big] \geq \bar{\mathcal{H}} \;\; \text{for all } t.$$

We can consider each time step of this objective as a Constrained Optimization problem. For the final time step, we have the dual

$$\min_{\alpha_T \geq 0}\, \max_{\pi_T}\; \mathbb{E}_{(s_T, a_T)}\big[ r(s_T, a_T) - \alpha_T \log \pi_T(a_T \mid s_T) \big] - \alpha_T \bar{\mathcal{H}},$$

where $\alpha_T$ is the dual variable (the temperature) enforcing the final-step entropy constraint. Repeating this minimization going backwards in time is equivalent to finding a global temperature parameter that minimizes

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\big[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \big],$$

which is the objective we take a gradient step on every iteration to automatically adjust our temperature.
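In code this adds one more scalar loss per update step. A small sketch under the same PyTorch assumptions; learning $\log \alpha$ (so the temperature stays positive) and setting $\bar{\mathcal{H}} = -|\mathcal{A}|$ are common implementation choices, not details given in the text:

```python
import torch

act_dim = 6                                        # action dimension (placeholder)
target_entropy = -float(act_dim)                   # common heuristic for H_bar
log_alpha = torch.zeros(1, requires_grad=True)     # learn log(alpha), keep alpha > 0
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_loss(states, policy):
    """J(alpha) estimated as E[ -alpha * (log pi(a|s) + H_bar) ] (sketch)."""
    with torch.no_grad():
        _, log_pi = policy(states)                 # entropy estimate from policy samples
    return -(log_alpha.exp() * (log_pi + target_entropy)).mean()
```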