Control as inference is a probabilistic framework that recasts reinforcement learning as inference in a 🪩 Probabilistic Graphical Model. Its key advantage is that, unlike standard optimal control, it produces "soft" policies that can explain stochastic behavior and model suboptimal behavior as well.

To start, we describe a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ via a graphical model with states $s_t$, actions $a_t$, and dynamics $p(s_{t+1} \mid s_t, a_t)$, where each timestep also carries a variable $\mathcal{O}_t$, depending on $(s_t, a_t)$, that represents "optimality."

The reason we have optimality variables is that a graph with only $s_t$ and $a_t$ would simply describe the environment's dynamics. To incorporate some notion of "good" and "bad," we use the optimality variable $\mathcal{O}_t$, which can be thought of as a binary variable that's $1$ if the timestep is optimal and $0$ otherwise. We can then define the probability of optimality as

$$p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big),$$

which naturally leads to an optimal trajectory distribution

$$p(\tau \mid \mathcal{O}_{1:T}) \propto p(\tau) \exp\left( \sum_{t=1}^{T} r(s_t, a_t) \right).$$
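To make this concrete, here is a minimal numpy sketch (the two-state MDP, its rewards, and the horizon are invented purely for illustration) that enumerates every trajectory of a tiny MDP, weights each one by $\exp\left(\sum_t r(s_t, a_t)\right)$, and normalizes to obtain the optimal trajectory distribution.

```python
import itertools
import numpy as np

# Toy MDP with invented numbers: 2 states, 2 actions, horizon 3.
n_s, n_a, T = 2, 2, 3
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] = p(s' | s, a)
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[0.0, -1.0],                 # r[s, a]; rewards <= 0 so exp(r) <= 1
              [-2.0, -0.5]])
p_a = np.full(n_a, 1.0 / n_a)              # uniform action prior p(a | s)
p_s1 = np.array([1.0, 0.0])                # initial state distribution

# Enumerate all trajectories tau = (s_1, a_1, ..., s_T, a_T): compute p(tau)
# under the prior and the optimality weight exp(sum_t r(s_t, a_t)).
probs, weights = [], []
for states in itertools.product(range(n_s), repeat=T):
    for actions in itertools.product(range(n_a), repeat=T):
        p_tau = p_s1[states[0]]
        for t in range(T):
            p_tau *= p_a[actions[t]]
            if t + 1 < T:
                p_tau *= P[states[t], actions[t], states[t + 1]]
        probs.append(p_tau)
        weights.append(np.exp(sum(r[states[t], actions[t]] for t in range(T))))

probs, weights = np.array(probs), np.array(weights)
p_opt = probs * weights / np.sum(probs * weights)   # p(tau | O_{1:T})
print("most likely trajectory under optimality:", np.argmax(p_opt))
```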

Now that this model gives us an optimal trajectory distribution, we can perform inference within the graph to derive optimal policies. That is, our policy is

$$\pi(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{1:T}).$$
This can be done either through direct inference or variational inference.

Direct Inference

In direct inference, we take inspiration from the Forward-Backward Algorithm and use a similar message-passing technique.

  1. First, we compute the backward messages $\beta_t(s_t, a_t)$ and $\beta_t(s_t)$.
  2. Using the backward messages, we can then recover the policy $\pi(a_t \mid s_t)$.
  3. For some applications, it's also useful to find the forward messages $\alpha_t(s_t)$.

Backward Message

A backward state-action message is defined as

$$\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t).$$

Expanding out this definition gives us

$$\beta_t(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ p(\mathcal{O}_{t+1:T} \mid s_{t+1}) \big]\, p(\mathcal{O}_t \mid s_t, a_t),$$

and the first term, the backward state message, is

$$\beta_{t+1}(s_{t+1}) = p(\mathcal{O}_{t+1:T} \mid s_{t+1}) = \int p(\mathcal{O}_{t+1:T} \mid s_{t+1}, a_{t+1})\, p(a_{t+1} \mid s_{t+1})\, da_{t+1}.$$

We can assume that $p(a_{t+1} \mid s_{t+1})$ (actions taken without optimality in mind) is uniform, and noting that the first term in the integrand is the previous backward message, we have

$$\beta_{t+1}(s_{t+1}) \propto \int \beta_{t+1}(s_{t+1}, a_{t+1})\, da_{t+1}.$$

The backward message passing thus amounts to alternating between computing the two messages, starting from $t = T$ down to $t = 1$ (with $\beta_{T+1} \equiv 1$):

  1. $\beta_t(s_t, a_t) = p(\mathcal{O}_t \mid s_t, a_t)\, \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ \beta_{t+1}(s_{t+1}) \big]$
  2. $\beta_t(s_t) = \mathbb{E}_{a_t \sim p(a_t \mid s_t)}\big[ \beta_t(s_t, a_t) \big]$

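As a concrete sketch of this recursion, the following numpy snippet (the random tabular MDP and the uniform action prior are assumptions for illustration) alternates between the two messages from the last timestep backward.

```python
import numpy as np

# Invented random tabular MDP: 3 states, 2 actions, horizon 5.
n_s, n_a, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] = p(s' | s, a)
r = -rng.uniform(size=(n_s, n_a))                  # rewards <= 0, so exp(r) is a valid probability

beta_sa = np.zeros((T, n_s, n_a))                  # beta_t(s, a) = p(O_{t:T} | s, a)
beta_s = np.zeros((T, n_s))                        # beta_t(s)    = p(O_{t:T} | s)

for t in reversed(range(T)):
    # 1. beta_t(s, a) = p(O_t | s, a) * E_{s' ~ p(.|s,a)}[beta_{t+1}(s')], with beta_{T+1} = 1.
    cont = np.ones(n_s) if t == T - 1 else beta_s[t + 1]
    beta_sa[t] = np.exp(r) * np.einsum("ijk,k->ij", P, cont)
    # 2. beta_t(s) = E_{a ~ uniform}[beta_t(s, a)]
    beta_s[t] = beta_sa[t].mean(axis=1)

print("beta_1(s):", beta_s[0])
```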
Though the above equations don't seem very intuitive, we can recast them in the form of 💎 Value Iteration if we let

$$V_t(s_t) = \log \beta_t(s_t), \qquad Q_t(s_t, a_t) = \log \beta_t(s_t, a_t).$$

Then, the above steps are the following:

  1. $Q_t(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ \exp\big(V_{t+1}(s_{t+1})\big) \big]$
  2. $V_t(s_t) = \log \int \exp\big(Q_t(s_t, a_t)\big)\, da_t$

Note that rather than computing a hard maximum, the log of an expectation (or integral) of exponentials computes a "soft" maximum (not to be confused with softmax).
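The same backward pass can be run in log space, where it reads exactly like value iteration with a soft max; this sketch (same kind of invented random tabular MDP as above) uses `scipy.special.logsumexp` for the soft max over actions.

```python
import numpy as np
from scipy.special import logsumexp

n_s, n_a, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s' | s, a)
r = -rng.uniform(size=(n_s, n_a))

V = np.zeros((T + 1, n_s))                          # V_{T+1}(s) = 0, i.e. beta_{T+1} = 1
Q = np.zeros((T, n_s, n_a))
for t in reversed(range(T)):
    # Q_t(s, a) = r(s, a) + log E_{s'}[exp(V_{t+1}(s'))]: a "soft" max over next states.
    Q[t] = r + np.log(np.einsum("ijk,k->ij", P, np.exp(V[t + 1])))
    # V_t(s) = log mean_a exp(Q_t(s, a)): a "soft" max over actions under a uniform prior.
    V[t] = logsumexp(Q[t], axis=1) - np.log(n_a)

print("soft values V_1(s):", V[0])
```

The $-\log n_a$ term only accounts for the uniform action prior; it shifts $V_t$ by a constant, and the normalized policy $\propto \exp(Q_t)$ is unaffected.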

Policy Derivation

Using the backward messages, we can derive, through 🪙 Bayes' Theorem,

$$\pi(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{t:T}) = \frac{p(\mathcal{O}_{t:T} \mid s_t, a_t)\, p(a_t \mid s_t)}{p(\mathcal{O}_{t:T} \mid s_t)} \propto \frac{\beta_t(s_t, a_t)}{\beta_t(s_t)}$$

(past optimality variables are irrelevant given $s_t$, and the uniform $p(a_t \mid s_t)$ is absorbed into the proportionality). Note that in the context of our value functions, this is equivalent to

$$\pi(a_t \mid s_t) = \exp\big(Q_t(s_t, a_t) - V_t(s_t)\big),$$
which essentially gives better actions an exponentially better probability.
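As a quick sketch (the soft $Q$-values below are made-up numbers for a single timestep), the policy is exactly a softmax over the soft $Q$-values once it is normalized:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical soft Q-values for one timestep: 3 states x 2 actions.
Q = np.array([[ 0.0, -1.0],
              [-0.5, -0.3],
              [-2.0,  0.0]])

V = logsumexp(Q, axis=1, keepdims=True)   # normalizer: V(s) = log sum_a exp(Q(s, a))
pi = np.exp(Q - V)                        # pi(a | s) = exp(Q(s, a) - V(s))

print(pi)                                 # each row is a distribution over actions
print(pi.sum(axis=1))                     # rows sum to 1
```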

Forward Message

A forward message $\alpha_t(s_t) = p(s_t \mid \mathcal{O}_{1:t-1})$ gives us more insight into the trajectories we'll reach given optimality. Some lengthy algebra gives us

$$\alpha_t(s_t) = \int p(s_t \mid s_{t-1}, a_{t-1})\, p(a_{t-1} \mid s_{t-1}, \mathcal{O}_{t-1})\, \alpha_{t-1}(s_{t-1})\, ds_{t-1}\, da_{t-1},$$

where

$$p(a_{t-1} \mid s_{t-1}, \mathcal{O}_{t-1}) = \frac{p(\mathcal{O}_{t-1} \mid s_{t-1}, a_{t-1})\, p(a_{t-1} \mid s_{t-1})}{p(\mathcal{O}_{t-1} \mid s_{t-1})}, \qquad \alpha_1(s_1) = p(s_1).$$

More importantly, if we now compute the state marginal under optimality, we find

$$p(s_t \mid \mathcal{O}_{1:T}) \propto \beta_t(s_t)\, \alpha_t(s_t).$$
Intuitively, this tells us that the state distribution is the intersection of states with high probability of reaching the goal (backward) and states with high probability of originating from the initial state (forward).
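Here is a minimal sketch of the forward pass and the resulting state marginals (again an invented random tabular MDP with a uniform action prior); it recomputes the backward state messages so that $\alpha_t(s_t)\,\beta_t(s_t)$ can be normalized at each timestep.

```python
import numpy as np

n_s, n_a, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))      # p(s' | s, a)
r = -rng.uniform(size=(n_s, n_a))
p_s1 = np.array([1.0, 0.0, 0.0])                       # initial state distribution

# Backward state messages beta_t(s), as in the recursion above (uniform action prior).
beta_s = np.zeros((T, n_s))
for t in reversed(range(T)):
    cont = np.ones(n_s) if t == T - 1 else beta_s[t + 1]
    beta_s[t] = (np.exp(r) * np.einsum("ijk,k->ij", P, cont)).mean(axis=1)

# Forward messages alpha_t(s_t) = p(s_t | O_{1:t-1}).
# p(a | s, O) in the recursion is a softmax of the immediate reward (uniform prior).
pa_given_O = np.exp(r) / np.exp(r).sum(axis=1, keepdims=True)
alpha = np.zeros((T, n_s))
alpha[0] = p_s1
for t in range(1, T):
    alpha[t] = np.einsum("ijk,ij,i->k", P, pa_given_O, alpha[t - 1])

# State marginal under optimality: p(s_t | O_{1:T}) is proportional to alpha_t(s_t) * beta_t(s_t).
marginal = alpha * beta_s
marginal /= marginal.sum(axis=1, keepdims=True)
print(marginal)        # one row (a distribution over states) per timestep
```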

Variational Inference

Unfortunately, one problem with the above approach is that

$$Q_t(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ \exp\big(V_{t+1}(s_{t+1})\big) \big]$$

is too optimistic. That is, it's taking a "soft" max over the future states, which are stochastic; thus, this value assumes the future state will also turn out to be optimal, which is out of our control and completely up to luck. This problem stems from the inference setup itself: the transition marginal is

$$p(s_{t+1} \mid s_t, a_t, \mathcal{O}_{1:T}) \propto p(s_{t+1} \mid s_t, a_t)\, \beta_{t+1}(s_{t+1}),$$

which assumes that future states are optimal and thereby gives the "lucky" version of our true environment dynamics $p(s_{t+1} \mid s_t, a_t)$, which is not conditioned on $\mathcal{O}_{1:T}$.
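To see the optimism concretely, here is a tiny sketch with made-up numbers: conditioning the dynamics on future optimality shifts probability mass toward a rare but high-value successor state.

```python
import numpy as np

# One (s, a) pair with two possible successor states (hypothetical numbers):
# a likely "safe" state with low value and a rare "jackpot" state with high value.
p_next = np.array([0.99, 0.01])        # true dynamics p(s' | s, a)
V_next = np.array([-5.0, 0.0])         # soft values V_{t+1}(s') of the successors
beta_next = np.exp(V_next)             # beta_{t+1}(s') = exp(V_{t+1}(s'))

# Posterior ("lucky") dynamics: proportional to p(s' | s, a) * beta_{t+1}(s').
posterior = p_next * beta_next
posterior /= posterior.sum()

print("true dynamics:     ", p_next)
print("posterior dynamics:", posterior)
```

With these numbers, the rare state's probability jumps from 1% under the true dynamics to roughly 60% under the posterior dynamics, even though the agent has no control over which successor it gets.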

What we really want is to find another distribution $q(s_{1:T}, a_{1:T})$ that's close to $p(s_{1:T}, a_{1:T} \mid \mathcal{O}_{1:T})$ but that uses the true dynamics $p(s_{t+1} \mid s_t, a_t)$. In this reframing, we have an approximation problem; if we let $x = \mathcal{O}_{1:T}$ and $z = (s_{1:T}, a_{1:T})$, we see that our problem reduces to finding a $q(z)$ that approximates $p(z \mid x)$, which can be done via variational inference.

First, to enforce the dynamics, we define

$$q(s_{1:T}, a_{1:T}) = p(s_1) \left[ \prod_{t=1}^{T-1} p(s_{t+1} \mid s_t, a_t) \right] \prod_{t=1}^{T} q(a_t \mid s_t),$$

where $q(a_t \mid s_t)$, our policy, is the only part of the distribution we can control. Applying the 🧬 Evidence Lower Bound, we have

$$\log p(\mathcal{O}_{1:T}) \geq \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q}\left[ \sum_{t=1}^{T} r(s_t, a_t) + \mathcal{H}\big(q(a_t \mid s_t)\big) \right].$$

Tightening the bound by maximizing this quantity is thereby equivalent to maximizing reward and action entropy. From here, it can be shown that the policy that maximizes this value is

$$\pi(a_t \mid s_t) = q(a_t \mid s_t) = \exp\big(Q_t(s_t, a_t) - V_t(s_t)\big),$$

where our value functions are still "soft,"

$$Q_t(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ V_{t+1}(s_{t+1}) \big], \qquad V_t(s_t) = \log \int \exp\big(Q_t(s_t, a_t)\big)\, da_t.$$
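Here is a sketch of the corrected backup implied by this result (same invented tabular-MDP setup as earlier): the backup over the dynamics is now an ordinary expectation of $V_{t+1}$, while the soft max over actions and the policy $\exp(Q_t - V_t)$ remain.

```python
import numpy as np
from scipy.special import logsumexp

n_s, n_a, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s' | s, a)
r = -rng.uniform(size=(n_s, n_a))

V = np.zeros((T + 1, n_s))                          # V_{T+1}(s) = 0
Q = np.zeros((T, n_s, n_a))
pi = np.zeros((T, n_s, n_a))
for t in reversed(range(T)):
    # Q_t(s, a) = r(s, a) + E_{s'}[V_{t+1}(s')]: plain expectation over the true dynamics.
    Q[t] = r + np.einsum("ijk,k->ij", P, V[t + 1])
    # V_t(s) = log sum_a exp(Q_t(s, a)): the soft max over actions stays.
    V[t] = logsumexp(Q[t], axis=1)
    # Optimal variational policy: q(a | s) = exp(Q_t(s, a) - V_t(s)).
    pi[t] = np.exp(Q[t] - V[t][:, None])

print(pi[0])            # each row sums to 1
```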
These definitions for the value functions and policy can be immediately substituted into 🚀 Q-Learning and 🚓 Policy Gradients for their soft optimality variants, bringing the benefits of improved exploration, easier fine-tuning, and stronger robustness. Using a similar idea gives us 🎲 Entropy Regularization, which forms the basis of 🪶 Soft Actor-Critic.
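As one illustration of that substitution, here is a minimal sketch of a soft Q-learning style update, assuming a discounted infinite-horizon setting with a tabular $Q$; the discount, learning rate, and the single transition tuple are all made up for illustration.

```python
import numpy as np
from scipy.special import logsumexp

n_s, n_a = 4, 2
gamma, lr = 0.99, 0.1
Q = np.zeros((n_s, n_a))             # tabular soft Q-values

# One hypothetical transition (s, a, reward, s').
s, a, reward, s_next = 0, 1, -0.5, 2

# Soft optimality replaces max_a' Q(s', a') with the soft max logsumexp_a' Q(s', a').
target = reward + gamma * logsumexp(Q[s_next])
Q[s, a] += lr * (target - Q[s, a])

# The corresponding soft policy is a softmax over the Q-values.
pi_s = np.exp(Q[s] - logsumexp(Q[s]))
print(target, pi_s)
```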