Control as inference is a probabilistic framework that recasts reinforcement learning through the lens of inference in a Probabilistic Graphical Model. The key advantage of this framework is that, unlike standard optimal control methods, it gives us "soft" policies that explain stochastic behavior and can model suboptimal behavior as well.
To start, we describe a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ via a graphical model with states $s_t$, actions $a_t$, and binary optimality variables $\mathcal{O}_t$, where
$$p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big).$$
The reason we have optimality variables is that a graph with only states and actions captures the dynamics but says nothing about reward; conditioning on $\mathcal{O}_{1:T} = 1$ encodes the assumption that the agent acts optimally,
which naturally leads to an optimal trajectory distribution
$$p(\tau \mid \mathcal{O}_{1:T}) \propto p(\tau) \exp\left(\sum_t r(s_t, a_t)\right).$$
Now that this model gives us an optimal trajectory distribution, we can perform inference within the graph to derive optimal policies. That is, our policy is
$$\pi(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{t:T}).$$
This can be done either through direct inference or variational inference.
Direct Inference
In direct inference, we take inspiration from the Forward-Backward Algorithm and use a similar message-passing technique.
- First, we compute the backward messages $\beta_t(s_t, a_t)$ and $\beta_t(s_t)$.
- Using the backward messages, we can then recover the policy $\pi(a_t \mid s_t)$.
- For some applications, it's also useful to find the forward messages $\alpha_t(s_t)$.
Backward Message
A backward state-action message is defined as
$$\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t),$$
the probability of being optimal from time $t$ through the end of the horizon given the current state and action. Expanding out this definition gives us
$$\beta_t(s_t, a_t) = \int p(\mathcal{O}_{t+1:T} \mid s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, p(\mathcal{O}_t \mid s_t, a_t)\, ds_{t+1},$$
and the first term, the backward state message, is
$$\beta_{t+1}(s_{t+1}) = p(\mathcal{O}_{t+1:T} \mid s_{t+1}) = \int \beta_{t+1}(s_{t+1}, a_{t+1})\, p(a_{t+1} \mid s_{t+1})\, da_{t+1}.$$
We can assume that the action prior $p(a_{t+1} \mid s_{t+1})$ is uniform, which only contributes a constant and can be ignored (or folded into the reward).
The backward message passing thus amounts to alternating between computing the two messages, starting from $t = T$ with $\beta_T(s_T, a_T) = p(\mathcal{O}_T \mid s_T, a_T)$ and recursing back to $t = 1$.
Though the above equations don't seem very intuitive, we can recast them in the form of Value Iteration if we let
$$V_t(s_t) = \log \beta_t(s_t), \qquad Q_t(s_t, a_t) = \log \beta_t(s_t, a_t).$$
Then, the above steps are the following:
1. $Q_t(s_t, a_t) = r(s_t, a_t) + \log E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[\exp\big(V_{t+1}(s_{t+1})\big)\big]$
2. $V_t(s_t) = \log \int \exp\big(Q_t(s_t, a_t)\big)\, da_t$
Note that rather than computing a hard maximum, the log of the exponentials finds a "soft" maximum (not to be confused with the softmax function).
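To make the recursion concrete, here is a minimal sketch of the backward pass as soft value iteration on a small tabular, finite-horizon MDP. The array names `R` (rewards), `P` (transition probabilities), and the horizon `T` are illustrative assumptions, not part of the original note.

```python
import numpy as np
from scipy.special import logsumexp

def soft_backward_pass(R, P, T):
    """Backward message passing in log space for a tabular finite-horizon MDP.

    R: (S, A) rewards, P: (S, A, S) transition probabilities, T: horizon.
    Implements, per timestep,
      Q_t(s, a) = r(s, a) + log E_{s'}[exp(V_{t+1}(s'))]
      V_t(s)    = log sum_a exp(Q_t(s, a))   # "soft" max over actions
    """
    S, A = R.shape
    Q, V = np.zeros((T, S, A)), np.zeros((T, S))
    V_next = np.zeros(S)  # log beta_{T+1} = 0, i.e. beta_{T+1} = 1
    for t in reversed(range(T)):
        # log E_{s'}[exp(V_{t+1}(s'))], computed for every (s, a) pair
        Q[t] = R + logsumexp(np.log(P + 1e-300) + V_next[None, None, :], axis=2)
        V[t] = logsumexp(Q[t], axis=1)
        V_next = V[t]
    return Q, V
```

Replacing the action log-sum-exp with a hard max (and the next-state term with a plain expectation) recovers ordinary finite-horizon value iteration, which is exactly the "soft maximum" point above.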
Policy Derivation
Using the backward messages, we can derive, through Bayes' Theorem,
$$\pi(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{t:T}) = \frac{p(\mathcal{O}_{t:T} \mid s_t, a_t)\, p(a_t \mid s_t)}{p(\mathcal{O}_{t:T} \mid s_t)} \propto \frac{\beta_t(s_t, a_t)}{\beta_t(s_t)}$$
under the uniform action prior. Note that in the context of our value functions, this is equivalent to
$$\pi(a_t \mid s_t) = \exp\big(Q_t(s_t, a_t) - V_t(s_t)\big) = \exp\big(A_t(s_t, a_t)\big),$$
which essentially gives better actions exponentially higher probability.
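In code, this policy is simply a softmax over the soft Q-values; the sketch below reuses the (assumed) per-timestep arrays from the backward pass above.

```python
import numpy as np

def soft_policy(Q_t, V_t):
    """pi(a | s) = exp(Q_t(s, a) - V_t(s)).

    Rows already sum to 1 because V_t(s) = logsumexp_a Q_t(s, a), so this is
    equivalent to a softmax over actions.
    """
    return np.exp(Q_t - V_t[:, None])
```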
Forward Message
A forward message is defined as
$$\alpha_t(s_t) = p(s_t \mid \mathcal{O}_{1:t-1}),$$
the probability of reaching state $s_t$ given that the agent was optimal at all previous timesteps. It obeys the recursion
$$\alpha_t(s_t) = \int p(s_t \mid s_{t-1}, a_{t-1})\, p(a_{t-1} \mid s_{t-1}, \mathcal{O}_{t-1})\, \alpha_{t-1}(s_{t-1})\, ds_{t-1}\, da_{t-1},$$
where the base case is $\alpha_1(s_1) = p(s_1)$.
More importantly, if we now compute the state marginal under optimality, we find
$$p(s_t \mid \mathcal{O}_{1:T}) \propto \beta_t(s_t)\, \alpha_t(s_t).$$
Intuitively, this tells us that the state distribution is the intersection of states with high probability of reaching the goal (backward) and states with high probability of originating from the initial state (forward).
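A minimal sketch of the forward recursion and the resulting state marginal, under the same assumed tabular setup as above (`p0` is an assumed initial state distribution, and `V` comes from the backward pass, so that $\beta_t(s) = \exp(V_t(s))$):

```python
import numpy as np

def forward_pass(R, P, p0, T):
    """Forward messages alpha_t(s) = p(s_t | O_{1:t-1}) for a tabular MDP.

    Uses p(a | s, O_t) proportional to p(O_t | s, a) = exp(r(s, a)) under a
    uniform action prior.
    """
    act = np.exp(R)
    act /= act.sum(axis=1, keepdims=True)      # p(a | s, O_t), shape (S, A)
    alpha = np.zeros((T, R.shape[0]))
    alpha[0] = p0
    for t in range(1, T):
        # alpha_t(s') = sum_{s,a} p(s' | s, a) p(a | s, O) alpha_{t-1}(s)
        alpha[t] = np.einsum("sap,sa,s->p", P, act, alpha[t - 1])
    return alpha

def state_marginal(alpha, V):
    """p(s_t | O_{1:T}) proportional to beta_t(s) * alpha_t(s) = exp(V_t(s)) * alpha_t(s)."""
    m = np.exp(V) * alpha
    return m / m.sum(axis=1, keepdims=True)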
Variational Inference
Unfortunately, one problem with the above approach is that the backup
$$Q_t(s_t, a_t) = r(s_t, a_t) + \log E_{s_{t+1}}\big[\exp\big(V_{t+1}(s_{t+1})\big)\big]$$
is too optimistic. That is, it takes a "soft" max over the future states, which are stochastic; the resulting value assumes that the future state also turns out well, which is out of our control and completely up to luck. This problem stems from the inference setup itself: the transition marginal is
$$p(s_{t+1} \mid s_t, a_t, \mathcal{O}_{1:T}),$$
which assumes that future states are optimal and thereby gives the "lucky" version of our true environment dynamics $p(s_{t+1} \mid s_t, a_t)$, which is not conditioned on optimality.
What we really want is to find another distribution $q(s_{1:T}, a_{1:T})$ that is close to the posterior $p(s_{1:T}, a_{1:T} \mid \mathcal{O}_{1:T})$ while keeping the true dynamics fixed.
First, to enforce the dynamics, we define
$$q(s_{1:T}, a_{1:T}) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t),$$
where the only free component, $q(a_t \mid s_t)$, plays the role of our policy. Substituting this $q$ into the variational lower bound on the evidence gives
$$\log p(\mathcal{O}_{1:T}) \geq E_{(s_t, a_t) \sim q}\left[\sum_t r(s_t, a_t) + \mathcal{H}\big(q(a_t \mid s_t)\big)\right].$$
Tightening the bound by maximizing this quantity is thereby equivalent to maximizing reward and action entropy. From here, it can be shown that the optimal policy is
$$q(a_t \mid s_t) = \exp\big(Q_t(s_t, a_t) - V_t(s_t)\big),$$
where our value functions are still "soft,"
$$Q_t(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1}}\big[V_{t+1}(s_{t+1})\big], \qquad V_t(s_t) = \log \int \exp\big(Q_t(s_t, a_t)\big)\, da_t,$$
but the next-state term is now a plain expectation under the true dynamics rather than the optimistic soft max.
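The only change to the earlier backward-pass sketch is that next-state term; a sketch under the same assumed tabular setup:

```python
import numpy as np
from scipy.special import logsumexp

def soft_backward_pass_variational(R, P, T):
    """Variational (non-optimistic) soft backup for a tabular finite-horizon MDP.

    Q_t(s, a) = r(s, a) + E_{s'}[V_{t+1}(s')]   # plain expectation, not log E[exp(.)]
    V_t(s)    = log sum_a exp(Q_t(s, a))        # actions are still "soft"-maxed
    """
    S, A = R.shape
    Q, V = np.zeros((T, S, A)), np.zeros((T, S))
    V_next = np.zeros(S)
    for t in reversed(range(T)):
        Q[t] = R + P @ V_next            # (S, A, S) @ (S,) -> (S, A)
        V[t] = logsumexp(Q[t], axis=1)
        V_next = V[t]
    return Q, V
```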
These definitions for the value functions and policy can be immediately substituted into Q-Learning and Policy Gradients to obtain their soft optimality variants, bringing the benefits of improved exploration, easier fine-tuning, and stronger robustness. Using a similar idea gives us Entropy Regularization, which forms the basis of Soft Actor-Critic.
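As one illustration (not from the original note), the only change a soft-optimality variant makes to the standard Q-learning target is replacing the hard max over next actions with the soft value:

```python
from scipy.special import logsumexp

def soft_q_target(r, q_next, gamma=0.99, done=False):
    """Bootstrapped target for a single transition in soft Q-learning.

    Standard Q-learning uses max_a' Q(s', a'); the soft variant uses
    V(s') = logsumexp_a' Q(s', a').
    q_next: 1-D array of Q(s', a') over next actions.
    """
    v_next = 0.0 if done else logsumexp(q_next)
    return r + gamma * v_next
```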