Control as inference is a probabilistic framework that recasts reinforcement learning as inference in a 🪩 Probabilistic Graphical Model. Its key advantage is that, unlike standard optimal control, it produces "soft" policies that can explain stochastic behavior and model suboptimal behavior as well.

To start, we describe a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ via a graphical model with states $s_t$, actions $a_t$, and dynamics $p(s_{t+1} \mid s_t, a_t)$, where each timestep also carries a variable $\mathcal{O}_t$, depending on $(s_t, a_t)$, that represents "optimality."

The reason we have optimality variables is that a graph with only $s_t$ and $a_t$ would simply describe the environment's dynamics. To incorporate some notion of "good" and "bad," we use the optimality variable $\mathcal{O}_t$, which can be thought of as a binary variable that's $1$ if the timestep is optimal and $0$ otherwise. We can then define the probability of optimality as

$$p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big),$$

which naturally leads to an optimal trajectory distribution

$$p(\tau \mid \mathcal{O}_{1:T}) \propto p(\tau) \exp\left( \sum_{t=1}^{T} r(s_t, a_t) \right).$$
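To make this concrete, here is a minimal numpy sketch (the two-state MDP, its rewards, and the horizon are invented purely for illustration) that enumerates every trajectory of a tiny MDP, weights each one by $\exp\left(\sum_t r(s_t, a_t)\right)$, and normalizes to obtain the optimal trajectory distribution.

```python
import itertools
import numpy as np

# Toy MDP with invented numbers: 2 states, 2 actions, horizon 3.
n_s, n_a, T = 2, 2, 3
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] = p(s' | s, a)
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[0.0, -1.0],                 # r[s, a]; rewards <= 0 so exp(r) <= 1
              [-2.0, -0.5]])
p_a = np.full(n_a, 1.0 / n_a)              # uniform action prior p(a | s)
p_s1 = np.array([1.0, 0.0])                # initial state distribution

# Enumerate all trajectories tau = (s_1, a_1, ..., s_T, a_T): compute p(tau)
# under the prior and the optimality weight exp(sum_t r(s_t, a_t)).
probs, weights = [], []
for states in itertools.product(range(n_s), repeat=T):
    for actions in itertools.product(range(n_a), repeat=T):
        p_tau = p_s1[states[0]]
        for t in range(T):
            p_tau *= p_a[actions[t]]
            if t + 1 < T:
                p_tau *= P[states[t], actions[t], states[t + 1]]
        probs.append(p_tau)
        weights.append(np.exp(sum(r[states[t], actions[t]] for t in range(T))))

probs, weights = np.array(probs), np.array(weights)
p_opt = probs * weights / np.sum(probs * weights)   # p(tau | O_{1:T})
print("most likely trajectory under optimality:", np.argmax(p_opt))
```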

Now that this model gives us an optimal trajectory distribution, we can perform inference within the graph to derive optimal policies. That is, our policy is

$$\pi(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{1:T}).$$
This can be done either through direct inference or variational inference.

Direct Inference

In direct inference, we take inspiration from the Forward-Backward Algorithm and use a similar message-passing technique.

  1. First, we compute the backward messages $\beta_t(s_t, a_t)$ and $\beta_t(s_t)$.
  2. Using the backward messages, we can then recover the policy $\pi(a_t \mid s_t)$.
  3. For some applications, it's also useful to find the forward messages $\alpha_t(s_t)$.

Backward Message

A backward state-action message is defined as

$$\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t).$$

Expanding out this definition gives us

$$\beta_t(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ p(\mathcal{O}_{t+1:T} \mid s_{t+1}) \big]\, p(\mathcal{O}_t \mid s_t, a_t),$$

and the first term, the backward state message, is

$$\beta_{t+1}(s_{t+1}) = p(\mathcal{O}_{t+1:T} \mid s_{t+1}) = \int p(\mathcal{O}_{t+1:T} \mid s_{t+1}, a_{t+1})\, p(a_{t+1} \mid s_{t+1})\, da_{t+1}.$$

We can assume that $p(a_{t+1} \mid s_{t+1})$ (actions taken without optimality in mind) is uniform, and noting that the first term in the integrand is the previous backward message, we have

$$\beta_{t+1}(s_{t+1}) \propto \int \beta_{t+1}(s_{t+1}, a_{t+1})\, da_{t+1}.$$

The backward message passing thus amounts to alternating between computing the two messages, starting from $t = T$ down to $t = 1$ (with $\beta_{T+1} \equiv 1$):

  1. $\beta_t(s_t, a_t) = p(\mathcal{O}_t \mid s_t, a_t)\, \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ \beta_{t+1}(s_{t+1}) \big]$
  2. $\beta_t(s_t) = \mathbb{E}_{a_t \sim p(a_t \mid s_t)}\big[ \beta_t(s_t, a_t) \big]$

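As a concrete sketch of this recursion, the following numpy snippet (the random tabular MDP and the uniform action prior are assumptions for illustration) alternates between the two messages from the last timestep backward.

```python
import numpy as np

# Invented random tabular MDP: 3 states, 2 actions, horizon 5.
n_s, n_a, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] = p(s' | s, a)
r = -rng.uniform(size=(n_s, n_a))                  # rewards <= 0, so exp(r) is a valid probability

beta_sa = np.zeros((T, n_s, n_a))                  # beta_t(s, a) = p(O_{t:T} | s, a)
beta_s = np.zeros((T, n_s))                        # beta_t(s)    = p(O_{t:T} | s)

for t in reversed(range(T)):
    # 1. beta_t(s, a) = p(O_t | s, a) * E_{s' ~ p(.|s,a)}[beta_{t+1}(s')], with beta_{T+1} = 1.
    cont = np.ones(n_s) if t == T - 1 else beta_s[t + 1]
    beta_sa[t] = np.exp(r) * np.einsum("ijk,k->ij", P, cont)
    # 2. beta_t(s) = E_{a ~ uniform}[beta_t(s, a)]
    beta_s[t] = beta_sa[t].mean(axis=1)

print("beta_1(s):", beta_s[0])
```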
Though the above equations don't seem very intuitive, we can recast them in the form of 💎 Value Iteration if we let

$$V_t(s_t) = \log \beta_t(s_t), \qquad Q_t(s_t, a_t) = \log \beta_t(s_t, a_t).$$

Then, the above steps are the following:

  1. $Q_t(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ \exp\big(V_{t+1}(s_{t+1})\big) \big]$
  2. $V_t(s_t) = \log \int \exp\big(Q_t(s_t, a_t)\big)\, da_t$

Note that rather than computing a hard maximum, the log of an expectation (or integral) of exponentials computes a "soft" maximum (not to be confused with softmax).
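The same backward pass can be run in log space, where it reads exactly like value iteration with a soft max; this sketch (same kind of invented random tabular MDP as above) uses `scipy.special.logsumexp` for the soft max over actions.

```python
import numpy as np
from scipy.special import logsumexp

n_s, n_a, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s' | s, a)
r = -rng.uniform(size=(n_s, n_a))

V = np.zeros((T + 1, n_s))                          # V_{T+1}(s) = 0, i.e. beta_{T+1} = 1
Q = np.zeros((T, n_s, n_a))
for t in reversed(range(T)):
    # Q_t(s, a) = r(s, a) + log E_{s'}[exp(V_{t+1}(s'))]: a "soft" max over next states.
    Q[t] = r + np.log(np.einsum("ijk,k->ij", P, np.exp(V[t + 1])))
    # V_t(s) = log mean_a exp(Q_t(s, a)): a "soft" max over actions under a uniform prior.
    V[t] = logsumexp(Q[t], axis=1) - np.log(n_a)

print("soft values V_1(s):", V[0])
```

The $-\log n_a$ term only accounts for the uniform action prior; it shifts $V_t$ by a constant, and the normalized policy $\propto \exp(Q_t)$ is unaffected.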

Policy Derivation

Using the backward messages, we can derive, through 🪙 Bayes' Theorem,

$$\pi(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{t:T}) = \frac{p(\mathcal{O}_{t:T} \mid s_t, a_t)\, p(a_t \mid s_t)}{p(\mathcal{O}_{t:T} \mid s_t)} \propto \frac{\beta_t(s_t, a_t)}{\beta_t(s_t)}$$

(past optimality variables are irrelevant given $s_t$, and the uniform $p(a_t \mid s_t)$ is absorbed into the proportionality). Note that in the context of our value functions, this is equivalent to

$$\pi(a_t \mid s_t) = \exp\big(Q_t(s_t, a_t) - V_t(s_t)\big),$$
which essentially gives better actions an exponentially better probability.
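As a quick sketch (the soft $Q$-values below are made-up numbers for a single timestep), the policy is exactly a softmax over the soft $Q$-values once it is normalized:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical soft Q-values for one timestep: 3 states x 2 actions.
Q = np.array([[ 0.0, -1.0],
              [-0.5, -0.3],
              [-2.0,  0.0]])

V = logsumexp(Q, axis=1, keepdims=True)   # normalizer: V(s) = log sum_a exp(Q(s, a))
pi = np.exp(Q - V)                        # pi(a | s) = exp(Q(s, a) - V(s))

print(pi)                                 # each row is a distribution over actions
print(pi.sum(axis=1))                     # rows sum to 1
```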

Forward Message

A forward message $\alpha_t(s_t) = p(s_t \mid \mathcal{O}_{1:t-1})$ gives us more insight into the trajectories we'll reach given optimality. Some lengthy algebra gives us

$$\alpha_t(s_t) = \int p(s_t \mid s_{t-1}, a_{t-1})\, p(a_{t-1} \mid s_{t-1}, \mathcal{O}_{t-1})\, \alpha_{t-1}(s_{t-1})\, ds_{t-1}\, da_{t-1},$$

where

$$p(a_{t-1} \mid s_{t-1}, \mathcal{O}_{t-1}) = \frac{p(\mathcal{O}_{t-1} \mid s_{t-1}, a_{t-1})\, p(a_{t-1} \mid s_{t-1})}{p(\mathcal{O}_{t-1} \mid s_{t-1})}, \qquad \alpha_1(s_1) = p(s_1).$$

More importantly, if we now compute the state marginal under optimality, we find

$$p(s_t \mid \mathcal{O}_{1:T}) \propto \beta_t(s_t)\, \alpha_t(s_t).$$
Intuitively, this tells us that the state distribution is the intersection of states with high probability of reaching the goal (backward) and states with high probability of originating from the initial state (forward).
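Here is a minimal sketch of the forward pass and the resulting state marginals (again an invented random tabular MDP with a uniform action prior); it recomputes the backward state messages so that $\alpha_t(s_t)\,\beta_t(s_t)$ can be normalized at each timestep.

```python
import numpy as np

n_s, n_a, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))      # p(s' | s, a)
r = -rng.uniform(size=(n_s, n_a))
p_s1 = np.array([1.0, 0.0, 0.0])                       # initial state distribution

# Backward state messages beta_t(s), as in the recursion above (uniform action prior).
beta_s = np.zeros((T, n_s))
for t in reversed(range(T)):
    cont = np.ones(n_s) if t == T - 1 else beta_s[t + 1]
    beta_s[t] = (np.exp(r) * np.einsum("ijk,k->ij", P, cont)).mean(axis=1)

# Forward messages alpha_t(s_t) = p(s_t | O_{1:t-1}).
# p(a | s, O) in the recursion is a softmax of the immediate reward (uniform prior).
pa_given_O = np.exp(r) / np.exp(r).sum(axis=1, keepdims=True)
alpha = np.zeros((T, n_s))
alpha[0] = p_s1
for t in range(1, T):
    alpha[t] = np.einsum("ijk,ij,i->k", P, pa_given_O, alpha[t - 1])

# State marginal under optimality: p(s_t | O_{1:T}) is proportional to alpha_t(s_t) * beta_t(s_t).
marginal = alpha * beta_s
marginal /= marginal.sum(axis=1, keepdims=True)
print(marginal)        # one row (a distribution over states) per timestep
```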

Variational Inference

Unfortunately, one problem with the above approach is that

$$Q_t(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ \exp\big(V_{t+1}(s_{t+1})\big) \big]$$

is too optimistic. That is, it's taking a "soft" max over the future states, which are stochastic; thus, this value assumes the future state will also turn out to be optimal, which is out of our control and completely up to luck. This problem stems from the inference setup itself: the transition marginal is

$$p(s_{t+1} \mid s_t, a_t, \mathcal{O}_{1:T}) \propto p(s_{t+1} \mid s_t, a_t)\, \beta_{t+1}(s_{t+1}),$$

which assumes that future states are optimal and thereby gives the "lucky" version of our true environment dynamics $p(s_{t+1} \mid s_t, a_t)$, which is not conditioned on $\mathcal{O}_{1:T}$.
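To see the optimism concretely, here is a tiny sketch with made-up numbers: conditioning the dynamics on future optimality shifts probability mass toward a rare but high-value successor state.

```python
import numpy as np

# One (s, a) pair with two possible successor states (hypothetical numbers):
# a likely "safe" state with low value and a rare "jackpot" state with high value.
p_next = np.array([0.99, 0.01])        # true dynamics p(s' | s, a)
V_next = np.array([-5.0, 0.0])         # soft values V_{t+1}(s') of the successors
beta_next = np.exp(V_next)             # beta_{t+1}(s') = exp(V_{t+1}(s'))

# Posterior ("lucky") dynamics: proportional to p(s' | s, a) * beta_{t+1}(s').
posterior = p_next * beta_next
posterior /= posterior.sum()

print("true dynamics:     ", p_next)
print("posterior dynamics:", posterior)
```

With these numbers, the rare state's probability jumps from 1% under the true dynamics to roughly 60% under the posterior dynamics, even though the agent has no control over which successor it gets.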

What we really want is to find another distribution $q(s_{1:T}, a_{1:T})$ that's close to $p(s_{1:T}, a_{1:T} \mid \mathcal{O}_{1:T})$ but that uses the true dynamics $p(s_{t+1} \mid s_t, a_t)$. In this reframing, we have an approximation problem; if we let $x = \mathcal{O}_{1:T}$ and $z = (s_{1:T}, a_{1:T})$, we see that our problem reduces to finding a $q(z)$ that approximates $p(z \mid x)$, which can be done via variational inference.

First, to enforce the dynamics, we define

$$q(s_{1:T}, a_{1:T}) = p(s_1) \left[ \prod_{t=1}^{T-1} p(s_{t+1} \mid s_t, a_t) \right] \prod_{t=1}^{T} q(a_t \mid s_t),$$

where $q(a_t \mid s_t)$, our policy, is the only part of the distribution we can control. Applying the 🧬 Evidence Lower Bound, we have

$$\log p(\mathcal{O}_{1:T}) \geq \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q}\left[ \sum_{t=1}^{T} r(s_t, a_t) + \mathcal{H}\big(q(a_t \mid s_t)\big) \right].$$

Tightening the bound by maximizing this quantity is thereby equivalent to maximizing reward and action entropy. From here, it can be shown that the policy that maximizes this value is

$$\pi(a_t \mid s_t) = q(a_t \mid s_t) = \exp\big(Q_t(s_t, a_t) - V_t(s_t)\big),$$

where our value functions are still "soft,"

$$Q_t(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ V_{t+1}(s_{t+1}) \big], \qquad V_t(s_t) = \log \int \exp\big(Q_t(s_t, a_t)\big)\, da_t.$$
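Here is a sketch of the corrected backup implied by this result (same invented tabular-MDP setup as earlier): the backup over the dynamics is now an ordinary expectation of $V_{t+1}$, while the soft max over actions and the policy $\exp(Q_t - V_t)$ remain.

```python
import numpy as np
from scipy.special import logsumexp

n_s, n_a, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s' | s, a)
r = -rng.uniform(size=(n_s, n_a))

V = np.zeros((T + 1, n_s))                          # V_{T+1}(s) = 0
Q = np.zeros((T, n_s, n_a))
pi = np.zeros((T, n_s, n_a))
for t in reversed(range(T)):
    # Q_t(s, a) = r(s, a) + E_{s'}[V_{t+1}(s')]: plain expectation over the true dynamics.
    Q[t] = r + np.einsum("ijk,k->ij", P, V[t + 1])
    # V_t(s) = log sum_a exp(Q_t(s, a)): the soft max over actions stays.
    V[t] = logsumexp(Q[t], axis=1)
    # Optimal variational policy: q(a | s) = exp(Q_t(s, a) - V_t(s)).
    pi[t] = np.exp(Q[t] - V[t][:, None])

print(pi[0])            # each row sums to 1
```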
These definitions for the value functions and policy can be immediately substituted into 🚀 Q-Learning and 🚓 Policy Gradients for their soft optimality variants, bringing the benefits of improved exploration, easier fine-tuning, and stronger robustness. Using a similar idea gives us 🎲 Entropy Regularization, which forms the basis of 🪶 Soft Actor-Critic.
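As one illustration of that substitution, here is a minimal sketch of a soft Q-learning style update, assuming a discounted infinite-horizon setting with a tabular $Q$; the discount, learning rate, and the single transition tuple are all made up for illustration.

```python
import numpy as np
from scipy.special import logsumexp

n_s, n_a = 4, 2
gamma, lr = 0.99, 0.1
Q = np.zeros((n_s, n_a))             # tabular soft Q-values

# One hypothetical transition (s, a, reward, s').
s, a, reward, s_next = 0, 1, -0.5, 2

# Soft optimality replaces max_a' Q(s', a') with the soft max logsumexp_a' Q(s', a').
target = reward + gamma * logsumexp(Q[s_next])
Q[s, a] += lr * (target - Q[s, a])

# The corresponding soft policy is a softmax over the Q-values.
pi_s = np.exp(Q[s] - logsumexp(Q[s]))
print(target, pi_s)
```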