Maximum entropy (MaxEnt) is an ⌛️ Inverse Reinforcement Learning method that uses the principles of 🎛️ Control As Inference to derive a reward function from stochastic behavior.

First, recall from the inference framework that

$$ p(\tau \mid \mathcal{O}_{1:T}) \propto p(\tau) \exp\left( \sum_{t=1}^T r(s_t, a_t) \right). $$

If we parameterize the reward $r_\psi$ with parameters $\psi$ and assume that our trajectories $\{\tau_i\}_{i=1}^N$ come from an optimal policy, we seek to maximize the expected log-probability of the trajectories given optimality,

$$ \max_\psi \frac{1}{N} \sum_{i=1}^N \log p(\tau_i \mid \mathcal{O}_{1:T}, \psi) = \max_\psi \frac{1}{N} \sum_{i=1}^N \Big[ \log p(\tau_i) + r_\psi(\tau_i) - \log Z \Big], $$

where $r_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$.

Since the first term isn't dependent on $\psi$, this objective (using an expectation over trajectories) simplifies to

$$ \mathcal{L}(\psi) = \mathbb{E}_{\tau \sim \pi^*(\tau)}\big[ r_\psi(\tau) \big] - \log Z, $$

where $Z$ is our partition function,

$$ Z = \int p(\tau) \exp\big( r_\psi(\tau) \big) \, d\tau. $$

Intuitively, our solution should assign high reward to trajectories we saw and low reward to everything else.

Taking the gradient of our objective, we have

$$ \nabla_\psi \mathcal{L} = \mathbb{E}_{\tau \sim \pi^*(\tau)}\big[ \nabla_\psi r_\psi(\tau) \big] - \frac{1}{Z} \int p(\tau) \exp\big( r_\psi(\tau) \big) \, \nabla_\psi r_\psi(\tau) \, d\tau. $$

Observe that if we move the $\frac{1}{Z}$ into the integral, we recover $p(\tau \mid \mathcal{O}_{1:T}, \psi)$, so this gradient simplifies to

$$ \nabla_\psi \mathcal{L} = \mathbb{E}_{\tau \sim \pi^*(\tau)}\big[ \nabla_\psi r_\psi(\tau) \big] - \mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T}, \psi)}\big[ \nabla_\psi r_\psi(\tau) \big]. $$

Note that the first expectation is over the expert policy, and the second is over the soft optimal one under our current reward.

The first expectation can be estimated by sampling expert trajectories, but the second must be decomposed analytically:

$$ \mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T}, \psi)}\big[ \nabla_\psi r_\psi(\tau) \big] = \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p(s_t, a_t \mid \mathcal{O}_{1:T}, \psi)}\big[ \nabla_\psi r_\psi(s_t, a_t) \big]. $$

The probability in our expectation can be broken down into

$$ p(s_t, a_t \mid \mathcal{O}_{1:T}, \psi) = p(a_t \mid s_t, \mathcal{O}_{1:T}, \psi) \, p(s_t \mid \mathcal{O}_{1:T}, \psi) \propto \beta_t(s_t, a_t) \, \alpha_t(s_t), $$

where $\beta$ and $\alpha$ are our backward and forward messages from control as inference. From that framework, the product of the backward and forward messages represents the state-action visitation probability,

$$ \mu_t(s_t, a_t) \propto \beta_t(s_t, a_t) \, \alpha_t(s_t). $$

By computing the forward and backward messages (assuming we know the dynamics), we have a method of computing the second expectation.
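To make this step concrete, here is a minimal sketch of how the visitations $\mu_t$ could be computed in a small tabular MDP with known dynamics. The function name `soft_visitations` and the argument layout are hypothetical; it assumes a uniform action prior and a uniform initial state distribution, and it works in probability space rather than log space for readability.

```python
import numpy as np

def soft_visitations(P, r, T):
    """Compute state-action visitations mu_t under the soft-optimal policy
    induced by reward r, for a tabular MDP with known dynamics.

    P: transition tensor, P[s, a, s'] = p(s' | s, a)
    r: reward table, r[s, a]
    T: horizon
    Returns mu of shape (T, S, A), with mu[t] proportional to beta_t * alpha_t.
    """
    S, A = r.shape

    # Backward messages: beta_t(s, a) = p(O_{t:T} | s_t, a_t)
    beta = np.zeros((T, S, A))
    beta[T - 1] = np.exp(r)                   # base case at the final step
    for t in range(T - 2, -1, -1):
        beta_s = beta[t + 1].mean(axis=1)     # beta_{t+1}(s') under a uniform action prior
        beta[t] = np.exp(r) * (P @ beta_s)    # exp(r(s,a)) * E_{s'|s,a}[beta_{t+1}(s')]

    # Forward messages: alpha_t(s) = p(s_t | O_{1:t-1}), up to normalization
    alpha = np.zeros((T, S))
    alpha[0] = np.full(S, 1.0 / S)            # assume a uniform initial state distribution
    for t in range(1, T):
        # p(s_{t-1}, a_{t-1} | O_{1:t-1}) is proportional to alpha(s) * exp(r(s,a)) * p(a|s)
        sa = alpha[t - 1][:, None] * np.exp(r) / A
        nxt = np.einsum("sa,sap->p", sa, P)   # propagate through the dynamics
        alpha[t] = nxt / nxt.sum()

    # State-action visitations: mu_t(s, a) proportional to beta_t(s, a) * alpha_t(s)
    mu = beta * alpha[:, :, None]
    mu /= mu.sum(axis=(1, 2), keepdims=True)
    return mu
```

In practice these recursions would be run in log space, since the exponentiated rewards can easily overflow over long horizons.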

Thus, the complete gradient is

$$ \nabla_\psi \mathcal{L} = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\psi r_\psi(s_{i,t}, a_{i,t}) - \sum_{t=1}^T \iint \mu_t(s_t, a_t) \, \nabla_\psi r_\psi(s_t, a_t) \, ds_t \, da_t. $$

The MaxEnt algorithm essentially performs gradient ascent with this gradient: we iteratively compute the messages $\beta_t$ and $\alpha_t$ (and thus the visitations $\mu_t$), evaluate $\nabla_\psi \mathcal{L}$, and then update $\psi \leftarrow \psi + \eta \nabla_\psi \mathcal{L}$.
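Putting the pieces together, a minimal sketch of this loop might look like the following, assuming a linear reward $r_\psi(s, a) = \psi^\top f(s, a)$, expert demonstrations given as lists of $(s, a)$ index pairs, and the hypothetical `soft_visitations` helper from the sketch above; the learning rate and iteration count are illustrative.

```python
def maxent_irl(P, features, demos, T, lr=0.1, iters=100):
    """Sketch of the MaxEnt IRL loop with a linear reward r_psi(s, a) = psi^T f(s, a).

    P:        transition tensor, P[s, a, s'] = p(s' | s, a)
    features: feature tensor, features[s, a] = f(s, a), shape (S, A, K)
    demos:    expert trajectories, each a list of (state, action) index pairs
    """
    K = features.shape[-1]
    psi = np.zeros(K)

    # First expectation: average feature counts along the expert trajectories
    expert_f = np.zeros(K)
    for traj in demos:
        for s, a in traj:
            expert_f += features[s, a]
    expert_f /= len(demos)

    for _ in range(iters):
        r = features @ psi                    # current reward table r[s, a] = psi^T f(s, a)
        mu = soft_visitations(P, r, T)        # visitations under the current soft-optimal policy
        soft_f = np.einsum("tsa,sak->k", mu, features)  # second expectation: soft feature counts
        psi += lr * (expert_f - soft_f)       # gradient ascent on the log-likelihood
    return psi
```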

The name of this algorithm comes from the observation that if we let $r_\psi(s_t, a_t) = \psi^\top f(s_t, a_t)$ for some feature function $f$, it can be shown that MaxEnt optimizes

$$ \max_\psi \mathcal{H}\big( \pi^{r_\psi} \big) \quad \text{such that} \quad \mathbb{E}_{\pi^{r_\psi}}\big[ f(s, a) \big] = \mathbb{E}_{\pi^*}\big[ f(s, a) \big], $$

much like 🃏 Feature Matching.
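The connection is visible directly in the gradient above: with a linear reward, $\nabla_\psi r_\psi(s_t, a_t) = f(s_t, a_t)$, so the gradient reduces to a difference of feature expectations,

$$ \nabla_\psi \mathcal{L} = \mathbb{E}_{\pi^*}\!\left[ \sum_t f(s_t, a_t) \right] - \mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T}, \psi)}\!\left[ \sum_t f(s_t, a_t) \right], $$

which vanishes exactly when the soft-optimal policy matches the expert's feature expectations.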