Maximum entropy (MaxEnt) is an Inverse Reinforcement Learning method that uses the principles of Control As Inference to derive a reward function from stochastic behavior.
First, recall from the inference framework that
If we parameterize
Since the first term isn't dependent on
where
Intuitively, our solution should assign high reward to trajectories we saw and low reward to everything else.
Taking the gradient of our objective, we have
Observe that if we move the
Note that the first expectation is over the expert policy, and the second is over the soft optimal one under our current reward.
The first expectation can be found via sampling, but the second must be analytically decomposed:
The probability in our expectation can be broken down into
where
By computing the forward and backward messages (assuming we know the dynamics), we have a method of computing the second expectation.
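As a minimal sketch of this step, assuming a small tabular MDP with known dynamics and a uniform action prior, the forward and backward messages can be combined into per-time-step state-action marginals. The function name `state_action_marginals` and the array layout (`P[s, a, s']`, `r[s, a]`, `p0[s]`) are illustrative choices, not taken from the text.

```python
import numpy as np


def state_action_marginals(P, r, p0, T):
    """Soft-optimal state-action marginals for a tabular MDP (illustrative).

    P  : (S, A, S) array, P[s, a, s2] = p(s2 | s, a)  (assumed known dynamics)
    r  : (S, A) array, reward under the current parameters
    p0 : (S,) array, initial state distribution
    T  : horizon (number of time steps)

    Returns mu with shape (T, S, A), where mu[t] is the distribution over
    (s_t, a_t) conditioned on optimality at every time step.
    """
    S, A, _ = P.shape
    exp_r = np.exp(r)

    # Backward messages: beta[t, s, a] is proportional to p(O_{t:T} | s_t, a_t).
    beta = np.zeros((T, S, A))
    beta[T - 1] = exp_r
    for t in range(T - 2, -1, -1):
        # State message: average over actions (uniform action prior assumed).
        beta_state_next = beta[t + 1].mean(axis=1)
        # (S, A, S) @ (S,) -> (S, A): expectation over the next state.
        beta[t] = exp_r * (P @ beta_state_next)

    # Forward messages: alpha[t, s] is proportional to p(s_t | O_{1:t-1}).
    alpha = np.zeros((T, S))
    alpha[0] = p0
    for t in range(T - 1):
        weights = alpha[t][:, None] * exp_r / A  # (S, A)
        alpha[t + 1] = np.einsum("sa,sap->p", weights, P)
        alpha[t + 1] /= alpha[t + 1].sum()  # constants cancel after normalizing

    # Marginal at each step is proportional to beta(s_t, a_t) * alpha(s_t).
    mu = alpha[:, :, None] * beta
    mu /= mu.sum(axis=(1, 2), keepdims=True)
    return mu
```

Weighting the reward gradient by `mu[t]` and summing over `t`, `s`, and `a` then gives the second expectation. The exponentials are left unnormalized here for readability; a long horizon or large rewards would call for working in log space.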
Thus, the complete gradient is
The MaxEnt algorithm essentially optimizes
The name of this algorithm comes from the observation that if we let
much like Feature Matching.
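The steps of this derivation can be written out in one common notation. This is a sketch under assumed symbols, none of which are fixed by the text: $\psi$ for the reward parameters, $r_\psi$ for the reward, $\mathcal{O}_{1:T}$ for the optimality variables, $Z$ for the partition function, $\pi^*$ for the expert policy, $\alpha$ and $\beta$ for the forward and backward messages, $\mu_t$ for the state-action marginal, and $f$ for a feature vector.

$$
p(\tau \mid \mathcal{O}_{1:T}, \psi) \propto p(\tau) \exp\Big(\sum_t r_\psi(s_t, a_t)\Big)
$$

$$
\max_\psi \; \frac{1}{N} \sum_{i=1}^N r_\psi(\tau_i) - \log Z, \qquad Z = \int p(\tau) \exp\big(r_\psi(\tau)\big) \, d\tau
$$

$$
\nabla_\psi \mathcal{L} = \mathbb{E}_{\tau \sim \pi^*}\big[\nabla_\psi r_\psi(\tau)\big] - \mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T}, \psi)}\big[\nabla_\psi r_\psi(\tau)\big]
$$

$$
\mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T}, \psi)}\big[\nabla_\psi r_\psi(\tau)\big] = \sum_{t=1}^T \sum_{s, a} \mu_t(s, a) \, \nabla_\psi r_\psi(s, a), \qquad \mu_t(s_t, a_t) \propto \beta(s_t, a_t) \, \alpha(s_t)
$$

With a linear reward $r_\psi(s, a) = \psi^\top f(s, a)$, the same optimum is characterized by

$$
\max_\psi \; \mathcal{H}\big(\pi^{r_\psi}\big) \quad \text{such that} \quad \mathbb{E}_{\pi^{r_\psi}}[f] = \mathbb{E}_{\pi^*}[f]
$$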