Feature matching is an ⌛️ Inverse Reinforcement Learning technique that uses a linear reward function based on features derived from the state and action. For some feature function $f(s, a)$, we define reward
$$r_\psi(s, a) = \psi^\top f(s, a).$$
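As a quick illustration (a sketch with hypothetical names; `features` and `psi` are placeholders rather than anything defined above), the linear reward is just a dot product between a weight vector and the state-action features:

```python
import numpy as np

# Hypothetical feature function: maps a (state, action) pair to a vector.
def features(state, action):
    dx, dy = state                       # e.g. distance to a goal along each axis
    return np.array([dx, dy, float(action == 1)])

psi = np.array([-0.5, -0.5, 0.1])        # reward weights to be learned

def reward(state, action):
    # r_psi(s, a) = psi . f(s, a)
    return float(np.dot(psi, features(state, action)))
```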
The naive matching principle seeks to equalize the expected features across the expert and derived policies; that is, find $\psi$ such that
$$\mathbb{E}_{\pi^{r_\psi}}[f(s, a)] = \mathbb{E}_{\pi^*}[f(s, a)],$$
where $\pi^{r_\psi}$ is the policy derived from reward $r_\psi$ and $\pi^*$ is the expert policy.
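In practice, both sides of this equation are estimated from sampled trajectories. A minimal Monte Carlo sketch (assuming trajectories are lists of `(state, action)` pairs and `features` is a feature function like the one above):

```python
import numpy as np

def feature_expectation(trajectories, features, gamma=0.99):
    """Estimate E[sum_t gamma^t f(s_t, a_t)] by averaging over rollouts."""
    totals = []
    for traj in trajectories:
        total = sum((gamma ** t) * np.asarray(features(s, a), dtype=float)
                    for t, (s, a) in enumerate(traj))
        totals.append(total)
    return np.mean(totals, axis=0)

# Naive feature matching: tune the policy (or the reward weights psi that
# induce it) until its feature expectations equal the expert's.
# mu_expert = feature_expectation(expert_trajectories, features)
# mu_policy = feature_expectation(policy_rollouts, features)
# matched = np.allclose(mu_expert, mu_policy, atol=1e-2)
```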
This is similar to Imitation Learning techniques since we're just trying to get a policy to match the states we visit in our demonstrations.

However, this approach is ambiguous since there can be many reward functions that give the same trajectory distribution and thus the same expectation. To reduce ambiguity, we can instead borrow the maximum margin idea from 🛩️ Support Vector Machines and seek to find some reward function that makes the expert policy much better than all other policies. That is, we want to find $\psi$ such that the expert's expected reward, $\psi^\top \mathbb{E}_{\pi^*}[f(s, a)]$, is better than the next best one by a margin of $m$:
$$\psi^\top \mathbb{E}_{\pi^*}[f(s, a)] \geq \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s, a)] + m.$$
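With a finite set of competing policies, this max-margin search can be posed as a small linear program. The sketch below assumes feature expectations have already been estimated (e.g. with the `feature_expectation` helper above) and boxes $\psi$ to $[-1, 1]^d$ so the margin stays bounded:

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_weights(mu_expert, mu_others):
    """Find reward weights psi separating the expert's feature expectations
    from a finite set of other policies' feature expectations by the largest
    margin m."""
    d = len(mu_expert)
    # Decision variables: [psi_1, ..., psi_d, m]; linprog minimizes, so use -m.
    c = np.zeros(d + 1)
    c[-1] = -1.0
    # One constraint per competing policy: psi.(mu_i - mu_expert) + m <= 0.
    A_ub = np.array([np.append(mu_i - mu_expert, 1.0) for mu_i in mu_others])
    b_ub = np.zeros(len(mu_others))
    bounds = [(-1.0, 1.0)] * d + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    psi, margin = res.x[:d], res.x[-1]
    return psi, margin
```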
However, in complex or continuous settings, there can be policies that are only slightly different from $\pi^*$ and give extremely similar expected reward. Intuitively, what we actually want is for our optimal policy to be much better than clearly distinct policies; in other words, we want the margin to be larger for policies that are more different from the expert and smaller for more similar ones.

Borrowing the SVM primal definition, we can instead optimize the objective
$$\min_\psi \frac{1}{2} \lVert \psi \rVert^2 \quad \text{such that} \quad \psi^\top \mathbb{E}_{\pi^*}[f(s, a)] \geq \max_{\pi \in \Pi} \left( \psi^\top \mathbb{E}_{\pi}[f(s, a)] + D(\pi, \pi^*) \right),$$

where $D(\pi, \pi^*)$ is a divergence measure that replaces the $1$ from the original primal.
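For a finite set of sampled policies with precomputed feature expectations and divergences from the expert, this objective can be handed to a generic constrained optimizer. A rough sketch, where the `divergences` values are assumed to be supplied and `scipy.optimize.minimize` with inequality constraints stands in for a proper QP solver:

```python
import numpy as np
from scipy.optimize import minimize

def svm_style_irl(mu_expert, mu_others, divergences):
    """Solve  min_psi 0.5 * ||psi||^2
       s.t.   psi . mu_expert >= psi . mu_i + D_i  for every other policy i,
    where D_i is a divergence between policy i and the expert that replaces
    the fixed margin of 1 in the standard SVM primal."""
    d = len(mu_expert)

    def objective(psi):
        return 0.5 * np.dot(psi, psi)

    constraints = [
        {"type": "ineq",
         "fun": (lambda psi, mu_i=mu_i, D_i=D_i:
                 np.dot(psi, mu_expert - mu_i) - D_i)}
        for mu_i, D_i in zip(mu_others, divergences)
    ]
    res = minimize(objective, x0=np.zeros(d), constraints=constraints)
    return res.x
```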