Guided Cost Learning1 is an extension of the 🎲 MaxEnt algorithm to problems where the environment dynamics are unknown. We start from the gradient definition,

$$\nabla_\psi \mathcal{L} = \mathbb{E}_{\tau \sim \pi^*(\tau)}\big[ \nabla_\psi r_\psi(\tau) \big] - \mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T}, \psi)}\big[ \nabla_\psi r_\psi(\tau) \big],$$

and instead of analytically computing the second expectation, another method would be to sample from the soft optimal policy $p(\tau \mid \mathcal{O}_{1:T}, \psi)$, which can first be derived from any soft RL algorithm from 🎛️ Control As Inference.

However, this would involve solving an entire learning problem at every iteration of the inverse RL step, so a natural improvement would be to “lazily” learn some policy $\pi_\theta$ to approximate the second expectation. That is, rather than solving the entire problem, we’ll improve our solution by a little, then use 🪆 Importance Sampling to account for the approximation.

Then, we have

$$\nabla_\psi \mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\psi r_\psi(\tau_i) - \frac{1}{\sum_j w_j} \sum_{j=1}^{M} w_j \nabla_\psi r_\psi(\tau_j),$$

where $\tau_i$ comes from the expert demonstrations, $\tau_j$ is sampled from our approximate policy $\pi_\theta$, and the importance weight

$$w_j = \frac{\exp\big(r_\psi(\tau_j)\big)}{\pi_\theta(\tau_j)}.$$
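
As a concrete (hypothetical) sketch, suppose the reward is linear in some trajectory features, $r_\psi(\tau) = \psi^\top \phi(\tau)$, so that $\nabla_\psi r_\psi(\tau) = \phi(\tau)$. The estimator above then reduces to a weighted difference of feature averages; the array names below are illustrative, not from the paper.

```python
import numpy as np

def irl_gradient(psi, expert_feats, policy_feats, policy_logprobs):
    """Importance-sampled MaxEnt IRL gradient, assuming a linear reward
    r_psi(tau) = psi @ phi(tau), so grad_psi r_psi(tau) = phi(tau).

    expert_feats:    (N, d) trajectory features from expert demonstrations
    policy_feats:    (M, d) trajectory features of samples from pi_theta
    policy_logprobs: (M,)   log pi_theta(tau_j) for those samples
    """
    # First expectation: average expert feature counts.
    expert_term = expert_feats.mean(axis=0)

    # Importance weights w_j = exp(r_psi(tau_j)) / pi_theta(tau_j),
    # computed in log space for numerical stability.
    log_w = policy_feats @ psi - policy_logprobs
    log_w -= log_w.max()          # shift before exponentiating to avoid overflow
    w = np.exp(log_w)
    w /= w.sum()                  # self-normalization, i.e. the 1 / sum_j w_j factor

    # Second expectation: importance-weighted feature counts under pi_theta.
    policy_term = w @ policy_feats

    return expert_term - policy_term
```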

Alternating between optimizing $r_\psi$ and $\pi_\theta$ in this manner gives us the guided cost learning algorithm, which produces both a reward that describes the expert trajectories and a policy that follows that reward.
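
The outer loop then looks roughly like the sketch below; the helper names (`sample_trajs`, `policy_step`, `reward_step`) are placeholders for whichever sampler, soft RL update, and reward gradient step you plug in, not functions from the paper.

```python
def guided_cost_learning(expert_trajs, policy, reward,
                         sample_trajs, policy_step, reward_step,
                         num_iters=200):
    """Alternating optimization sketch.

    sample_trajs(policy)                             -> trajectories from pi_theta
    policy_step(policy, reward, trajs)               -> policy after a few soft RL steps
    reward_step(reward, expert_trajs, trajs, policy) -> reward after one
                                                        importance-sampled gradient step
    """
    for _ in range(num_iters):
        trajs = sample_trajs(policy)                 # run pi_theta in the environment
        policy = policy_step(policy, reward, trajs)  # improve "a little", not to convergence
        reward = reward_step(reward, expert_trajs, trajs, policy)
    return reward, policy
```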

Generator and Discriminator

This alternating framework can also be interpreted as the policy learning to be more similar to the expert, while the reward tries to make expert trajectories more likely and our policy’s trajectories less likely, thereby distinguishing the two.

Framed this way, we observe a natural connection to 🖼️ Generative Adversarial Networks, with our policy as the generator and the reward function as the discriminator.2 The optimal discriminator has the form

$$D^*(\tau) = \frac{\pi^*(\tau)}{\pi^*(\tau) + \pi_\theta(\tau)},$$

so following this formulation, we can explicitly define a discriminator for our expert and learned trajectories,

$$D_\psi(\tau) = \frac{\frac{1}{Z}\exp\big(r_\psi(\tau)\big)}{\frac{1}{Z}\exp\big(r_\psi(\tau)\big) + \pi_\theta(\tau)}.$$
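
Because this ratio mixes an unnormalized density $\frac{1}{Z}\exp(r_\psi(\tau))$ with a policy likelihood, it is easiest to evaluate in log space. A minimal sketch, assuming (as a simplification) that we track an estimate of $\log Z$ alongside the reward:

```python
import numpy as np

def discriminator_log_probs(reward, log_Z, policy_logprob):
    """log D_psi(tau) and log(1 - D_psi(tau)) for
    D_psi = (exp(r)/Z) / (exp(r)/Z + pi_theta(tau)), evaluated stably.

    reward:         r_psi(tau)                    (scalar or array)
    log_Z:          estimate of the log partition function (assumed tracked)
    policy_logprob: log pi_theta(tau)
    """
    log_p = reward - log_Z                  # log of (1/Z) exp(r_psi(tau))
    log_q = policy_logprob                  # log pi_theta(tau)
    log_denom = np.logaddexp(log_p, log_q)  # log of the denominator
    return log_p - log_denom, log_q - log_denom   # (log D, log(1 - D))
```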

From here, we can use the standard GAN objective for our discriminator,

$$\mathcal{L}_D(\psi) = \mathbb{E}_{\tau \sim \pi^*}\big[-\log D_\psi(\tau)\big] + \mathbb{E}_{\tau \sim \pi_\theta}\big[-\log\big(1 - D_\psi(\tau)\big)\big],$$

and optimize the policy with the gradient

$$\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[\log\big(1 - D_\psi(\tau)\big) - \log D_\psi(\tau)\big].$$
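
In code, the two objectives are just the binary cross-entropy loss for the discriminator and a surrogate reward for the policy: descending the gradient above is the same as maximizing $\log D_\psi(\tau) - \log(1 - D_\psi(\tau))$ per trajectory. A brief sketch with illustrative function names:

```python
import numpy as np

def discriminator_loss(log_D_expert, log_one_minus_D_policy):
    """Standard GAN discriminator loss: expert trajectories labeled real (1),
    policy trajectories labeled fake (0)."""
    return -(np.mean(log_D_expert) + np.mean(log_one_minus_D_policy))

def policy_surrogate_reward(log_D, log_one_minus_D):
    """Per-trajectory quantity the policy maximizes; its negative is the
    term inside the expectation whose gradient is taken above."""
    return log_D - log_one_minus_D
```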

Footnotes

  1. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization (Finn et al., 2016) ↩

  2. A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models (Finn et al., 2016) ↩