Guided Cost Learning¹ is an extension of the 🎲 MaxEnt algorithm to problems where the environment dynamics are unknown. We start from the gradient definition,

$$\nabla_\psi \mathcal{L} = \mathbb{E}_{\tau \sim \pi^*}\big[\nabla_\psi r_\psi(\tau)\big] - \mathbb{E}_{\tau \sim p_\psi}\big[\nabla_\psi r_\psi(\tau)\big], \qquad p_\psi(\tau) \propto \exp\big(r_\psi(\tau)\big),$$
and instead of analytically computing the second expectation, another method would be to sample from $p_\psi$ directly: train the max-ent optimal policy for the current reward $r_\psi$ and average $\nabla_\psi r_\psi$ over its trajectories.
However, this would involve solving an entire learning problem at every iteration of the inverse RL step, so a natural improvement would be to "lazily" learn some policy $\pi_\theta$ that only partially optimizes the current reward, and correct the resulting bias with importance sampling.
Then, we have

$$\nabla_\psi \mathcal{L} = \mathbb{E}_{\tau \sim \pi^*}\big[\nabla_\psi r_\psi(\tau)\big] - \frac{1}{\sum_j w_j} \sum_j w_j \, \nabla_\psi r_\psi(\tau_j),$$

where

$$w_j = \frac{\exp\big(r_\psi(\tau_j)\big)}{\pi_\theta(\tau_j)}$$

are self-normalized importance weights correcting for the fact that the trajectories $\tau_j$ are sampled from $\pi_\theta$ rather than $p_\psi$.
Alternating between optimizing the reward $r_\psi$ and the policy $\pi_\theta$ tightens this approximation over time: as $\pi_\theta$ approaches the max-ent optimal policy for the current reward, the importance weights even out and the gradient estimate improves.
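As a concrete illustration, here is a minimal sketch of one reward update with this self-normalized estimator, in PyTorch. The network `reward_net`, the trajectory featurization, and the stored sampler log-probabilities `logp_sampler` are assumptions made for this sketch; in practice $r_\psi(\tau)$ is the sum of per-step rewards along each trajectory.

```python
import torch

def reward_step(reward_net, optimizer, expert_traj, sampled_traj, logp_sampler):
    """One GCL reward update: expert term minus importance-weighted sample term."""
    r_expert = reward_net(expert_traj).squeeze(-1)    # r_psi(tau_i), shape (N,)
    r_sampled = reward_net(sampled_traj).squeeze(-1)  # r_psi(tau_j), shape (M,)

    # w_j = exp(r_psi(tau_j)) / pi_theta(tau_j), self-normalized so the
    # (intractable) partition function cancels. Detached: the weights are
    # coefficients on grad r_psi, not differentiated through.
    weights = torch.softmax(r_sampled.detach() - logp_sampler, dim=0)

    # Minimizing this surrogate performs gradient ascent on the log-likelihood.
    loss = -(r_expert.mean() - (weights * r_sampled).sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```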
Generator and Discriminator
The alternating framework can alternatively be interpreted as a game: the policy $\pi_\theta$ tries to generate trajectories that are indistinguishable from the expert demonstrations, while the reward $r_\psi$ is trained to tell the two apart.
Framed this way, we observe a natural connection to 🖼️ Generative Adversarial Networks, where our policy is the generator and the reward function is the discriminator.² The optimal discriminator has the form

$$D^*(\tau) = \frac{p^*(\tau)}{p^*(\tau) + \pi_\theta(\tau)},$$

where $p^*$ is the expert trajectory distribution,
so following this formulation, we explicitly define a discriminator for our expert and learned trajectories,

$$D_\psi(\tau) = \frac{\tfrac{1}{Z}\exp\big(r_\psi(\tau)\big)}{\tfrac{1}{Z}\exp\big(r_\psi(\tau)\big) + \pi_\theta(\tau)}.$$
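In log space this discriminator is convenient to evaluate with a `logsumexp`, avoiding exponentiating the reward. A minimal sketch, assuming the intractable $\log Z$ is carried as a learnable scalar (a common practical choice, not part of the derivation above):

```python
import torch

def log_D(r, log_Z, logp_pi):
    """log D_psi(tau): r = r_psi(tau), logp_pi = log pi_theta(tau), shape (B,)."""
    log_p_tau = r - log_Z  # log of exp(r_psi) / Z
    return log_p_tau - torch.logsumexp(torch.stack([log_p_tau, logp_pi]), dim=0)

def log_one_minus_D(r, log_Z, logp_pi):
    """log(1 - D_psi(tau)) = log pi_theta(tau) - log(p_psi(tau) + pi_theta(tau))."""
    log_p_tau = r - log_Z
    return logp_pi - torch.logsumexp(torch.stack([log_p_tau, logp_pi]), dim=0)
```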
From here, we can use the standard GAN objective for our discriminator,

$$\mathcal{L}_D = \mathbb{E}_{\tau \sim \pi^*}\big[-\log D_\psi(\tau)\big] + \mathbb{E}_{\tau \sim \pi_\theta}\big[-\log\big(1 - D_\psi(\tau)\big)\big],$$
and optimize the policy with the gradient

$$\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[\log D_\psi(\tau) - \log\big(1 - D_\psi(\tau)\big)\big],$$

which is equivalent to running max-ent RL with the learned reward $\log D_\psi(\tau) - \log(1 - D_\psi(\tau)) = r_\psi(\tau) - \log Z - \log \pi_\theta(\tau)$.
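Putting the pieces together, a rough sketch of one alternating step, reusing `log_D` and `log_one_minus_D` from above; `policy_gradient_step` stands in for any max-ent policy optimizer and is a hypothetical helper:

```python
import torch

def gan_gcl_step(reward_net, log_Z, disc_opt, expert, sampled,
                 logp_pi_expert, logp_pi_sampled):
    # Discriminator update: expert trajectories are "real", policy samples "fake".
    r_e = reward_net(expert).squeeze(-1)
    r_s = reward_net(sampled).squeeze(-1)
    disc_loss = (-log_D(r_e, log_Z, logp_pi_expert).mean()
                 - log_one_minus_D(r_s, log_Z, logp_pi_sampled).mean())
    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()

    # Policy update: maximize log D - log(1 - D), i.e. run max-ent RL on the
    # implied reward r_psi(tau) - log Z - log pi_theta(tau).
    with torch.no_grad():
        r_s = reward_net(sampled).squeeze(-1)
        policy_reward = (log_D(r_s, log_Z, logp_pi_sampled)
                         - log_one_minus_D(r_s, log_Z, logp_pi_sampled))
    policy_gradient_step(policy_reward, sampled)  # hypothetical RL step
    return disc_loss.item()
```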