Guided Cost Learning1 is an extension of the 🎲 MaxEnt algorithm to problems where the environment dynamics are unknown. We start from the gradient definition,

$$\nabla_\psi \mathcal{L} = \mathbb{E}_{\tau \sim \pi^*(\tau)}\big[ \nabla_\psi r_\psi(\tau) \big] - \mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T}, \psi)}\big[ \nabla_\psi r_\psi(\tau) \big],$$

and instead of analytically computing the second expectation, another method would be to sample from the soft optimal policy $p(\tau \mid \mathcal{O}_{1:T}, \psi)$, which can first be derived from any soft RL algorithm from 🎛️ Control As Inference.

However, this would involve solving an entire learning problem at every iteration of the inverse RL step, so a natural improvement would be to “lazily” learn some policy $\pi_\theta$ to approximate the second expectation. That is, rather than solving the entire problem, we’ll improve our solution by a little, then use 🪆 Importance Sampling to account for the approximation.

Then, we have

$$\nabla_\psi \mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\psi r_\psi(\tau_i) - \frac{1}{\sum_j w_j} \sum_{j=1}^{M} w_j \nabla_\psi r_\psi(\tau_j),$$

where $\tau_i$ comes from the expert demonstrations, $\tau_j$ is sampled from our approximate policy $\pi_\theta$, and the importance weight

$$w_j = \frac{\exp\big(r_\psi(\tau_j)\big)}{\pi_\theta(\tau_j)}.$$
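
As a concrete (hypothetical) sketch, suppose the reward is linear in some trajectory features, $r_\psi(\tau) = \psi^\top \phi(\tau)$, so that $\nabla_\psi r_\psi(\tau) = \phi(\tau)$. The estimator above then reduces to a weighted difference of feature averages; the array names below are illustrative, not from the paper.

```python
import numpy as np

def irl_gradient(psi, expert_feats, policy_feats, policy_logprobs):
    """Importance-sampled MaxEnt IRL gradient, assuming a linear reward
    r_psi(tau) = psi @ phi(tau), so grad_psi r_psi(tau) = phi(tau).

    expert_feats:    (N, d) trajectory features from expert demonstrations
    policy_feats:    (M, d) trajectory features of samples from pi_theta
    policy_logprobs: (M,)   log pi_theta(tau_j) for those samples
    """
    # First expectation: average expert feature counts.
    expert_term = expert_feats.mean(axis=0)

    # Importance weights w_j = exp(r_psi(tau_j)) / pi_theta(tau_j),
    # computed in log space for numerical stability.
    log_w = policy_feats @ psi - policy_logprobs
    log_w -= log_w.max()          # shift before exponentiating to avoid overflow
    w = np.exp(log_w)
    w /= w.sum()                  # self-normalization, i.e. the 1 / sum_j w_j factor

    # Second expectation: importance-weighted feature counts under pi_theta.
    policy_term = w @ policy_feats

    return expert_term - policy_term
```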

Alternating between optimizing $r_\psi$ and $\pi_\theta$ in this manner gives us the guided cost learning algorithm, which produces both a reward that describes the expert trajectories and a policy that follows that reward.
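
The outer loop then looks roughly like the sketch below; the helper names (`sample_trajs`, `policy_step`, `reward_step`) are placeholders for whichever sampler, soft RL update, and reward gradient step you plug in, not functions from the paper.

```python
def guided_cost_learning(expert_trajs, policy, reward,
                         sample_trajs, policy_step, reward_step,
                         num_iters=200):
    """Alternating optimization sketch.

    sample_trajs(policy)                             -> trajectories from pi_theta
    policy_step(policy, reward, trajs)               -> policy after a few soft RL steps
    reward_step(reward, expert_trajs, trajs, policy) -> reward after one
                                                        importance-sampled gradient step
    """
    for _ in range(num_iters):
        trajs = sample_trajs(policy)                 # run pi_theta in the environment
        policy = policy_step(policy, reward, trajs)  # improve "a little", not to convergence
        reward = reward_step(reward, expert_trajs, trajs, policy)
    return reward, policy
```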

Generator and Discriminator

This alternating framework can also be interpreted as the policy learning to be more similar to the expert, while the reward tries to make expert trajectories more likely and our policy’s trajectories less likely, thereby distinguishing the two.

Framed this way, we observe a natural connection to 🖼️ Generative Adversarial Networks, with our policy as the generator and the reward function as the discriminator.2 The optimal discriminator has the form

$$D^*(\tau) = \frac{\pi^*(\tau)}{\pi^*(\tau) + \pi_\theta(\tau)},$$

so following this formulation, we can explicitly define a discriminator for our expert and learned trajectories,

$$D_\psi(\tau) = \frac{\frac{1}{Z}\exp\big(r_\psi(\tau)\big)}{\frac{1}{Z}\exp\big(r_\psi(\tau)\big) + \pi_\theta(\tau)}.$$
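
Because this ratio mixes an unnormalized density $\frac{1}{Z}\exp(r_\psi(\tau))$ with a policy likelihood, it is easiest to evaluate in log space. A minimal sketch, assuming (as a simplification) that we track an estimate of $\log Z$ alongside the reward:

```python
import numpy as np

def discriminator_log_probs(reward, log_Z, policy_logprob):
    """log D_psi(tau) and log(1 - D_psi(tau)) for
    D_psi = (exp(r)/Z) / (exp(r)/Z + pi_theta(tau)), evaluated stably.

    reward:         r_psi(tau)                    (scalar or array)
    log_Z:          estimate of the log partition function (assumed tracked)
    policy_logprob: log pi_theta(tau)
    """
    log_p = reward - log_Z                  # log of (1/Z) exp(r_psi(tau))
    log_q = policy_logprob                  # log pi_theta(tau)
    log_denom = np.logaddexp(log_p, log_q)  # log of the denominator
    return log_p - log_denom, log_q - log_denom   # (log D, log(1 - D))
```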

From here, we can use the standard GAN objective for our discriminator,

$$\mathcal{L}_D(\psi) = \mathbb{E}_{\tau \sim \pi^*}\big[-\log D_\psi(\tau)\big] + \mathbb{E}_{\tau \sim \pi_\theta}\big[-\log\big(1 - D_\psi(\tau)\big)\big],$$

and optimize the policy with the gradient

$$\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[\log\big(1 - D_\psi(\tau)\big) - \log D_\psi(\tau)\big].$$
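
In code, the two objectives are just the binary cross-entropy loss for the discriminator and a surrogate reward for the policy: descending the gradient above is the same as maximizing $\log D_\psi(\tau) - \log(1 - D_\psi(\tau))$ per trajectory. A brief sketch with illustrative function names:

```python
import numpy as np

def discriminator_loss(log_D_expert, log_one_minus_D_policy):
    """Standard GAN discriminator loss: expert trajectories labeled real (1),
    policy trajectories labeled fake (0)."""
    return -(np.mean(log_D_expert) + np.mean(log_one_minus_D_policy))

def policy_surrogate_reward(log_D, log_one_minus_D):
    """Per-trajectory quantity the policy maximizes; its negative is the
    term inside the expectation whose gradient is taken above."""
    return log_D - log_one_minus_D
```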

Footnotes

  1. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization (Finn et al., 2016) ↩

  2. A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models (Finn et al., 2016) ↩