Behavioral cloning is an Imitation Learning technique that trains a policy to mimic expert behavior by treating the demonstrations as a supervised learning dataset of observation-action pairs.
One significant problem with this approach is that small errors compound over time: once the policy makes a mistake, it drifts into states that the training data never covered, where its behavior is even less reliable, pushing the trajectory further and further from the expert's.
Data Robustness
To make the policy more robust, we can introduce imperfections into our data, which allows the model to learn actions that correct for possible mistakes. There are a variety of approaches to do this.
- Some experiments naturally lend themselves to enhanced data collection procedures. A common example for vision observations is to capture the front, left, and right views; the left and right views are annotated with actions that correct for the tilt (such as "turn right" and "turn left," respectively).
- DAgger iteratively improves the dataset with expert-annotated samples from the policy's own experience, thus allowing it to learn from past mistakes (a minimal sketch of the loop follows this list).
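The following is a minimal sketch of the generic DAgger loop, assuming hypothetical `policy`, `expert`, and `env` objects; their `fit`, `act`, `label`, `reset`, and `step` methods are placeholders for illustration, not any particular library's API.

```python
# Hypothetical DAgger loop: the dataset grows with states visited by the
# learned policy, each labeled with the expert's corrective action.
def dagger(policy, expert, env, demos, n_iters=10, horizon=200):
    dataset = list(demos)                 # start from the expert demonstrations
    for _ in range(n_iters):
        policy.fit(dataset)               # 1. train on the current dataset
        obs = env.reset()
        for _ in range(horizon):          # 2. roll out the *learned* policy
            action = policy.act(obs)
            dataset.append((obs, expert.label(obs)))  # 3. expert labels visited states
            obs, done = env.step(action)
            if done:
                break
    policy.fit(dataset)                   # final training pass on the aggregated data
    return policy
```

The key point is step 3: the states come from the learner's own distribution, so the dataset covers the mistakes the policy actually makes.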
Accurate Mimicry
Another remedy for drifting is to reduce the model's error in the first place. If we don't make mistakes, we won't deviate from the expected trajectory.
However, two properties of human behavior make accurate mimicry difficult.
- First, our actions generally exhibit non-Markovian behavior. Thus, our policy requires not just the current observation but also past ones, i.e., $\pi_\theta(a_t \mid o_1, \dots, o_t)$. One common way to implement this is via a Recurrent Neural Network reading in one observation at a time in sequence.
- Second, our choice of action is usually multimodal (and sometimes random). Our predictions must account for this by either outputting a mixture of Gaussians, basing the action on random latent variables, or using autoregressive discretization, which samples discrete values for each output dimension one by one (a sketch combining a recurrent policy with a mixture-of-Gaussians head follows below).
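Here is a minimal PyTorch sketch of such a policy: a GRU consumes the observation history and a mixture-of-Gaussians head models a multimodal action distribution. The class name, layer sizes, and mixture count are my own illustrative choices, not from the original notes.

```python
import torch
import torch.nn as nn

class RecurrentMoGPolicy(nn.Module):
    """GRU over the observation history; the head parameterizes a
    mixture of Gaussians over the action at each timestep."""
    def __init__(self, obs_dim, act_dim, hidden=128, n_modes=5):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.logits = nn.Linear(hidden, n_modes)               # mixture weights
        self.means = nn.Linear(hidden, n_modes * act_dim)      # per-mode means
        self.log_stds = nn.Linear(hidden, n_modes * act_dim)   # per-mode log-stds
        self.n_modes, self.act_dim = n_modes, act_dim

    def forward(self, obs_seq):
        # obs_seq: (batch, time, obs_dim) -> action distribution for each timestep
        h, _ = self.rnn(obs_seq)
        B, T, _ = h.shape
        mix = torch.distributions.Categorical(logits=self.logits(h))
        comp = torch.distributions.Independent(
            torch.distributions.Normal(
                self.means(h).view(B, T, self.n_modes, self.act_dim),
                self.log_stds(h).view(B, T, self.n_modes, self.act_dim).exp(),
            ),
            1,
        )
        return torch.distributions.MixtureSameFamily(mix, comp)

# Behavioral cloning step: maximize the log-likelihood of the expert actions.
policy = RecurrentMoGPolicy(obs_dim=16, act_dim=4)
obs = torch.randn(8, 20, 16)      # dummy (batch, time, obs_dim) observations
acts = torch.randn(8, 20, 4)      # matching expert actions
loss = -policy(obs).log_prob(acts).mean()
loss.backward()
```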
Note
One danger in modeling non-Markovian behavior is causal confusion. If our model mistakes correlation for causation, it will fail at test time. For example, if a camera sees a car's brake light turn on during braking, the model might attribute the braking to the brake light rather than to the obstacle visible through the windshield.
Goal-Conditioned Behavioral Cloning
If our training data has multiple successful outcomes, it might be difficult to train the model for each outcome. One solution is to condition the policy on the desired outcome as well, i.e., train $\pi_\theta(a_t \mid o_t, g)$ where $g$ is the goal, so a single model can be asked for any of the demonstrated outcomes.
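One simple way to implement this, sketched below with my own illustrative names and layer sizes, is to concatenate the goal with the observation before feeding the policy network; the squared-error loss assumes continuous actions.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Feedforward policy that takes the goal as an extra input,
    so one network can be trained on demonstrations of many outcomes."""
    def __init__(self, obs_dim, goal_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, goal):
        # Condition on the goal by concatenating it with the observation.
        return self.net(torch.cat([obs, goal], dim=-1))

# Behavioral cloning loss: regress the expert action, conditioning each
# sample on the outcome its trajectory actually reached.
policy = GoalConditionedPolicy(obs_dim=16, goal_dim=3, act_dim=4)
obs, goal, expert_act = torch.randn(32, 16), torch.randn(32, 3), torch.randn(32, 4)
loss = ((policy(obs, goal) - expert_act) ** 2).mean()
loss.backward()
```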
Cost Analysis
The following is a mathematical analysis of our policy's error, or cost. This provides a more formal justification for our intuitive explanation of the drifting problem.
Assume our supervised learning policy makes a mistake with probability at most $\epsilon$ on states drawn from the training distribution, i.e., $\pi_\theta(a \neq \pi^\star(s) \mid s) \leq \epsilon$ for all $s \sim p_{\text{train}}(s)$.
Let the cost be $c(s_t, a_t) = 0$ if $a_t = \pi^\star(s_t)$ and $1$ otherwise, so the total cost over a horizon of length $T$ counts the number of mistakes.
Consider the distribution over states at time $t$ under our policy: with probability $(1-\epsilon)^t$ the policy has made no mistake yet and is still on the training distribution, and otherwise it has drifted to some other distribution $p_{\text{mistake}}$, giving $p_\theta(s_t) = (1-\epsilon)^t \, p_{\text{train}}(s_t) + \left(1 - (1-\epsilon)^t\right) p_{\text{mistake}}(s_t)$.
Now, consider the Total Variation Distance, the sum of absolute value differences between probabilities, between $p_\theta(s_t)$ and $p_{\text{train}}(s_t)$: $\sum_{s_t} \left| p_\theta(s_t) - p_{\text{train}}(s_t) \right| = \left(1 - (1-\epsilon)^t\right) \sum_{s_t} \left| p_{\text{mistake}}(s_t) - p_{\text{train}}(s_t) \right|$.
In the worst case, $p_{\text{mistake}}$ and $p_{\text{train}}$ share no support, so the distance is at most $2\left(1 - (1-\epsilon)^t\right) \leq 2\epsilon t$
via the identity $(1-\epsilon)^t \geq 1 - \epsilon t$ for $\epsilon \in [0, 1]$.
Finally, the expected value of the cost summed over the horizon is $\sum_{t=1}^{T} \mathbb{E}_{p_\theta(s_t)}[c_t] \leq \sum_{t=1}^{T} \left( \mathbb{E}_{p_{\text{train}}(s_t)}[c_t] + 2\epsilon t \right) \leq \sum_{t=1}^{T} \left( \epsilon + 2\epsilon t \right) = O(\epsilon T^2).$
In the second-to-last step, we use the fact that our model's expected error on states from the training distribution is at most $\epsilon$.
This shows that the expected total cost of our policy scales quadratically with the horizon $T$, which won't work in real-world applications.
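To make the bound concrete, here is a small Monte Carlo sketch (my own illustration, not from the original analysis) of the worst-case scenario the derivation assumes: the policy errs with probability $\epsilon$ while it is still on the training distribution, and once it has drifted it errs on every remaining step. The per-step mistake rate grows with $T$, so the total cost grows faster than linearly.

```python
import random

def episode_mistakes(T, eps):
    """Worst-case model from the analysis: err w.p. eps while on the training
    distribution; after the first mistake, err on every subsequent step."""
    mistakes, drifted = 0, False
    for _ in range(T):
        if drifted or random.random() < eps:
            mistakes += 1
            drifted = True
    return mistakes

eps, trials = 0.01, 20_000
for T in (10, 50, 100, 200):
    avg = sum(episode_mistakes(T, eps) for _ in range(trials)) / trials
    # The per-step rate increases with T, i.e. total cost grows superlinearly.
    print(f"T={T:4d}  avg mistakes={avg:7.2f}  per-step rate={avg / T:.3f}")
```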