Behavioral cloning is an Imitation Learning technique that trains a policy to mimic an expert. The expert generates observation-action pairs $(o_t, a_t)$, essentially converting our problem into a standard supervised learning problem: our policy $\pi_\theta(a_t \mid o_t)$ is trained to predict actions from observations using this dataset.
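
As a sketch of this reduction, assuming a PyTorch setup and a hypothetical `expert_dataset` of observation-action pairs, training is just supervised regression on the expert's actions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Hypothetical dimensions; expert_dataset is a stand-in for real
# (observation, expert_action) pairs collected from the expert.
obs_dim, act_dim = 16, 4
expert_dataset = [(torch.randn(obs_dim), torch.randn(act_dim)) for _ in range(1000)]
loader = DataLoader(expert_dataset, batch_size=64, shuffle=True)

# A small policy network pi_theta(o_t) -> a_t and a standard optimizer.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    for obs, expert_action in loader:
        loss = nn.functional.mse_loss(policy(obs), expert_action)  # imitate the expert
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```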

One significant problem with this approach is that the policy only learns from states the expert encounters. Since the expert performs "correct" actions, the dataset only consists of states reached after such actions. Thus, the dataset has no examples of states after the agent makes a mistake, and once the agent makes a mistake (due to random chance or an incorrect prediction), it will make even larger mistakes. This drifting error is shown below.

Data Robustness

To make the policy more robust, we need to introduce imperfections into our data, which allows the model to learn actions that correct possible mistakes. There are a variety of approaches to do this.

  1. Some experiments naturally lend themselves to enhanced data collection procedures. A common example for vision observations is to capture the front, left, and right views; the left and right views are annotated with actions that will correct for the tilt (such as "turn right" and "turn left," respectively).
  2. 🗡️ DAgger iteratively improves the dataset with annotated samples from the policy's own experience, thus allowing it to learn from past mistakes (see the sketch after this list).
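
Here is a minimal sketch of the DAgger loop; `env`, `expert`, and `train_supervised` are hypothetical helpers standing in for the environment, the expert labeler, and ordinary supervised training.

```python
def dagger(policy, expert, env, train_supervised, num_iters=10, horizon=100):
    """Sketch of DAgger: collect states by running OUR policy, but label
    them with the expert, then retrain on the aggregated dataset."""
    dataset = []  # (observation, expert_action) pairs
    for _ in range(num_iters):
        obs = env.reset()
        for _ in range(horizon):
            dataset.append((obs, expert(obs)))  # expert annotates the visited state
            obs, done = env.step(policy(obs))   # hypothetical env returns (next_obs, done)
            if done:
                break
        policy = train_supervised(policy, dataset)  # behavioral cloning on the aggregate
    return policy
```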

Accurate Mimicry

Another remedy for drifting is to reduce the model's error in the first place. If we don't make mistakes, we won't deviate from the expected trajectory.

However, two properties of human behavior make accurate mimicry difficult.

  1. First, our actions generally exhibit non-Markovian behavior. Thus, our policy requires not just the current observation but also past ones, $\pi_\theta(a_t \mid o_1, \dots, o_t)$. One common way to implement this is via a 💬 Recurrent Neural Network reading in one observation at a time in sequence.
  2. Second, our choice of action is usually multimodal (and sometimes random). Our predictions must account for this by outputting a mixture of Gaussians, basing the action on random latent variables, or using autoregressive discretization (sampling discrete values for each output dimension one by one); a sketch of the first option follows this list.
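
Below is a minimal sketch of the mixture-of-Gaussians option, assuming a PyTorch setup; the layer sizes and number of components are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class MixtureOfGaussiansPolicy(nn.Module):
    """Outputs a K-component Gaussian mixture over actions instead of a single point estimate."""

    def __init__(self, obs_dim, act_dim, num_components=5, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, num_components)              # mixture weights
        self.means = nn.Linear(hidden, num_components * act_dim)     # component means
        self.log_stds = nn.Linear(hidden, num_components * act_dim)  # component spreads
        self.num_components, self.act_dim = num_components, act_dim

    def forward(self, obs):
        h = self.backbone(obs)
        mixture = D.Categorical(logits=self.logits(h))
        means = self.means(h).view(-1, self.num_components, self.act_dim)
        stds = self.log_stds(h).view(-1, self.num_components, self.act_dim).exp()
        components = D.Independent(D.Normal(means, stds), 1)
        return D.MixtureSameFamily(mixture, components)

# Training maximizes the likelihood of the expert's actions:
#   loss = -policy(obs).log_prob(expert_action).mean()
```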

Note

One danger in modeling non-Markovian behavior is causal confusion. If our model mistakes correlation for causation, it will fail at test time. For example, if a camera sees a car's brake light turn on during braking, the model might associate the brake light, rather than the obstacle visible through the window, with the reason for braking.

Goal-Conditioned Behavioral Cloning

If our training data has multiple successful outcomes, it might be difficult to train the model for each outcome. One solution is to condition the policy on our outcome as well, $\pi_\theta(a_t \mid s_t, g)$, where $g$ is a successful (goal) state, so that we can train the model using the entire dataset instead of learning each outcome separately.
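
One simple way to implement this, as a sketch in PyTorch (the network shape is an arbitrary choice), is to feed the goal state to the policy alongside the observation:

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | o, g): one policy shared across every outcome in the dataset."""

    def __init__(self, obs_dim, goal_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, goal):
        # The goal (e.g., the final state of the demonstration the pair came from)
        # is appended to the observation, so all trajectories train one model.
        return self.net(torch.cat([obs, goal], dim=-1))
```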

Cost Analysis

The following is a mathematical analysis of our policy's error, or cost. This provides a more formal justification for our intuitive explanation of the drifting problem.

Assume our supervised learning policy has error $\epsilon$ after training on the training data distribution $p_{\text{train}}(s)$. That is, it incorrectly predicts a training example with probability at most $\epsilon$: $\pi_\theta(a_t \ne \pi^*(s_t) \mid s_t) \le \epsilon$ for $s_t \sim p_{\text{train}}(s_t)$. Furthermore, assume that we're training with base behavioral cloning without any of the enhancements above.

Let $c(s_t, a_t)$ be the cost of our action. It's equal to $0$ if $a_t = \pi^*(s_t)$ and $1$ otherwise.

Consider the state $s_t$ after $t$ steps. If we make no mistakes, our state will be in the training distribution. Otherwise, it'll be in some mistake distribution $p_{\text{mistake}}(s_t)$. We can split our policy's state distribution into these two cases,

$$p_\theta(s_t) = (1-\epsilon)^t \, p_{\text{train}}(s_t) + \left(1 - (1-\epsilon)^t\right) p_{\text{mistake}}(s_t).$$

Now, consider the 👟 Total Variation Distance, the sum of absolute value differences between probabilities, between $p_\theta(s_t)$ and $p_{\text{train}}(s_t)$,

$$\sum_{s_t} \left\lvert p_\theta(s_t) - p_{\text{train}}(s_t) \right\rvert = \left(1 - (1-\epsilon)^t\right) \sum_{s_t} \left\lvert p_{\text{mistake}}(s_t) - p_{\text{train}}(s_t) \right\rvert.$$

In the worst case, $\sum_{s_t} \lvert p_{\text{mistake}}(s_t) - p_{\text{train}}(s_t) \rvert = 2$ since the distributions can be completely flipped. Then, we have

$$\sum_{s_t} \left\lvert p_\theta(s_t) - p_{\text{train}}(s_t) \right\rvert \le 2\left(1 - (1-\epsilon)^t\right) \le 2\epsilon t$$

via the identity $(1-\epsilon)^t \ge 1 - \epsilon t$ for $\epsilon \in [0, 1]$.

Finally, the expected value of the cost over time is the following:

$$\sum_t \mathbb{E}_{p_\theta(s_t)}\!\left[c(s_t, a_t)\right] = \sum_t \sum_{s_t} p_\theta(s_t)\, c(s_t, a_t) \le \sum_t \sum_{s_t} \left( p_{\text{train}}(s_t)\, c(s_t, a_t) + \left\lvert p_\theta(s_t) - p_{\text{train}}(s_t) \right\rvert c_{\max} \right) \le \sum_t \left( \epsilon + 2\epsilon t \right) = O(\epsilon T^2).$$

In the second-to-last step, we use the fact that our model's expected error on $p_{\text{train}}(s_t)$ is at most $\epsilon$, note that $c_{\max} = 1$, and plug in the inequality derived above.

This shows that the expected total cost of our policy scales quadratically with the time horizon $T$, which won't work in real-world applications.
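
As a quick numeric illustration (the values of $\epsilon$ and $T$ below are arbitrary), summing the per-step bound $\epsilon + 2\epsilon t$ shows how the total cost grows on the order of $\epsilon T^2$:

```python
# Sum the per-step bound E[c_t] <= epsilon + 2 * epsilon * t over a horizon T.
epsilon, T = 0.01, 1000
total_bound = sum(epsilon + 2 * epsilon * t for t in range(1, T + 1))
print(total_bound)  # ~10020, on the order of epsilon * T^2 = 10000
```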