# Behavior Cloning
### A simple way to learn a policy
Consider the task of driving. Your observations may consist of images from the car's camera, and your actions may consist of how you turn the steering wheel to keep the car on the road.

Let us start by considering a simple, supervised-learning-style approach: take some labeled data and use it to learn a driving policy. From a person driving the car we record the image they see and the action they take, collect many of these (observation, action) tuples as training data, and use them to fit a model. This is known as **imitation learning** (also called **behavioral cloning**).
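As a concrete illustration, here is a minimal behavior-cloning sketch in PyTorch, assuming observations are RGB images and actions are scalar steering angles; the architecture, shapes, and hyperparameters are illustrative choices, not taken from any particular paper:

```python
import torch
import torch.nn as nn

# Minimal behavior-cloning sketch: a small CNN policy trained with
# supervised regression on expert (image, steering_angle) pairs.
# Shapes and hyperparameters here are illustrative assumptions.

class DrivingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(64), nn.ReLU(),
            nn.Linear(64, 1),          # predicted steering angle
        )

    def forward(self, obs):
        return self.net(obs)

def behavior_cloning(policy, dataloader, epochs=10, lr=1e-3):
    """Plain supervised learning on expert (observation, action) tuples."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for obs, act in dataloader:    # obs: (B, 3, H, W), act: (B, 1)
            loss = loss_fn(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```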

Does this basic recipe work? In general, the answer is *no*! For some intuition, consider an abstract picture of a control problem:

Here the state axis is one-dimensional only so that it is easier to visualize (in general the state is of course multi-dimensional). Let's say that this trajectory represents our training data. We use it to train a policy that maps states $s$ to actions $a$, and then we run that policy. Initially the policy will stay fairly close to the training trajectory (because we are using a large neural network), but it will make some small mistakes. The trouble is that when the model makes a small mistake, it finds itself in a state that is a little different from the states it was trained on. In that state it tends to make a bigger mistake, because it doesn't quite know what to do there. As these mistakes compound, the states drift further and further from the data, and the mistakes get larger and larger:
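To make this intuition concrete, here is a toy 1-D simulation (entirely hypothetical, not from the note): the expert always drives the state back to 0, while the cloned policy is only accurate on states near the training data; once a small mistake pushes it outside that region, its errors accumulate and the state drifts away.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    return -s  # the expert drives the state straight back to 0

def cloned_action(s):
    if abs(s) < 0.5:                     # in-distribution: small mistake
        return expert_action(s) + rng.normal(0.0, 0.2)
    return rng.normal(0.0, 1.0)          # off-distribution: near random

def rollout(policy, T=200):
    s, states = 0.0, []
    for _ in range(T):
        s = s + policy(s)                # simple integrator dynamics
        states.append(s)
    return states

print("expert max drift:", max(abs(s) for s in rollout(expert_action)))
print("cloned max drift:", max(abs(s) for s in rollout(cloned_action)))
```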

Now, in practice we will see that this does sometimes work. For instance, NVIDIA published a paper in 2016 in which they managed to train a car to drive reasonably well. But notice the architecture they used:

Specifically, we see that they used a left, a center, and a right camera. Consider how this trick might mitigate the drifting problem: the left and right images are essentially teaching the policy how to correct small mistakes. If the policy can correct small mistakes, then maybe they won't accumulate as much. This is a special case of a more general principle: errors in the trajectory will compound, but if you can somehow modify your training data so that it *illustrates* small mistakes and the corrections for those mistakes, then perhaps the policy can *learn* those corrections and stabilize:
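A sketch of how this data-augmentation trick might look in code; the steering-correction magnitude and the sign convention are assumptions made for illustration, not values from the NVIDIA paper:

```python
# Side-camera images look like the car has drifted off-center, so we
# label them with a corrective steering action and add them to the
# training set. CORRECTION is a hypothetical offset in steering units.

CORRECTION = 0.25

def augment(samples):
    """samples: iterable of (left_img, center_img, right_img, steer)."""
    augmented = []
    for left_img, center_img, right_img, steer in samples:
        augmented.append((center_img, steer))
        # left-camera view ~ drifted left  -> steer a bit to the right
        augmented.append((left_img, steer + CORRECTION))
        # right-camera view ~ drifted right -> steer a bit to the left
        augmented.append((right_img, steer - CORRECTION))
    return augmented
```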

To derive a more general solution, we can ask: what is the underlying mathematical principle here? The challenge is that when we run our policy, the distribution over observations it sees is different, because the policy takes different actions, which result in different observations. In other words, after some time $p_{\pi_{\theta}}(o_t)$ becomes very different from $p_{data}(o_t)$. Can we fix this? Can we make $p_{\pi_{\theta}}(o_t) = p_{data}(o_t)$?
An idea: what if, instead of being clever about $p_{\pi_{\theta}}(o_t)$ (the distribution induced by our policy), we try to be clever about our data distribution $p_{data}(o_t)$? This is the idea behind **DAgger** (Dataset Aggregation). The goal is to collect training data from $p_{\pi_{\theta}}(o_t)$ instead of $p_{data}(o_t)$: run the current policy, have a human label the observations it actually visits, and add those labeled observations to the dataset, as sketched below.
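A minimal sketch of the DAgger loop; `train_fn`, `rollout_fn`, and `expert_fn` are assumed placeholder callables supplied by the user, not functions defined in the note:

```python
# train_fn(dataset)  -> policy trained on (observation, action) pairs
# rollout_fn(policy) -> list of observations visited by the policy
# expert_fn(obs)     -> the human/expert action for that observation

def dagger(initial_data, train_fn, rollout_fn, expert_fn, n_iters=10):
    dataset = list(initial_data)          # start from human driving data
    policy = None
    for _ in range(n_iters):
        # 1. train pi_theta(a_t | o_t) on the aggregated dataset D
        policy = train_fn(dataset)
        # 2. run pi_theta to get observations drawn from p_{pi_theta}(o_t)
        visited = rollout_fn(policy)
        # 3. ask a human to label each visited observation with an action
        labeled = [(obs, expert_fn(obs)) for obs in visited]
        # 4. aggregate: D <- D  union  D_pi
        dataset.extend(labeled)
    return policy
```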

The issue with DAgger is step 3 of the loop above: it requires asking a human to label every observation the policy visits with the action they would have taken.
### What is the problem with Imitation Learning?
Humans need to provide the data, and deep learning methods tend to work well precisely when data is plentiful, whereas human-provided demonstrations are limited. We also see that naive behavior cloning ends up with the expected error growing *quadratically* with the number of time steps.
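A rough sketch of where that quadratic bound comes from, under the assumption that the trained policy makes a mistake with probability at most $\epsilon$ on states drawn from the training distribution, and that a single mistake can leave it off-distribution (and thus making mistakes) for the rest of the episode:

$$
\mathbb{E}\Big[\sum_{t=1}^{T}\text{mistakes}_t\Big] \;\le\; \epsilon T + \epsilon(T-1) + \dots + \epsilon \;=\; O(\epsilon T^2).
$$

Collecting labels on the states the policy actually visits, as DAgger does, is what brings this back down to $O(\epsilon T)$.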

---
Date: 20211207
Links to: [Reinforcement-Learning-from-Supervised-Perspective](Reinforcement-Learning-from-Supervised-Perspective.md) [Reinforcement Learning (old)](Reinforcement%20Learning%20(old).md)
Tags:
References:
* []()