# Reinforcement Learning from Supervised Perspective

Consider the case of image classification, commonly thought of as a supervised learning problem. Below, we can think of the **observation** at time $t$ as the tiger, our **policy** as the learned deep net, and the **action** as the prediction it makes (in this case, it predicts tiger). In RL we know that the action generally has an impact on the next observation that we see. So in this case, the action of failing to recognize the tiger likely means that at the next time step you will see something undesirable, such as the tiger being closer.

![](Screen%20Shot%202021-12-03%20at%207.28.48%20AM.png)

This basic idea can be extended to learn policies for control. Instead of outputting labels, the model would output something that looks more like an action. It could still be a discrete action:

![](Screen%20Shot%202021-12-03%20at%207.29.25%20AM.png)

It could also be a continuous action. For instance, our policy could output the parameters of some continuous distribution, such as the mean and variance of a multivariate normal (Gaussian) distribution:

![](Screen%20Shot%202021-12-03%20at%207.29.36%20AM.png)

Another term that we will see a lot is the state, as well as the policy conditioned on the state:

![](Screen%20Shot%202021-12-03%20at%207.38.08%20AM.png)

Now, you may ask: what is the distinction between states and observations? [This lecture](https://youtu.be/HUzyjOsd2PA?list=PL_iWQOsE6TfURIIhCrlt-wj9ByIVpbfGc&t=227) does a great job of explaining it, but we can describe it as:

* Below, we see a cheetah and a gazelle. The observation consists of the pixels in the image.
* The state may consist of the position of each animal, as well as their velocities.

![](Screen%20Shot%202021-12-03%20at%207.41.23%20AM.png)

* Now, the observation may be altered in some way that prevents the state from being determined exactly. For instance, if a car drives in front of the cheetah, we may no longer be able to infer its state. However, the state hasn't changed; it is still what it was before. It is just that the pixels in the observation are no longer enough to figure out where the cheetah is.

![](Screen%20Shot%202021-12-03%20at%207.42.24%20AM.png)

We can summarize this as:

* **States** are the true configuration of the system.
* **Observations** are something that *result from* the state, and may or may not be enough to *deduce* the state.

More formally, we can distinguish between states and observations by using **graphical models**:

![](Screen%20Shot%202021-12-03%20at%207.45.58%20AM.png)

![](Screen%20Shot%202021-12-03%20at%207.46.33%20AM.png)

This nicely highlights the [Markov Property](Markov%20Property.md). In reinforcement learning we state that:

$p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_1, a_1) = p(s_{t+1} \mid s_t, a_t)$

In other words, given the current state and action, the next state is *independent* of all previous states and actions. Without the Markov Property we would not be able to formulate optimal policies without considering entire histories. And this allows us to formulate the distinction between states and observations even more succinctly:

> The **state** satisfies the Markov Property; the **observation** does not.

However, what if our policy is conditioned on observations rather than states (as it is above)? We could ask: "Do observations satisfy the Markov Property in the same way? Is the current observation, on its own, sufficient to figure out how to act so as to reach some state in the future?"
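Here is a minimal sketch of this distinction, using a toy version of the cheetah/gazelle setup (the names, dynamics, and numbers are illustrative assumptions, not from the lecture): the state carries each animal's position and velocity, while the observation, like pixels, only carries positions.

```python
import numpy as np

def step_state(state, dt=0.1):
    """Markovian dynamics: the next state depends only on the current state."""
    positions, velocities = state
    return positions + dt * velocities, velocities

def observe(state):
    """The observation drops the velocities, so it under-determines the state."""
    positions, _ = state
    return positions

# Two different states that produce the *same* observation...
state_a = (np.array([0.0, 5.0]), np.array([1.0, -1.0]))   # animals closing in
state_b = (np.array([0.0, 5.0]), np.array([-1.0, 1.0]))   # animals moving apart

assert np.allclose(observe(state_a), observe(state_b))    # identical observations

# ...but different futures: the current observation alone cannot predict the next one.
print(observe(step_state(state_a)))   # [0.1 4.9]
print(observe(step_state(state_b)))   # [-0.1 5.1]
```

Because two states with the same positions but different velocities look identical here, a single observation cannot determine the next observation; a history of observations (from which velocity could be inferred) would be needed.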
The trouble is that, in general, the observation is *not* going to be enough to satisfy the Markov Property: the current observation might not be enough to determine the future without also observing the past. This is most obvious in the example above where the cheetah is blocked by the car. In general, when acting from observations, past observations can carry information beyond the current observation that is useful for decision making.

### [Behavior Cloning](Behavior%20Cloning.md)

---
Date: 20211203
Links to: [Reinforcement Learning (old)](Reinforcement%20Learning%20(old).md) [Machine-Learning](Machine-Learning.md)
Tags:
References:
* [Great lecture from UC Berkeley](https://youtu.be/HUzyjOsd2PA?list=PL_iWQOsE6TfURIIhCrlt-wj9ByIVpbfGc&t=616)