# Reinforcement Learning

### Overview

Reinforcement learning is a computational approach to learning from interaction with an environment. Below we can see the basic loop: our **agent** sends an **action** to the **environment**, from which it then makes an **observation** and receives a **reward** (or potentially suffers a **loss**).

![](Screen-Shot-2021-11-17-at-65109-AM.png)

The fundamental idea behind reinforcement learning is as follows:

> We are constantly mapping **states** to **actions** in order to *maximize* a **reward signal**.

A few challenges in RL are:

* **Search**: The classic [Exploration-Exploitation](Exploration-Exploitation.md) problem. Note: this has many similarities to hill climbing (optimization) in a high dimensional space.
* **Delayed Reward**: Agents must consider more than the immediate reward, because acting greedily may result in less future reward.

### Key Elements

**Policy**
This is the mapping from *states* to *actions*. It defines an agent's behavior. Policies are generally stochastic, meaning that we sample an action from a probability distribution, compared to something like supervised learning where we would take the argmax of the distribution.

**Reward**
This is the signal for the reinforcement learning agent. The agent uses it to determine how to change its policy in order to maximize reward. At each time step the agent sends an *action* to the *environment*, and the environment sends back a new state and a reward.

**Value Function**
The value function assigns values to states. It specifies what is good in the long run, as opposed to the *reward*, which is an immediate signal. The value of a state is the reward the agent can *expect* to accumulate starting from that state. Values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.

**Model**
A model mimics the behavior of the environment. It allows you to do inference about how the environment might behave: given a state and an action, the model might predict the resultant next state and next reward. Models are used for planning, considering future situations before experiencing them. Model-based learners use planning; model-free learners explicitly use trial and error.

**Stochasticity**
Note that in general we want to separate **policy randomness** from **state transition randomness** (see more [here](https://www.nathanieldake.com/AI/01-Reinforcement_Learning-04-Markov-Decision-Processes.html#7.6-Example-6)).

### [Reinforcement Learning vs Supervised Learning](Reinforcement-Learning-vs-Supervised-Learning.md)

### Reward vs. Value

Rewards are given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. The most important component of nearly all RL algorithms is the method for efficiently estimating values and the value function. This is arguably the most important breakthrough in RL over the last six decades.

### Feedback

In a very real sense RL exhibits feedback: the actions we take influence the state we are in, and at the same time, the state we are in influences the actions available to us.

### Papers

1. [Learning-World-Graphs-to-Accelerate-Hierarchical-Reinforcement-Learning](Learning-World-Graphs-to-Accelerate-Hierarchical-Reinforcement-Learning.md)

### Model Based vs. Model Free

Model-based means that we know the transition probabilities of the next state and reward given the current state-action pair, and can use them for planning. Model-free means the agent learns solely through trial and error. A minimal sketch contrasting the two is given below.
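To make the distinction concrete, here is a minimal sketch (my own illustration in Python/NumPy, using a hypothetical 3-state, 2-action MDP; all names and shapes are assumptions, not from the sources above): the model-based learner plans directly with known transition probabilities `P` and rewards `R`, while the model-free learner can only nudge a Q-table from individual sampled transitions.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (names and shapes are illustrative).
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.random.rand(n_states, n_actions)                                 # expected reward r(s, a)

# Model-based: the transition model P and reward R are *known*, so we can plan
# offline, e.g. by sweeping value iteration, without ever acting in the world.
V = np.zeros(n_states)
for _ in range(100):
    V = np.max(R + gamma * P @ V, axis=1)

# Model-free: no access to P or R; the agent can only improve its estimates
# from sampled transitions (s, a, r, s') gathered by trial and error.
Q = np.zeros((n_states, n_actions))
alpha = 0.1

def q_update(s, a, r, s_next):
    # One-step Q-learning update from a single observed transition.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

q_update(0, 1, 0.5, 2)  # example: observed reward 0.5 moving from state 0 to 2 via action 1
```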
### Comparison to Back Propagation

Note that RL shares similar high level goals with [Backpropagation](Backpropagation.md). When estimating the value function, we may know the correct value of the winning and losing states. We then try to propagate this information backwards to states that can directly reach the winning and losing states, then propagate that information backwards again, and so on. This is similar (at a high, intuitive level) to backprop: we have an error signal between our prediction and the true target value, and we try to propagate that signal backwards, layer by layer, to the starting layer of the network.

### [Trees and Reinforcement Learning](Trees-and-Reinforcement-Learning.md)

### Prediction vs. Control Problem

We generally call the act of finding the value function for a given policy the **prediction problem**. Finding the optimal policy, which we will soon learn algorithms for, is known as the **control problem**.

### On Policy vs. Off Policy

![](Screen%20Shot%202021-12-07%20at%208.41.37%20AM.png)

### [Reinforcement Learning from Supervised Perspective](Reinforcement-Learning-from-Supervised-Perspective.md)

### RL (UC Berkeley Overview)

We need to start by identifying a way to determine which actions are good to take given that the agent is in a particular state. Crucially, the objective in RL is not only to take actions that have high rewards *right now*, but also to take actions that will have high rewards *later*. Together, the states, actions, rewards, and transition probabilities define our [Markov Decision Process](Markov-Decision-Process.md). We can then define a **goal**.

![](Screen%20Shot%202021-12-07%20at%207.29.36%20AM.png)
![](Screen%20Shot%202021-12-07%20at%207.31.45%20AM.png)
![](Screen%20Shot%202021-12-07%20at%207.32.05%20AM.png)

We can now touch on a very key idea of RL:

> Reinforcement Learning is all about **maximizing expectations**.

We generally talk about RL in terms of choosing actions that lead to high rewards, but we are really always concerned with *expected* rewards. The interesting thing about expected values is that they can be continuous in the parameters of the corresponding distributions, even when the function we are taking the expectation of is itself highly discontinuous.

![](Probability%20(models%20and%20inequalities)%209.png)

This is incredibly important for understanding why RL can use **smooth optimization** methods like **gradient descent** to optimize objectives that are seemingly non-differentiable, like binary rewards for winning or losing a game. For instance, consider driving down a road, where the reward is +1 for staying on the road and -1 for going off. The reward function here appears to be discontinuous, and if you try to optimize it with respect to the position of the car, that optimization problem can't really be solved with gradient based methods, because the reward is not a continuous (much less differentiable) function of the car's position. However, suppose the outcome (fall off the road or don't) is a Bernoulli random variable with parameter $\theta$, so that the car falls off the road with probability $\theta$. Now, the interesting thing is that the expected value of the reward is *smooth* in $\theta$!

![Expected-value-smooth-in-theta|300](Screen%20Shot%202021-12-07%20at%207.59.21%20AM.png)

So we now have a perfectly smooth and perfectly differentiable (in $\theta$) function. This property demonstrates why RL can optimize seemingly non-smooth and sparse reward functions. Again, the reason for this is:

> Expected values of non-smooth and non-differentiable functions, under differentiable and smooth probability distributions, are themselves smooth and differentiable.
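As a small numerical illustration of this point (my own sketch, not from the lecture; the `reward` and `expected_reward` helpers are hypothetical): the reward below is a discontinuous ±1 function of the outcome, yet its expectation under a Bernoulli($\theta$) distribution is $E[r] = \theta \cdot (-1) + (1 - \theta) \cdot (+1) = 1 - 2\theta$, which is linear and hence perfectly smooth in $\theta$.

```python
import numpy as np

def reward(fell_off: bool) -> float:
    # Discontinuous reward: -1 if the car went off the road, +1 otherwise.
    return -1.0 if fell_off else 1.0

def expected_reward(theta: float, n_samples: int = 100_000) -> float:
    # Monte Carlo estimate of E[reward] when "fall off" ~ Bernoulli(theta).
    falls = np.random.rand(n_samples) < theta
    return np.mean([reward(f) for f in falls])

# The estimates trace out the smooth line 1 - 2*theta, even though
# reward() itself is a step function of the outcome.
for theta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"theta={theta:.2f}  E[r]~{expected_reward(theta):+.3f}  exact={1 - 2*theta:+.2f}")
```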
### [Reinforcement Learning Algorithms](Reinforcement-Learning-Algorithms.md)

### Q-Values

Q-values are our current estimates of the sum of future rewards. $Q$ is defined as a function of a state-action pair, $Q(s, a)$. That is, Q-values estimate how much additional reward we can accumulate through all remaining steps in the current episode if the agent is in state $s$ and takes action $a$. Q-values therefore increase as the agent gets closer and closer to the highest reward. The Q function can represent the policy that an agent should follow: given a state, take the action that maximizes $Q$.

### Temporal Differences

Temporal differences provide us with a method of calculating how much the Q-value for the action taken in the previous state should be changed, based on what the agent has learned about the Q-values for the current state's actions. Previous Q-values are therefore updated after each step. Below, we can see that the TD target is based on our *immediate reward*, $r_t$, and incorporates the expectation of future rewards via $Q$ (discounted by $\gamma$).

![](Screen%20Shot%202021-12-07%20at%208.55.35%20AM.png)
![](Screen%20Shot%202021-12-07%20at%208.59.45%20AM.png)

### RL Review (Steve Brunton)

> The **value** of a **state** $s$ given a **policy** $\pi$ is the *expectation* of how much **reward** we will get in the future if we start in state $s$ and enact policy $\pi$.

> The goal in reinforcement learning is to **optimize your policy** in order to **maximize future rewards**.

> Our environment is considered to have a random component, which we model via a [Markov Decision Process](Markov-Decision-Process.md). This simply means that if we are in state $s$ now and take action $a$ now, there is some probability of moving to a new state $s'$ at the next time step. We could go to multiple different states, but which one in particular we go to is a stochastic process.

> The **Credit Assignment Problem** refers to the fact that because our rewards are often sparse and infrequent, it is very hard to tell which action sequence was responsible for getting that reward. We can imagine playing a great game of chess, making one bad move, and then losing. It was the one bad move that lost us the game, but how do we ensure that we don't associate the rest of the game's actions with losing? This is *the central challenge* in reinforcement learning.

We can have **dense** or **sparse** rewards. In the case of dense rewards, the credit assignment problem can be solved faster. If we have sparse rewards, RL is very *sample inefficient*; meaning, if we only get sparse rewards, we will have to play many, many times, gathering tons of examples, in order to learn a good policy from those sparse rewards.

> All of machine learning and all of control theory are **optimization problems**. In the case of machine learning you solve them with data. In the case of control you solve them subject to the constraints of the dynamics. Reinforcement learning is no different; it is also a big optimization problem.

A few approaches to this optimization problem are (see the sketch after this list):

> * **Monte Carlo** - Try a bunch of things randomly, sampling the space of possible trajectories, and see which ones are good.
> * **Temporal Difference** - Model free; similar to Monte Carlo, but combined with dynamic-programming-style bootstrapping.
> * **Bellman** - Pioneer of optimal control theory and dynamic programming.
> * **Exploration** vs. **Exploitation**
> * **Policy Iteration** - You set up a dynamical system where, based on your rewards, you iteratively update the system to make it better and better over time using information from new rewards. To do this you can use **gradient descent**, **evolutionary optimization**, or **simulated annealing**, as well as modern tools such as **stochastic gradient descent** and **Adam**.
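Here is a minimal sketch of the first two approaches (my own illustration in Python on a hypothetical three-step episode; the episode data and value tables are assumptions, not from the lecture): the Monte Carlo update waits for the full discounted return of the episode, while the TD(0) update bootstraps from the current value estimate of the next state.

```python
from collections import defaultdict

gamma, alpha = 0.9, 0.1
V_mc = defaultdict(float)   # value table updated with Monte Carlo returns
V_td = defaultdict(float)   # value table updated with TD(0)

# Hypothetical episode: list of (state, reward received after leaving it).
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]

# Monte Carlo: update each state toward the full discounted return G.
G = 0.0
for state, reward in reversed(episode):
    G = reward + gamma * G
    V_mc[state] += alpha * (G - V_mc[state])

# TD(0): update each state toward r + gamma * V(next state), bootstrapping
# from the current estimate instead of waiting for the episode to finish.
for (state, reward), (next_state, _) in zip(episode, episode[1:] + [(None, 0.0)]):
    target = reward + gamma * (V_td[next_state] if next_state is not None else 0.0)
    V_td[state] += alpha * (target - V_td[state])

print(dict(V_mc), dict(V_td))
```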
> In most methods we learn the **policy** and the **value function** separately. In **Q-Learning** we can learn them both at the same time. In Q-Learning we have a function $Q(s, a)$, which tells us the quality of the state-action pair. Note that Q-Learning specifically assumes that we always take the best action available in the future. So, assuming we always take the best action that we can in the future, $Q(s, a)$ is the quality of the current state-action pair. To say this again: given a state $s$ and an action $a$, and assuming that we do the best thing we can in the future, what is the quality of being in that state and taking that action? This is great because if we know the quality function, once we find ourselves in a state $s$, we just need to look across all actions and choose the $a$ that maximizes $Q$ for that particular $s$, then enact that action. If we do that in the future we will maximize the value. Note that this function can actually be learned with deep nets! A minimal tabular sketch of the Q update is given at the end of this section.

> **Hindsight replay** allows us to be much more data efficient and learn much harder tasks. See more [here](https://youtu.be/0MNVhXEX9to?t=1439).

> Deep Reinforcement Learning is when we augment our standard RL picture to use deep nets in some way. This can be done by replacing our policy with a deep network. In this case our policy, $\pi$, is parameterized by $\theta$, and it maps the current state to the best probabilistic action to take in that environment. Again, the name of the game is to update the policy to maximize future rewards.

![](_1-33%20screenshot.png)
![](_2-1%20screenshot%201.png)

> Another area of deep learning applied to RL is Deep Q-Learning. Here, deep nets are used to learn $Q$. Once you have learned $Q$, if you find yourself in state $s$ you can just look up the $a$ that gives you the best possible quality for that state $s$.

> Note that Q-Learning is very intuitive. Imagine a person learning to play chess. They would learn a policy (i.e. these are the moves I will make in this situation, these are the moves I will make in another situation, etc.), while also learning a value function (i.e. how do we value different board positions, how do we gauge the strength of our position). Now, these quality functions ($Q$) may be very complex functions of $s$ and $a$, and that is exactly what neural networks are good at!

> DRL still suffers from the problems of regular RL, such as the credit assignment problem.
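Tying the Q-Learning discussion together, here is a minimal tabular sketch (my own illustration in Python on a hypothetical 5-state chain environment, not from the lecture): each update moves $Q(s, a)$ toward $r + \gamma \max_{a'} Q(s', a')$, i.e. it assumes the best available action is taken in the future, and the greedy policy is then read off the learned table.

```python
import random

n_states, n_actions = 5, 2           # hypothetical chain: action 0 = left, 1 = right
gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    # Toy dynamics: reaching the right end of the chain pays +1 and ends the episode.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

def greedy(s):
    # Greedy action with respect to Q, breaking ties randomly.
    best = max(Q[s])
    return random.choice([a for a in range(n_actions) if Q[s][a] == best])

for _ in range(500):                 # episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the current Q, occasionally explore.
        a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
        s_next, r, done = step(s, a)
        # Q-Learning update: bootstrap with the *best* action in the next state.
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next

print([greedy(s) for s in range(n_states)])  # should prefer action 1 (right) in non-terminal states
```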
---
Date: 20211112
Links to: [AI MOC](AI%20MOC.md), [Deep-Reinforcement-Learning](Deep-Reinforcement-Learning.md)
Tags:
References:
* [Intro to Reinforcement Learning, Henry AI Labs](https://www.youtube.com/watch?v=4SLGEq_HZxk&list=PLnn6VZp3hqNvRrdnMOVtgV64F_O-61C1D&index=1)
* [Great intro course, combined with DL](https://www.youtube.com/watch?v=JHrlF10v2Og&list=PL_iWQOsE6TfURIIhCrlt-wj9ByIVpbfGc)
* [Steve Brunton RL: ML meets Control Theory](https://www.youtube.com/watch?v=0MNVhXEX9to&t=658s)