# Gradient Boosting

The general idea behind boosting is that we iteratively train a sequence of simple learners, each one focusing on the samples the current ensemble handles worst. In classic boosting (e.g. AdaBoost) this is done by giving incorrectly predicted samples higher weights when computing the loss, so that subsequent learners prioritize them; in gradient boosting, each new learner is instead fit to the residuals (the negative gradient of the loss). A great way to think about gradient boosting is to consider how it [compares to gradient descent](Gradient%20Descent%20vs%20Gradient%20Boosting.md).

### Gradient Boost Regression

Gradient Boost for regression operates as follows:

1. Create an initial leaf node that is simply the average $y$ value across the entire data set. This is the initial prediction.
2. Build a tree (typically with 8 to 32 leaves) that attempts to predict the *residuals*, i.e. the difference between the true $y$ values of our data points and our current prediction (initially just the average).
3. Update our predictions via $\hat{y} = \hat{y} + \alpha \times F(x)$, where $\hat{y}$ is our previous prediction (initially the average), $\alpha$ is the learning rate and $F$ is the tree we just learned.
4. Why do we scale the tree by the learning rate? It allows us to **take a small step in the right direction** (this should feel VERY SIMILAR to gradient descent). We then repeat steps 2 and 3, each time fitting a new tree to the updated residuals.

A bit more technically, we can define this as (a code sketch of these steps follows after the list):

* **Input**: Data $\{(x_i, y_i)\}_{i=1}^n$ and a *differentiable* **loss function** $L(y_i, F(x))$
* **Step 1:** Initialize the model with a constant value: $F_0(x) = \arg \min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)$
* **Step 2:** Loop over $m = 1, \dots, M$ to build all $M$ trees ($M$ is frequently 100; $m$ refers to a single tree):
    1. Compute the negative derivative of the loss function with respect to the predicted value: $r_{im} = - \Big[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \Big]_{F(x) = F_{m-1}(x)}$. When the loss function is RSS (squared error), this simplifies to the **residuals**. In other words, the expression above just says to calculate the residuals. $F_{m-1}$ is the previous model, and $r_{im}$ is the residual for sample $i$ at iteration $m$.
    2. Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j=1,\dots,J_m$. This simply means that we build a regression tree to predict the residual values rather than the $y$ values themselves. We label the leaves $R_{jm}$, where $j$ is the index of the leaf and $m$ is the tree it belongs to.
    3. For each leaf $j = 1, \dots, J_m$, compute an output $\gamma_{jm}$: $\gamma_{jm} = \arg \min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$. Here we take the previous prediction into account. The condition $x_i \in R_{jm}$ means that only the samples that land in that particular leaf are used in calculating its output value, $\gamma_{jm}$. If we are using RSS as our loss, each $\gamma_{jm}$ will simply be the average of the residuals in that leaf.
    4. Make a new prediction for each sample, i.e. update: $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$. Note: within a single tree each sample lands in exactly one leaf, so the summation with the indicator function simply selects the $\gamma_{jm}$ of the leaf that $x$ falls into. $\nu$ is the learning rate; it reduces the effect that each individual tree has on the final prediction, which improves accuracy in the long run. The update knows the right way to move because $r_{im}$ is the *negative gradient* of the loss: fitting a tree to it and adding that tree to $F_{m-1}$ moves each prediction a small step in the direction that decreases the loss, just like a gradient descent step (but taken in function space).
* **Step 3**: Output $F_M(x)$
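To make the steps above concrete, here is a minimal sketch for the squared-error (RSS) loss, where the negative gradient is just the residual and each leaf's $\gamma_{jm}$ is the mean residual in that leaf. The function names (`fit_gradient_boost`, `predict_gradient_boost`) are hypothetical, scikit-learn's `DecisionTreeRegressor` is assumed as the base learner, and the defaults ($M = 100$, $\nu = 0.1$, at most 32 leaves) just mirror the numbers mentioned above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # small regression trees as base learners


def fit_gradient_boost(X, y, M=100, learning_rate=0.1, max_leaf_nodes=32):
    """Gradient boosting for regression with squared-error loss (a sketch)."""
    # Step 1: F_0(x) is the constant that minimizes the loss;
    # for squared error that is simply the mean of y.
    F0 = float(np.mean(y))
    F = np.full(len(y), F0)  # current predictions F_{m-1}(x_i)
    trees = []
    for m in range(M):
        # Step 2.1: negative gradient of the squared-error loss = residuals r_im
        residuals = y - F
        # Step 2.2: fit a small tree to the residuals (its leaves are the R_jm)
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residuals)
        # Step 2.3: for squared error, gamma_jm is the mean residual in leaf j,
        # which is exactly what the fitted tree predicts for points in that leaf.
        # Step 2.4: F_m(x) = F_{m-1}(x) + nu * gamma_jm for the leaf x falls into
        F = F + learning_rate * tree.predict(X)
        trees.append(tree)
    return F0, trees


def predict_gradient_boost(X, F0, trees, learning_rate=0.1):
    """Step 3: output F_M(x) = F_0 + nu * sum of the trees' leaf outputs."""
    pred = np.full(X.shape[0], F0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Usage (assuming X_train, y_train, X_test are numpy arrays):
# F0, trees = fit_gradient_boost(X_train, y_train)
# y_pred = predict_gradient_boost(X_test, F0, trees)
```

Because squared error is used here, the separate $\gamma_{jm}$ line search collapses into the tree's own leaf means; for a different loss (e.g. absolute error) each leaf value would instead be the $\arg\min$ of that loss over the samples in the leaf.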
### Compared to [AdaBoost](AdaBoost.md)

* AdaBoost starts by building a stump to predict the $y$ values for each point. The amount of say (weight) that each stump has is based on how well it compensated for the previous errors. AdaBoost then builds the next stump based on the errors the previous stump made, and continues making stumps in this fashion until it has made the number of stumps asked for, or has a perfect fit.
* In contrast, Gradient Boost starts by making a *single leaf*, instead of a tree or stump. This leaf represents an initial guess for all points. When the dependent variable we are trying to predict ($y$) is continuous, the first guess is the average value. Unlike AdaBoost, Gradient Boost usually builds trees that are larger than a stump (though it still restricts the size of the tree, generally allowing 8 to 32 leaves).
* Like AdaBoost, Gradient Boost scales the trees. However, it scales them all by the same amount (the learning rate), rather than giving each tree its own amount of say.

### General Boosting Intuition

Frequently in statistics, if we have a model with a large number of parameters, we try to optimize them *jointly* (see [here](https://youtu.be/wPqtzj5VZus?t=1145)). Neural networks are like that as well: we optimize all parameters jointly. In Boosting, however, we optimize the parameters in a stagewise fashion: each time, we optimize the parameters of the next tree while all the others are held fixed. This is known as stagewise fitting, and it slows down the rate at which you overfit because the algorithm never goes back and adjusts what it did in the past (i.e. it doesn't go back and change previous trees). Using a shrinkage factor, $\nu$, keeps the algorithm from using up all of the data in big chunks; instead, it gets many opportunities to fit to the residuals.

To summarize:

> Boosting can be thought of as minimizing the loss function in a stagewise fashion.

---
Date: 20211116
Links to: [003-Data-Science-MOC](003-Data-Science-MOC.md)
Tags:
References:
* [Good youtube intro](https://www.youtube.com/watch?v=en2bmeB4QUo)
* [My blog post](https://www.nathanieldake.com/Machine_Learning/06-Ensemble_methods-05-AdaBoost.html)
* Chapter 10, Elements of Statistical Learning
* [Gradient boost, Josh Starmer part 1](https://www.youtube.com/watch?v=3CC4N4z3GJc)
* [Gradient boost, Josh Starmer part 2](https://www.youtube.com/watch?v=2xudPOBz-vs)
* [Original paper](https://jerryfriedman.su.domains/ftp/stobst.pdf)
* [Another good video](https://www.youtube.com/watch?v=en2bmeB4QUo&t=657s)
* [Great lecture by Trevor Hastie](https://www.youtube.com/watch?v=wPqtzj5VZus&t=315s)