# Gradient Boosting
The general idea of boosting is that we iteratively train a bunch of simple learners, each one focusing on the samples the model is currently getting wrong. In classic boosting (e.g. AdaBoost), incorrectly predicted samples get higher weights in the loss, so subsequent learners prioritize them; in *gradient* boosting, each new learner is instead fit to the residuals (the negative gradient of the loss) of the current model.
A great way to think about gradient boosting is thinking about how it [compares to gradient descent](Gradient%20Descent%20vs%20Gradient%20Boosting.md).
### Gradient Boost Regression
Gradient Boost for regression operates as follows:
1. Create an initial leaf node that is simply the average $y$ value across the entire data set. This is the initial prediction.
2. Build a small tree (typically with 8 - 32 leaves) that attempts to predict the *residuals*, i.e. the differences between the true $y$ values of our data points and our current prediction (initially, the average).
3. Then, update our predictions via $\hat{y} = \hat{y} + \alpha \times F(x)$, where $\hat{y}$ on the right is our previous prediction (initially, just the average), $\alpha$ is the learning rate and $F$ is the newly learned tree.
4. Why do we scale our tree by the learning rate? This allows us to **take a small step in the right direction** (this should feel VERY SIMILAR to gradient descent). A minimal sketch of this loop is given right after this list.
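Below is a minimal Python sketch of the loop described above, assuming squared-error loss and using scikit-learn's `DecisionTreeRegressor` as the base learner. The function names, data variables (`X`, `y`) and hyperparameter defaults are illustrative, not part of any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boost(X, y, n_trees=100, learning_rate=0.1, max_leaf_nodes=32):
    """Minimal gradient boosting for regression with squared-error loss."""
    # Step 1: the initial "leaf" is just the average of the targets.
    initial_prediction = np.mean(y)
    predictions = np.full(len(y), initial_prediction)
    trees = []

    for _ in range(n_trees):
        # Step 2: fit a small tree to the residuals of the current predictions.
        residuals = y - predictions
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residuals)

        # Step 3: take a small step in the direction suggested by the new tree.
        predictions = predictions + learning_rate * tree.predict(X)
        trees.append(tree)

    return initial_prediction, trees

def predict_gradient_boost(X, initial_prediction, trees, learning_rate=0.1):
    # Final prediction = initial leaf + scaled contribution of every tree.
    y_hat = np.full(X.shape[0], initial_prediction, dtype=float)
    for tree in trees:
        y_hat += learning_rate * tree.predict(X)
    return y_hat
```

Because the loss here is squared error, fitting a tree directly to the residuals already gives each leaf the average residual of the samples falling into it, which is exactly the leaf output $\gamma_{jm}$ computed in the formal algorithm below.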
A bit more technically, we can define this as:
* **Input**: Data $\{(x_i, y_i)\}_{i=1}^n$ and a *differentiable* **loss function** $L(y_i, F(x))$
* **Step 1:** Initialize model with a constant value:
$F_0(x) = \arg \min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)$
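For example, with the squared-error loss $L(y_i, \gamma) = \frac{1}{2}(y_i - \gamma)^2$, setting the derivative with respect to $\gamma$ to zero gives
$\sum_{i=1}^n (y_i - \gamma) = 0 \implies F_0(x) = \frac{1}{n}\sum_{i=1}^n y_i$
i.e. exactly the initial average-value leaf from the informal description above.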
* **Step 2:** Loop to build all $M$ trees ($M$ is frequently 100; $m$ refers to a single tree)
1. Compute the negative derivative of the loss function with respect to the predicted value (the *pseudo-residuals*):
$r_{im} = - \Big[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \Big]_{F(x) = F_{m-1}(x)}$
When the loss function is RSS (more precisely, $\frac{1}{2}(y_i - F(x_i))^2$), this simplifies to just being the **residuals** $y_i - F_{m-1}(x_i)$. In other words, the above expression is saying to just calculate the residuals. $F_{m-1}$ is just the model from the previous iteration, and $r_{im}$ is simply the residual for a specific sample $i$ for learner $m$.
2. Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j=1,\dots,J_m$. This simply means that we build a regression tree to predict the residual values instead of the original $y$ values. We label the leaves $R_{jm}$, where $j$ is the index of the leaf and $m$ is the tree it corresponds to.
3. For each leaf $j = 1, \dots, J_m$ compute an output $\gamma_{jm}$:
$\gamma_{jm} = \arg \min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$
Here we are taking the previous prediction into account. The sum over $x_i \in R_{jm}$ means that only the samples that end up in that particular leaf are used in calculating its output value, $\gamma_{jm}$. If we are using RSS as our loss, our $\gamma_{jm}$ works out to simply be the average of the residuals in that leaf.
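To see why, plug the squared-error loss into the expression above:
$\gamma_{jm} = \arg \min_{\gamma} \sum_{x_i \in R_{jm}} \frac{1}{2}\big(y_i - F_{m-1}(x_i) - \gamma\big)^2$
Setting the derivative with respect to $\gamma$ to zero gives $\sum_{x_i \in R_{jm}} \big(y_i - F_{m-1}(x_i) - \gamma\big) = 0$, so $\gamma_{jm}$ is the average of the residuals $y_i - F_{m-1}(x_i)$ over the samples in leaf $R_{jm}$.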