> Backpropagation has become the bread-and-butter mechanism of Deep Learning. Researchers discovered that one can employ any computation layer in a solution, with the only requirement being that the layer is differentiable. Said differently, one must be able to calculate the gradient of the layer. In plainer terms: in the game of ‘hotter’ and ‘colder’, *the verbal hints must accurately reflect the distance between the blindfolded player and their objective*.
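As a rough sketch of what "any differentiable layer" means in practice (the class and method names here are my own, not from any particular framework), a layer only needs a forward pass and a backward pass that returns the gradient with respect to its input:

```python
import numpy as np

class DenseLayer:
    """A toy fully connected layer: it can sit anywhere in a network because it is differentiable."""

    def __init__(self, n_in, n_out):
        rng = np.random.default_rng(0)
        self.w = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, a_in):
        self.a_in = a_in                  # cache the input for the backward pass
        return a_in @ self.w + self.b

    def backward(self, grad_out):
        # Gradient of the cost with respect to this layer's parameters ...
        self.grad_w = np.outer(self.a_in, grad_out)
        self.grad_b = grad_out
        # ... and with respect to its input: the "hint" handed to the layer below.
        return grad_out @ self.w.T
```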
### 3b1b
Here is some beautiful intuition. Consider a neural network that, for one particular training example, is trying to classify a digit as a 2. It starts by predicting that our 2 is in fact a 3:

Our goal is for the network to update its parameters in such a way that the final node value for 2 increases, while all others decrease. So, backpropagation starts by looking at each node in the output layer, at how the final hidden layer feeds into it, and then noting how each output node would like the hidden layer to change so as to move it in the right direction. For instance, we see that the output node representing a 2 may desire the activations of the hidden layer to move as follows:

Again, we want to keep track of all of the output nodes' desires. All nodes other than the 2 want their final value to decrease, so as to help minimize our final cost function. We can "keep track" of all of our nodes' final desires, shown below:

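In gradient terms, this bookkeeping is just a weighted sum: each output node's wish for the hidden activations is scaled by the weights connecting them, and the ten wish lists are added together. A toy sketch with made-up numbers (not the actual network from the video), ignoring the exact form of the cost function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values for a single training example.
rng = np.random.default_rng(1)
a_hidden = np.array([0.4, 0.9, 0.1])            # final hidden layer activations
w_out = rng.normal(scale=0.5, size=(3, 10))     # hidden -> output weights
a_out = sigmoid(a_hidden @ w_out)               # the ten output activations

target = np.zeros(10)
target[2] = 1.0                                 # we want the "2" node high, the rest low

# How much each output node wants its own value to move (up for the 2, down for the rest).
delta_out = target - a_out

# Each output node's desired nudge to the hidden activations is weighted by the
# connecting weights; summing over all ten nodes gives the combined wish list.
desired_hidden_nudge = w_out @ delta_out        # one entry per hidden node
print(desired_hidden_nudge)
```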
Now, again, we *do not* have the ability to directly change the activations in the final hidden layer. They are *directly dependent* on the weights, biases and activations of the previous hidden layer. But notice the **recursive structure** that is starting to present itself! At first we only knew what the *final output layer desired*. We used that to determine what changes needed to occur in our final hidden layer to achieve it. We can *recursively apply* this type of process! We can say: "we know what our final hidden layer wants; what needs to change about the activations of the second-to-last hidden layer to achieve this?"
By adding together all of the desired effects, we get a list of nudges that we want to happen to our final hidden layer. This lets us recursively apply the same process to the weights and biases that determine those values.
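A sketch of that recursion, assuming each layer exposes a `backward` method like the toy `DenseLayer` above (again, my own interface, not a standard one): each layer turns the wish list for its output into a wish list for its input, which is the previous layer's output.

```python
def backpropagate(layers, grad_cost_wrt_output):
    """Walk the layers back to front, handing each layer's wish list to the one before it."""
    desire = grad_cost_wrt_output
    for layer in reversed(layers):
        # backward() records the nudges to this layer's own weights and biases,
        # and returns the desired change to its input (the previous layer's activations).
        desire = layer.backward(desire)
    return desire
```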

In terms of mechanics, let's just be clear about something. If we want to increase the final output value of the 2 node, we can:
* Increase the bias for that node, $b$
* Increase each weight $w_i$, in proportion to its corresponding activation $a_i$
* Increase each activation $a_i$ of the previous layer, in proportion to its corresponding weight $w_i$

It is this third way of increasing the 2's activation, namely increasing the $a_i$ in proportion to the $w_i$, that we cannot act on directly: those activations are themselves determined by the previous layer's weights, biases and activations, which is exactly what sends the process one layer further back.
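These three options are just the partial derivatives of the node's weighted input before the activation function, $z = \sum_i w_i a_i + b$:

$$
\frac{\partial z}{\partial b} = 1, \qquad
\frac{\partial z}{\partial w_i} = a_i, \qquad
\frac{\partial z}{\partial a_i} = w_i
$$

So a nudge to $w_i$ pays off in proportion to $a_i$, a nudge to $a_i$ pays off in proportion to $w_i$, and since the $a_i$ can only be moved through the previous layer's weights and biases, that last derivative is what carries the process backward through the network.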