# Chain Rule
My confusion/challenge came from not fully appreciating a few key facts about derivatives:
1. Derivatives are literally saying that locally (in a small range around a given point) a function can be approximated linearly. So, given a function $f$, its derivative tells us the slope of the line tangent to it at any given point. This slope is simply the ratio between $df$ and $dx$, where $dx$ is a tiny nudge to $x$ that approaches $0$ and $df$ is the resulting change in $f$:
$\frac{d}{dx}(f) = \frac{df}{dx} = \lim_{dx \to 0} \frac{f(x + dx) - f(x)}{dx}$
2. Because the derivative is literally a ratio (albeit one where the denominator goes to 0), we can say: locally, around a given input, this is the slope of the line. This slope literally means $\frac{rise}{run}$. So, if we have the slope and are given a $run$ (i.e. a $dx$), we can easily find the $rise$ (i.e. the $df$) by multiplying: $df = \frac{df}{dx} dx$. This is what lets us, at times, multiply through by the $dx$ in the denominator and move it to the other side of an equation.
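The two facts above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the function `f` (here $\sin$), the helper `slope`, the helper `linear_error`, and the point `x0` are all illustrative choices that do not come from the note itself. It estimates the slope at a point and then uses it to predict the rise for a given run.

```python
import math


def f(x):
    # Illustrative function; any smooth function would do.
    return math.sin(x)


def slope(func, x, dx=1e-6):
    # Central finite difference: an approximation of the derivative at x.
    return (func(x + dx) - func(x - dx)) / (2 * dx)


def linear_error(func, x, run):
    # Gap between the rise predicted by the tangent line and the actual rise.
    predicted_rise = slope(func, x) * run
    actual_rise = func(x + run) - func(x)
    return abs(predicted_rise - actual_rise)


x0 = 1.0
for run in (0.1, 0.01, 0.001):
    print(f"run={run}: error of linear prediction = {linear_error(f, x0, run):.2e}")
```

The error shrinks much faster than the run itself, which is what "locally linear" means in practice: the smaller the nudge, the better slope times run predicts the rise.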
### Why is the chain rule multiplicative?
Consider a variable $z$ that is a function of $y$, where $y$ is in turn a function of $x$. We can write the chain rule here as:
$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
The intuition is that we want to know how changing $x$ will impact $z$. There is likely no confusion about the following logic:
* Changing $y$ will impact $z$
* Changing $x$ will impact $y$
* So, changing $x$ will impact $z$
* This impact is likely some combination of $x$'s impact on $y$, and then $y$'s impact on $z$
The question that this still does not answer is *why* do we *multiply these impacts*? After all, you could imagine other ways of combining them. We could add them, subtract them, raise them to the power of $\pi$, push them through a $\sin$ function and divide by $1,000$ - you get the point. We could combine these in an *infinite number of ways*! So, why is multiplication the right option?
I would argue that it is easiest to think about this in terms of *effects* of one number on another via an operation. Say we have the expression:
$c = 5 \times 3$
We can write that as:
$c = f(3)$
Where $f(x) = 5x$. So we can think of $f$ as simply returning a starting base value of $5$, *scaled smoothly* by some amount - in this case $3$. It is this notion of *smooth scaling* that is useful here. Consider for a moment the number $5$ itself. One way to think about this number is as the *area* below:

In this case $5$ is simply the shaded region ($5 \times 1$). If we think about impacting this number, one of the smoothest and simplest ways to do so is via changing the number we are sweeping by. For instance, we can sweep via $3$:

The key idea here is that if we want to think about the smallest, simplest way that changing an input can impact an output, multiplication (scaling) is the way to go.
Say we were to use addition as the way to combine our derivatives in the chain rule. Why would that not work? Because addition wouldn’t allow the change in $y$ with respect to $x$ to be ***incorporated*** into the change of $z$ with respect to $y$. It could be *combined*, of course, via addition. But it couldn’t be *directly incorporated*!
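A quick numerical experiment makes the difference between multiplying and adding visible. This is only a sketch under assumed choices: the functions `inner` ($y = x^2$) and `outer` ($z = \sin(y)$), the helper `slope`, and the point `x0` are hypothetical examples, not anything specified in this note. It compares the directly measured slope of $z$ with respect to $x$ against both the product and the sum of the two intermediate slopes.

```python
import math


def inner(x):
    # y as a function of x (illustrative choice).
    return x ** 2


def outer(y):
    # z as a function of y (illustrative choice).
    return math.sin(y)


def slope(func, x, dx=1e-6):
    # Central finite difference: an approximation of the derivative at x.
    return (func(x + dx) - func(x - dx)) / (2 * dx)


x0 = 0.7
y0 = inner(x0)

dy_dx = slope(inner, x0)                         # how x nudges y
dz_dy = slope(outer, y0)                         # how y nudges z
dz_dx = slope(lambda x: outer(inner(x)), x0)     # how x nudges z, measured directly

product = dz_dy * dy_dx
total = dz_dy + dy_dx

print(f"measured dz/dx : {dz_dx:.6f}")
print(f"product        : {product:.6f}")
print(f"sum            : {total:.6f}")
```

The product agrees with the measured slope to within numerical error, while the sum does not. Note that `dz_dy` is evaluated at `y0 = inner(x0)`, the value of $y$ that $x_0$ actually produces.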
Could it be that the notion of *change*, as defined by the derivative, is *linearly incorporated*?
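One way to make this precise is a heuristic sketch (not a full proof) built on the local linear approximation from the facts at the top of this note. The symbols $\Delta x$, $\Delta y$ and $\Delta z$ are introduced here for small but finite nudges; they are an assumption of this sketch and are not used elsewhere in the note. A small nudge to $x$ moves $y$ by roughly the slope times the nudge:

$\Delta y \approx \frac{dy}{dx} \Delta x$

That $\Delta y$ is itself a small nudge to the input of $z$, so the same reasoning applies a second time:

$\Delta z \approx \frac{dz}{dy} \Delta y \approx \frac{dz}{dy} \frac{dy}{dx} \Delta x$

Dividing by $\Delta x$ and letting it shrink to $0$ gives $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$. Each stage rescales whatever nudge it receives, and rescaling twice in a row is the product of the two scale factors.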
### Deep Intuition via an example
Consider again our situation from above:
$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
We still may want further intuition about why we must multiply in order to get our final derivative. A simpler way to reason about this is to think in terms of *ratios*. Let's make this concrete: $z = apples$, $y = days$, $x = years$. We want to know how many apples are picked per year. Now let's say we work 200 days per year:
$\frac{dy}{dx} = 200$
And we pick 15 apples per day:
$\frac{dz}{dy} = 15$
How can we determine how many apples are picked per year? Clearly we just multiply these two ratios!
$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx} = 15 \times 200 = 3000$
This is just another example of a **conversion process**, where we have things in one unit (apples per day) and want to convert it to another unit (apples per year). In order to do so we need another piece of information (days per year). Once we have that we can simply multiply them together!
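The apples example can be written out as a tiny script. This is a minimal sketch; the function names `days_worked` and `apples_picked` are made up for illustration, and both rates are assumed to be constant, as in the example above.

```python
APPLES_PER_DAY = 15    # dz/dy: apples per day
DAYS_PER_YEAR = 200    # dy/dx: working days per year


def days_worked(years):
    # y as a function of x: days worked in a given number of years.
    return DAYS_PER_YEAR * years


def apples_picked(days):
    # z as a function of y: apples picked in a given number of days.
    return APPLES_PER_DAY * days


# Multiplying the two rates converts apples/day into apples/year.
apples_per_year = APPLES_PER_DAY * DAYS_PER_YEAR
print(apples_per_year)                    # 3000

# Pushing one year through both functions gives the same number.
print(apples_picked(days_worked(1)))      # 3000
```

The "days" unit cancels in the product (apples/day times days/year), which is the same cancellation that $\frac{dz}{dy} \frac{dy}{dx}$ suggests.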
To be crystal clear, the reason that we multiply is that it allows the proper cancellation of terms in order to get what we want! In this case it is very clear - we want to end up with $\frac{dz}{dx}$. The only way to get there, given our other pieces of information, is to multiply. Any other operation simply won't yield the desired final result. That is why we must multiply.
---
Links to: [Mathematics MOC](Mathematics%20MOC.md) [Calculus MOC](Calculus%20MOC)
References:
* [Visualizing the chain and product rule](https://www.youtube.com/watch?v=YG15m2VwSjA)
* [Comparing two numbers via quotient](https://math.stackexchange.com/questions/1682771/why-to-use-ratios-to-compare-two-quantities-and-not-difference)
* [Relative change vs relative difference](https://en.wikipedia.org/wiki/Relative_change_and_difference)
* [chain rule intuition](https://math.stackexchange.com/questions/62614/chain-rule-intuition)
* [The Intuitive Notion of the Chain Rule](https://webspace.ship.edu/msrenault/geogebracalculus/derivative_intuitive_chain_rule.html)
* [Lesson 13: Deep Learning Foundations to Stable Diffusion - YouTube](https://youtu.be/vGdB4eI4KBs?t=2327)
Notability: Chain rule intuition