# Does Linearity provide *information*?
#### Linear transformation
Given a map $L(x)$, $L$ is [linear](Linearity.md) if it satisfies the two properties:
* **Additivity**: $L(x + y) = L(x) + L(y)$
* **Homogeneity of degree 1**: $L(ax) = a L(x) \;\; \forall \;\; a$
This is an incredibly restrictive property, and in a sense, the restrictiveness provides us with *information*. It is this information that allows us to describe a linear transformation from $\mathbb{R}^2 \rightarrow \mathbb{R}^2$ via only 4 numbers, which we encode as a 2 x 2 matrix. Specifically, we only need to know where $L$ sends our basis vectors in order to know where it will send *any* vector.
To be concrete, we can define our basis vectors as $\hat{i} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\hat{j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$.
So, let's say that $L$ transforms our basis vectors as follows.
$L(\hat{i}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$
$L(\hat{j}) = \begin{bmatrix} 3 \\ 0 \end{bmatrix}$
We can then simply encode this information in a 2 x 2 matrix and let that represent $L$:
$L = \begin{bmatrix} 2 & 3 \\ 1 & 0\end{bmatrix}$
The key idea here is that we require only *four numbers* to determine where any vector will be taken. Again, this is due to the *structure* that linearity imposes upon $L$. Visually, the transformation has the property that "lines remain lines", and the "gridlines" remain parallel and evenly spaced:

Obviously, gridlines are just an artifact that helps us visualize our space; the key idea is that the *entire space* is transformed in this manner.
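To sanity-check this, here is a minimal numpy sketch (the names are my own) verifying that the images of $\hat{i}$ and $\hat{j}$ alone determine $L(\vec{v})$ for any $\vec{v}$:

```python
import numpy as np

# Where L sends the basis vectors: these are the four numbers
L_i = np.array([2, 1])  # L(i-hat)
L_j = np.array([3, 0])  # L(j-hat)

def L(v):
    # By linearity: L(v) = v_x * L(i-hat) + v_y * L(j-hat)
    return v[0] * L_i + v[1] * L_j

# Encoding the same four numbers as a 2 x 2 matrix gives the same map
M = np.column_stack([L_i, L_j])  # [[2, 3], [1, 0]]

v = np.array([5, -2])
print(L(v))   # [4 5]
print(M @ v)  # [4 5], identical: the matrix *is* those four numbers
```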
#### Nonlinear transformation
Now, let us have a nonlinear transformation, $f$, defined as:
$f \big( \begin{bmatrix} x \\ y \end{bmatrix} \big) = \begin{bmatrix} x + \sin(y) \\ y + \sin(x)\end{bmatrix}$
The fact that it is nonlinear means that lines do not remain lines after the transformation, and gridlines certainly don't remain parallel and evenly spaced. Visually this looks like:

Again, the gridlines are an artifact of the visualization, but the key idea is that the space itself is transformed in a way that is far more complex than our simple linear transformation, $L$. From an information perspective, it seems as though *far more* information is required to describe where each point is transformed via $f$; i.e. we need more than 4 numbers. This makes sense, and I realize this is where multivariable calculus and the Jacobian come into play.
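As a quick sanity check that $f$ really is nonlinear, a small sketch (the names are mine) showing that both linearity properties fail:

```python
import numpy as np

def f(v):
    # f([x, y]) = [x + sin(y), y + sin(x)]
    x, y = v
    return np.array([x + np.sin(y), y + np.sin(x)])

u = np.array([2.0, 1.0])
w = np.array([0.5, 3.0])

# Additivity fails: f(u + w) != f(u) + f(w)
print(f(u + w))     # [1.7432 4.5985]
print(f(u) + f(w))  # [3.4826 5.3888]

# Homogeneity fails: f(2u) != 2 f(u)
print(f(2 * u))     # [4.9093 1.2432]
print(2 * f(u))     # [5.6829 3.8186]
```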
My question is this:
> Even though $L$ can describe our transformation simply via 4 numbers, encoded in a 2x2 matrix, it seems to me as though we still need to perform a computation to see where $L$ has taken a vector $\vec{v}$.
> For instance, in the transformation below:
> $L \big( \begin{bmatrix} 2 \\ 1 \end{bmatrix} \big)= \begin{bmatrix} 2 & 3 \\ 1 & 0\end{bmatrix} \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 7 \\ 2 \end{bmatrix}$
> We still needed to actually perform the matrix-vector multiplication; we still needed to carry out some computation.
> In the case of $f$ in the transformation below:
> $f \big( \begin{bmatrix} 2 \\ 1 \end{bmatrix} \big) = \begin{bmatrix} 2 + \sin(1) \\ 1 + \sin(2)\end{bmatrix}$
> We still just need to plug in our vector and carry out the computation. So in both cases we need to **carry out the computation**. So, *how is it that $L$ actually provided us with information*?
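(For the record, both evaluations above check out numerically; a quick sketch:)

```python
import numpy as np

M = np.array([[2, 3], [1, 0]])
print(M @ np.array([2, 1]))                      # [7 2]
print(np.array([2 + np.sin(1), 1 + np.sin(2)]))  # [2.8415 1.9093]
```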
## Answer
When thinking about the information contained in a function, we can think about: *what do we need to describe the function completely?*
Concretely, suppose that we know $f: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ and that $f \big( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \big) = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$. If we are *told* (given information) that the function is **linear**, then this single observation already determines $f$ along the entire line spanned by $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$, since $f \big( a \begin{bmatrix} 1 \\ 1 \end{bmatrix} \big) = a \begin{bmatrix} 2 \\ 3 \end{bmatrix}$ for every $a$. Knowing $f$ on just one more linearly independent vector then describes $f$ completely, in the sense that we can solve for $f ( \begin{bmatrix} x \\ y \end{bmatrix})$ for any $x$ and $y$.
#### Key Intuition
We must think about the information content in the context of *not knowing* the function $f$ (this is analogous to [Shannon Entropy](Information-Theory%201.md), where we *gain information* upon observing the outcome of a random experiment; the outcome was *unknown* prior to conducting the experiment). So, we can ask the question: "Assuming that we don't know $f$, what does it tell us if we uncover that $f$ is linear?"
In the case that $f$ is linear, we gain an *incredible* amount of information. Specifically, *assuming that we do not know $f$*, we can describe where $f$ takes *any vector* given the following information:
* We are told $f$ is linear
* We are told where $f$ takes two linearly independent input vectors (e.g. the basis vectors $\hat{i}$ and $\hat{j}$)
This information has actually *told us* what $f$ is. If $f$ were nonlinear, we would need *far* more information to uncover $f$.
So, to ensure this is clear:
> To reason about how much *information* a function provides, we must start from assuming that we know nothing about $f$. Then we think about what the *knowledge* that $f$ is *linear* tells us.
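To see this recovery explicitly, here is a minimal numpy sketch. It reuses the observation $f \big( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \big) = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$ from above and adds a second, hypothetical observation; the two together pin down $f$ everywhere:

```python
import numpy as np

# We are *told* f is linear, and we observe it on two linearly
# independent inputs (the second input/output pair is hypothetical)
u1, f_u1 = np.array([1.0, 1.0]), np.array([2.0, 3.0])
u2, f_u2 = np.array([1.0, -1.0]), np.array([4.0, 1.0])

# Linearity means f(v) = A v for some 2 x 2 matrix A, so
# A [u1 u2] = [f(u1) f(u2)], and we can solve for A
U = np.column_stack([u1, u2])
F = np.column_stack([f_u1, f_u2])
A = F @ np.linalg.inv(U)

print(A)                         # [[ 3. -1.], [ 2.  1.]]
print(A @ np.array([5.0, 7.0]))  # [ 8. 17.], f at an input never observed
```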
#### Technicalities
There's a bit of a hierarchy here that's useful to think about:
1. Constant
2. Linear
3. Polynomial
4. Analytic
5. Smooth
Each of the above falls into a different category of "how much do I need to know to recover the function completely". For a function $f: \mathbb{R} \rightarrow \mathbb{R}$:
1. Just evaluate at a single point and you're set--you know $f$ for any other input
2. You need to know $f$ at two points (taking "linear" here in the degree-1 sense, $f(x) = ax + b$; a strictly linear $f(x) = ax$ needs only one nonzero point)
3. You need to know $f$ at $n+1$ points, where $n$ is the degree of the polynomial (see the sketch after this list)
4. You need to know $f$ on **a set with a limit point**
5. You need to know $f$ on a **dense set**
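To make item 3 concrete, a small sketch (the polynomial and the sample points are arbitrary choices of mine) recovering a degree-2 polynomial from $n + 1 = 3$ samples:

```python
import numpy as np

# A "hidden" degree-2 polynomial: f(x) = 3x^2 - 2x + 5
def hidden(x):
    return 3 * x**2 - 2 * x + 5

# Three input/output pairs are all the information we need
xs = np.array([0.0, 1.0, 2.0])
ys = hidden(xs)

coeffs = np.polyfit(xs, ys, deg=2)  # solves for the three coefficients
print(coeffs)                       # [ 3. -2.  5.]
print(np.polyval(coeffs, 10.0))     # 285.0, f is now known at *any* input
print(hidden(10.0))                 # 285.0
```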
#### More Thoughts....

It's somehow the difference between saying
- a linear function is basically just two points--four numbers--that's extremely good compression
- a linear function allows you to recover an entire copy of the real numbers from just four numbers--that's a lot of info
---
It almost feels like there's a tradeoff that's secretly doing some work--like there's a certain amount of information in a "suitable" subset of $\mathbb{R}^2$ (suitable here meaning it's the graph of something), and the nicer the function is, the more of that info the function's structure can absorb; the worse it is, the more info has to be conveyed by sampling more of the subset.