# Covariance Matrix
### As a *linear transformation*
Consider a set of data points in $\mathbb{R}^2$:

We have a coordinate system, $X$ and $Y$, defined entirely by our choice of *basis vectors* (red and green). In this case our basis is the canonical $e_1$ and $e_2$, the *standard basis*. We could of course choose a different basis, and thus have a different coordinate system if desired.
Now, let us have a covariance matrix, $\Sigma$:
$\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$
Let us *transform* our data via the covariance matrix. Note: we can think of the covariance matrix as a linear transformation, specifically one that *generates* our data:

We see that our data now takes the shape of an ellipse. We also see that after the transformation, $e_1$ and $e_2$ are both knocked off their spans, having been *rotated*.
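Below is a minimal numpy sketch of this picture; the standard-normal "pre-transform" cloud and the variable names are illustrative assumptions, not something taken from the original notebook:

```python
import numpy as np

# The covariance matrix from above, treated as a linear transformation
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

# An assumed "pre-transform" cloud: isotropic standard-normal points in R^2
rng = np.random.default_rng(0)
points = rng.standard_normal((1000, 2))   # each row is one point

# Apply the transformation: each point x maps to Sigma @ x
transformed = points @ Sigma.T            # elliptical cloud, shape (1000, 2)

# The standard basis vectors get knocked off their spans (rotated)
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(Sigma @ e1)   # [1.  0.5] -- no longer a multiple of e1
print(Sigma @ e2)   # [0.5 1. ] -- no longer a multiple of e2
```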
Is there a basis we could choose to describe our data such that, after the covariance transformation, the basis vectors are only *scaled*, not rotated? Let's try the *eigenbasis*. Below we can see the eigenvectors (of $\Sigma$) and our original data:

Note that if we were truly using an eigenbasis, we would no longer have the $X$ and $Y$ coordinate system and tick marks. This is because the $X$ and $Y$ coordinate system is entirely based upon our choice of basis, which would no longer be $e_1$ and $e_2$.
With that said, if we then apply the covariance transformation we end up with:

We see that our eigenvectors were simply *scaled*.
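We can verify this numerically with `np.linalg.eigh` (a small sketch using the $\Sigma$ defined above):

```python
import numpy as np

Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

# eigh handles symmetric matrices; the columns of vecs are the eigenvectors
vals, vecs = np.linalg.eigh(Sigma)
print(vals)   # [0.5 1.5]

for lam, v in zip(vals, vecs.T):
    # Applying Sigma keeps each eigenvector on its span, scaled by its eigenvalue
    print(np.allclose(Sigma @ v, lam * v))   # True
```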
I have been talking of the covariance matrix as having *generated* our data set (the elliptical set of points). However, there are two useful things to keep in mind about this intuition:
1. In general applied settings, we only observe the *final data set*, not an original space of points that is then transformed. We *learn* an *estimate* of the covariance matrix and *assume* (perhaps incorrectly) that it was responsible for generating our data. This assumption means we believe a linear transformation generated our data. It also requires an original space of points (the circular set of points, pre-transform). Where do those points come from, and what do they represent? It is not entirely clear, since they are never observed in reality.
2. We must remember we are dealing with two matrices. We have a matrix encapsulating our data set (all observations, each of which lives in $\mathbb{R}^2$). And then we have the $2 \times 2$ covariance matrix, $\Sigma$, that encodes the relationships between the different dimensions (as defined by our basis vectors) of our data.
### Wait so what is the upshot of all of this?
We start with a data matrix, let’s call it $X$, consisting of $n$ $d$-dimensional observations:
$X = \begin{bmatrix} - \mathbf{x}_1 - \\ - \mathbf{x}_2 - \\ \vdots \\ - \mathbf{x}_n - \end{bmatrix}$
where $\mathbf{x}_i \in \mathbb{R}^d$. We can calculate the covariance matrix of $X$, which we call $\Sigma$. We can assume that $\Sigma$ *generated* our data $X$. We can find the eigenvectors of $\Sigma$, which correspond to the directions of greatest variance in our data (each eigenvector has an associated eigenvalue that represents the amount of variance along it). We can use these eigenvectors as our *basis vectors* to describe our data.
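A rough numpy sketch of that pipeline (the simulated data, sample size, and variable names here are assumptions for illustration):

```python
import numpy as np

# Hypothetical data matrix X: n = 500 observations, d = 2 dimensions
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0],
                            cov=[[1, 0.5], [0.5, 1]],
                            size=500)

# Estimate the covariance matrix (rowvar=False: rows are observations)
Sigma_hat = np.cov(X, rowvar=False)          # shape (d, d)

# Eigendecomposition: each eigenvalue is the variance along its eigenvector
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)

# Sort so the highest-variance direction comes first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)   # variances along the new basis vectors, largest first
```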
This is so powerful because we have found the dimensions of greatest variance, aka *greatest information content*. We can then *get rid* of dimensions with low information content, hence reducing our dimensionality.
So *that* is the benefit! The benefit is *not* that:
* Finding the eigenvectors of $\Sigma$ allows us to find directions that are simply scaled and not rotated. In this context that isn't all that useful.
The benefit is:
> Finding a set of basis vectors to describe our points (i.e. *changing our basis*) that captures the dimensions of **greatest variance** (information content) allows us to then *retain* the dimensions with the most information and *discard* those with low information content.
This is the entire idea behind [Principal Component Analysis](Principle%20Component%20Analysis.md).
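As a hedged sketch of that reduction step (reusing the same kind of simulated data as above and keeping only the top eigenvector; an illustration, not a full PCA implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500)

# Covariance estimate and its eigenbasis, sorted by descending variance
Sigma_hat = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Keep only the k highest-variance directions and project onto them
k = 1
W = eigvecs[:, :k]
X_reduced = (X - X.mean(axis=0)) @ W   # coordinates in the retained basis
print(X_reduced.shape)                 # (500, 1)
```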
---
Date: 20220720
Links to: [Linear Algebra MOC](Linear%20Algebra%20MOC.md)
Tags: #review
References:
* [intuitiveml/covariance_matrix_as_transformation.ipynb at master · NathanielDake/intuitiveml · GitHub](https://github.com/NathanielDake/intuitiveml/blob/master/notebooks/unsupervised%20learning/covariance_matrix_as_transformation.ipynb)