# Singular Value Decomposition

## Outline (Writing)

> Key Idea: For a linear transformation mapping from one vector space to another, there always exists a basis in the domain and a basis in the codomain such that the matrix mapping from one to the other is diagonal. Also, a matrix is always defined wrt a basis!

Biggest ideas: see notability (especially, Ways to View a Matrix)

1. As a Linear Transformation
2. Change of Basis
    1. How we can view a change of basis as a rotation
3. Linear regression
    1. Range, Domain, Codomain, etc
    2. [The Linear Algebra View of Least-Squares Regression | by Andrew Chamberlain, Ph.D. | Medium](https://medium.com/@andrew.chamberlain/the-linear-algebra-view-of-least-squares-regression-f67044b7f39b)
    3. [Dear linear algebra students, This is what matrices (and matrix manipulation) really look like - YouTube](https://www.youtube.com/watch?v=4csuTO7UTMo)
    4. [What is Column Space? — Example, Intuition & Visualization | by Aerin Kim | Towards Data Science](https://towardsdatascience.com/what-is-column-space-with-a-machine-learning-example-8f8a8d4ec6c)
    5. Relationship to column space
4. SVD thought of as a rotation, scale, rotation
    1. AMAS blog post
    2. [What is the Singular Value Decomposition? - YouTube](https://www.youtube.com/watch?v=CpD9XlTu3ys&t=201s)
5. SVD as iterative line fitting
    1. Jeremy Kun
6. SVD as it relates to transpose
7. SVD and its connection to PCA

### Notes to organize

* _$e_1$ and $e_2$ as $[1, 0]$ and $[0, 1]$ end up being isomorphic when they refer to themselves as a basis! $[1, 0]$ will always refer to simply the first basis vector, but in the standard basis that is literally itself!_
* _SVD works because linear transforms are so heavily constrained. Also, mention how much information a linear transform has packed in due to its constraints! Or is it low information…?_

### TLDR: A rotation, scaling, rotation

> All linear transformations can be decomposed (via the singular value decomposition) into:
> 1. A rotation
> 2. A scaling
> 3. A subsequent rotation

Visually this can be seen in [svd visualization.excalidraw](svd%20visualization.excalidraw.md). We can state it more formally as:

> A matrix $A$ can be decomposed as:
> $A = U \Sigma V^T$
> Where we have:
> 1. $V^T$ applies a rotation (because it is an orthogonal matrix) that rotates the right singular vectors to the standard basis (the singular vector with the largest singular value lands on the first standard basis vector, and so on). Think of this as a very useful change of basis in the domain (input space).
> 2. $\Sigma$ scales and either adds or removes dimensions.
> 3. $U$ rotates the standard basis to align with the left singular vectors (a change of basis in the output space).

### Key Idea

We are looking for a set of orthogonal vectors that, after being transformed, are still orthogonal. These vectors are what *define* the SVD. We can think of this as a rotation of our domain and a rotation of our codomain.

### Best Example

Watch this video [here](https://www.youtube.com/watch?v=vSczTbgc8Rc&list=PLWhu9osGd2dB9uMG5gKBARmk73oHUUQZS&index=4&t=74s).
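To make the rotation, scaling, rotation picture concrete, here is my own minimal NumPy sketch (the matrix `A` below is just a made-up example, not taken from any of the linked sources) that computes $A = U \Sigma V^T$ and checks each of the three pieces:

```python
import numpy as np

# A made-up 3x2 matrix: a map from R^2 (domain) to R^3 (codomain)
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

# Full SVD: A = U @ Sigma @ V^T
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# U (3x3) and V (2x2) are orthogonal, i.e. rotations (possibly with a reflection)
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(Vt @ Vt.T, np.eye(2))

# Sigma is rectangular diagonal: it scales by the singular values
# and pads with a zero row to move from R^2 up to R^3
Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(s)

# Rotation (V^T), then scaling (Sigma), then rotation (U) recovers A
assert np.allclose(U @ Sigma @ Vt, A)
print(s)  # singular values, largest first
```

Note that `full_matrices=True` returns the square $U$ and $V$, with $\Sigma$ padded out to the rectangular shape of $A$; the padding is exactly the "adds or removes dimensions" step.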
Consider a matrix $A \in \mathbb{R}^{m \times n}$, i.e. a map

$A: \mathbb{R}^n \rightarrow \mathbb{R}^m$

The singular value decomposition states that it can be expressed as:

$A = U \Sigma V^T$

Where the following properties hold:

* $U$ and $V$ are orthogonal (meaning they are *rotation* transformations, also able to be viewed as a [change of basis](Rotation%20and%20Change%20of%20Basis.md))
* $\Sigma$ is a rectangular diagonal matrix (scaling matrix)

Now, $A$ is clearly rectangular in this case, assuming $n \neq m$. We can construct ***symmetric*** matrices from $A$ via:

$S_L = AA^T \in \mathbb{R}^{m \times m}$

$S_R = A^TA \in \mathbb{R}^{n \times n}$

We know that symmetric matrices have some beautiful properties; one of them is:

> A symmetric matrix will have **orthogonal eigenvectors**.

Recall that if we take an orthogonal matrix of eigenvectors and apply it to a space, it will transform the standard basis to the eigenvectors. If we take the transpose of the orthogonal matrix of eigenvectors, it will rotate the eigenvectors to the standard basis.

Now, the eigenvectors of $S_L$ are known as the **left singular vectors** of $A$, and the eigenvectors of $S_R$ are known as the **right singular vectors** of $A$. Note that the nonzero eigenvalues of $S_L$ equal those of $S_R$ (any leftover eigenvalues are $0$), and the singular values of $A$ are their square roots.

![700](SVD%20Visualized,%20Singular%20Value%20Decomposition%20explained%20_%20SEE%20Matrix%20,%20Chapter%203%2010-0%20screenshot.png)

We can now break down the decomposition as follows:

1. $V^T$ applies a rotation (because it is an orthogonal matrix) that rotates the right singular vectors to the standard basis (the singular vector with the largest singular value lands on the first standard basis vector, and so on)
2. $\Sigma$ scales and either adds or removes dimensions
3. $U$ rotates the standard basis to align with the left singular vectors

This is visualized in full detail in [svd visualization.excalidraw](svd%20visualization.excalidraw.md).

### Another way to think about it

We wish to find an orthogonal set of vectors that, once transformed by the matrix $A$, remain orthogonal. It is this set of vectors that *defines* the singular value decomposition of $A$:

![](What%20is%20the%20Singular%20Value%20Decomposition_%201-25%20screenshot.png)
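A quick numerical sanity check of the two ideas above (my own sketch with the same made-up matrix as before, not from the note's sources): the columns of $V$ are eigenvectors of $S_R = A^TA$ with eigenvalues $\sigma_i^2$, and the orthogonal set $\{v_i\}$ stays orthogonal after applying $A$, landing on $\sigma_i u_i$.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Right singular vectors are eigenvectors of S_R = A^T A,
# with eigenvalues equal to the squared singular values
S_R = A.T @ A
for i in range(V.shape[1]):
    assert np.allclose(S_R @ V[:, i], (s[i] ** 2) * V[:, i])

# The orthogonal set {v_1, v_2} stays orthogonal after the transform:
# A v_i = sigma_i * u_i, and the u_i are orthonormal
images = A @ V                        # columns are A v_i
assert np.allclose(images.T @ images, np.diag(s ** 2))
assert np.allclose(images, U * s)     # A v_i = sigma_i u_i, column by column
print(s ** 2)                         # eigenvalues of A^T A
```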
### Jeremy Kun Perspective

See these two deep dives:

* [Singular Value Decomposition Part 1 - Perspectives on Linear Algebra](Singular%20Value%20Decomposition%20Part%201%20-%20Perspectives%20on%20Linear%20Algebra.pdf)
* [Singular Value Decomposition Part 2 - Theorem, Proof, Algorithm](Singular%20Value%20Decomposition%20Part%202%20-%20Theorem,%20Proof,%20Algorithm.pdf)

Key ideas:

* Factoring a matrix via SVD provides an alternative and more useful way to represent the process of people rating movies. By changing the basis of one or both vector spaces involved, we isolate different orthogonal characteristics of the process. Why is this orthogonality crucial? Because then there is *no overlap/redundancy/shared information* between our basis vectors. If two basis vectors are very similar, then knowing one nearly tells you the other. If you are able to create an orthogonal basis, and then use the eigenvalues to infer how much information lies along each basis vector (larger eigenvalues mean more information), then you can discard the basis vectors with less information, retaining those with the most information. This is a form of dimensionality reduction.
* We can view SVD as finding the best (linear) subspace approximation. We get better and better approximations with each additional (orthogonal) dimension added (where we add dimensions based on their associated eigenvalue).

![](Screen%20Shot%202022-07-19%20at%204.36.22%20PM.png)

* So, we specifically start by finding a vector, $v$, for which the projection of all points in our dataset onto the span of $v$ has the minimum projection distance (sum of the $z$'s above). Equivalently, we wish to find $v$ such that the *projected distances* (the $y$'s) are *maximized*. Intuitively, if we maximize the projected distances we are finding the direction of greatest variance! This $v$ turns out to be the *eigenvector* (of the covariance matrix of the data) with the *largest eigenvalue*.
* How do eigenvectors fit in? Eigenvectors point in the direction of greatest variance. Look up an algebraic proof.

### New JK Insights

* Let us have a linear map $A: \mathbb{R}^n \rightarrow \mathbb{R}^m$. This is simply a map from the $n$-dimensional vector space $\mathbb{R}^n$ to the $m$-dimensional vector space $\mathbb{R}^m$. This linear map is effectively *abstract*, *not concrete*. That is, until we 'instantiate' it via a concrete matrix.
* This involves selecting a basis in $\mathbb{R}^n$ and then describing where each of these basis vectors that were living in $\mathbb{R}^n$ lands in $\mathbb{R}^m$.
* One of the biggest keys to realize is that $A$ isn't really about either $\mathbb{R}^n$ or $\mathbb{R}^m$. It is about the *map* between them! It is about the process of moving from one space to the other. It isn't really about the spaces themselves.

### Direction of greatest Variability

See [PCA](PCA.md) and [Covariance Matrix](Covariance%20Matrix.md). Jeremy Kun's article [Singular Value Decomposition Part 2: Theorem, Proof, Algorithm – Math ∩ Programming](https://jeremykun.com/2016/05/16/singular-value-decomposition-part-2-theorem-proof-algorithm/) makes it very clear how we will find the directions of greatest variability; that is literally what our objective will be! To see how this ends up being the eigenvectors, there are a host of proofs online, none of which end up being particularly enlightening in my opinion.

### A Linear Transformation, in terms of SVD

We can think of a linear transformation as projecting a vector onto the row space (think of this as a form of rotation), scaling (due to the dot product), and then rotating the result to the basis of the codomain. But even if it isn't a rotation per se, it doesn't really matter; it is literally what we defined the map to be! See good notes for more detail, and the image below:

![](Screen%20Shot%202022-07-19%20at%204.32.42%20PM.png)

### Optimization, or the main objective

The main objective is simply to find the best approximating 1-d subspace for a given data set. This is done by finding the subspace for which the total length of the projections of the data points onto the subspace is maximized. Then this is repeated, except the subsequent subspace must be orthogonal to the first. This continues until you have achieved your desired accuracy. A small numerical sketch of this greedy view follows below.

### See notes in notability (ways to view a matrix!)
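To tie the optimization view above to something concrete, here is my own rough sketch on made-up data (not taken from the linked posts): the top right singular vector maximizes the total squared projection length, and keeping more orthogonal directions gives better and better approximations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 points in R^3 that mostly lie along a single direction
direction = np.array([2.0, 1.0, 0.5])
data = np.outer(rng.normal(size=200), direction) + 0.1 * rng.normal(size=(200, 3))
data -= data.mean(axis=0)  # center so subspaces pass through the origin

U, s, Vt = np.linalg.svd(data, full_matrices=False)

# The first right singular vector spans the best-fitting 1-d subspace:
# it maximizes the total (squared) length of the projections
v1 = Vt[0]
proj_lengths = data @ v1
print("direction of greatest variance:", v1)
print("total squared projection length:", np.sum(proj_lengths ** 2), "=", s[0] ** 2)

# Each additional orthogonal direction improves the approximation;
# keeping the top-k directions is the "repeat, but orthogonal to what
# came before" procedure described above
for k in (1, 2, 3):
    approx = (U[:, :k] * s[:k]) @ Vt[:k]
    err = np.linalg.norm(data - approx)
    print(f"rank-{k} approximation error: {err:.4f}")
```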
---
Date: 20220314
Links to: [Linear Algebra MOC](Linear%20Algebra%20MOC.md)
Tags:
References:
* [SVD, Matrix Transpose, Linear Regression, Column Space, Basis (Notion)](https://www.notion.so/SVD-Matrix-Transpose-Linear-Regression-Column-Space-Basis-3419582af5884d9f89638b52403d10ae)
* [Singular Value Decomposition as Simply as Possible](https://gregorygundersen.com/blog/2018/12/10/svd/)
* [What is the Singular Value Decomposition? - YouTube](https://www.youtube.com/watch?v=CpD9XlTu3ys&t=287s)
* [SVD Visualized, Singular Value Decomposition explained | SEE Matrix, Chapter 3 - YouTube](https://www.youtube.com/watch?v=vSczTbgc8Rc&list=PLWhu9osGd2dB9uMG5gKBARmk73oHUUQZS&index=4&t=74s)
* [Best proof, here](https://www.youtube.com/watch?v=CpD9XlTu3ys&t=169s)