### Why do Neural Networks work at all?

This is discussed in this great youtube video [here](https://youtu.be/ZVVnvZdUMUk?t=378). To summarize, we can say:

* Neural Networks are composed of many layers of linear transformations (matrices) that have weights (**parameters**).
* These parameters are randomly initialized to begin with.
* Researchers have often wondered: "Neural Networks have so many parameters, how can they even generalize?"
* The reason is the following: because we throw so many parameters at the network, *some subset of the parameters* is going to be randomly initialized in such a *beneficial way* that training will make the network perform well. In other words, what matters is *initialization* and *stochastic gradient descent*.
* By over-parameterizing our NN so much, we give it *combinatorially many* subnetworks to choose from (a network with n weights contains 2^n possible subsets of those weights). These combinatorics almost guarantee that a good subnetwork exists somewhere in there.
* The question still remains: how does the NN *find* that subnetwork inside the larger network? (A sketch of one answer, the iterative magnitude pruning procedure from the lottery ticket hypothesis work, is at the end of this note.)

A few notes on the above:

* This is oddly reminiscent of [the JL Lemma](Johnson-Lindenstrauss-Lecture.md), namely the idea of taking a high dimensional space and embedding it in a low dimensional space.
* We can think of our [Parameter Space](Parameter%20Space) as the space of all possible weights. By increasing the number of parameters, we increase the size of that space. However, we also increase the chances that a *subset* of that space will be randomly initialized in such a way that it defines a lower dimensional manifold tailored for prediction. The more weights we randomly initialize, the more likely it is that a subset exists that excels at prediction.

---
Date: 20211112
Links to:
Tags: #review
References:
* [Why do neural networks work at all](https://youtu.be/ZVVnvZdUMUk?t=378)
* [Good review blog post](https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/)
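
One answer, from the lottery ticket hypothesis work cited above, is *iterative magnitude pruning*: train the full network, prune the smallest-magnitude weights, rewind the surviving weights to their original random initialization, and repeat. Below is a minimal sketch of that procedure in PyTorch, just to make the idea concrete; the toy data, layer sizes, pruning rate, and training settings are hypothetical placeholders, not taken from the video or the blog post.

```python
# Minimal sketch of iterative magnitude pruning (lottery ticket style).
# Assumptions: PyTorch is available; the toy data, network sizes, and
# hyperparameters below are hypothetical placeholders for illustration.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data (hypothetical).
X = torch.randn(256, 20)
y = torch.randn(256, 1)

# Over-parameterized MLP: far more weights than this toy task needs.
model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 1))

# Save the random initialization -- the weights we "rewind" to later.
init_state = copy.deepcopy(model.state_dict())

# Masks start all-ones: every weight is still part of the subnetwork.
masks = {name: torch.ones_like(p)
         for name, p in model.named_parameters() if p.dim() > 1}


def train(model, masks, steps=200):
    """Train while keeping pruned (mask == 0) weights held at zero."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])  # re-apply the mask after each update
    return loss.item()


# Iterative magnitude pruning: train, prune the smallest 20% of surviving
# weights, rewind the rest to their original initialization, repeat.
for round_ in range(3):
    loss = train(model, masks)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p[masks[name].bool()].abs()
            threshold = alive.quantile(0.2)  # smallest 20% of alive weights
            masks[name] *= (p.abs() > threshold).float()
    # Rewind surviving weights to their initial values (the "winning ticket").
    model.load_state_dict(init_state)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])
    kept = sum(m.sum().item() for m in masks.values())
    total = sum(m.numel() for m in masks.values())
    print(f"round {round_}: loss {loss:.4f}, weights kept {kept / total:.1%}")
```

The rewind step is what connects this to the intuition above: the surviving subnetwork is retrained from its *original* random initialization, i.e. it is one of the combinatorially many subnetworks that happened to be initialized in a beneficial way.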