# Neural Networks MOC
1. [Autoencoders](Autoencoders.md)
2. [Transformers](Transformers.md)
3. [Stable Diffusion](notes/Stable%20Diffusion.md)
4. [Cross Entropy and Neural Network Mechanics](Cross%20Entropy%20and%20Neural%20Network%20Mechanics.md)
* [Lottery Ticket Hypothesis](Lottery%20Ticket%20Hypothesis.md)
* [SynFlow-Pruning-Neural-Networks-without-data](SynFlow-Pruning-Neural-Networks-without-data.md)
### Different ways to think about Neural Networks
1. [Neural Networks are Matrix Program Search](Neural%20Networks%20are%20Matrix%20Program%20Search.md)
2. As a locality sensitive hash (see below - TODO: make that into a note)
### Interesting Ideas
- - -
##### From [this talk with Yann LeCun]()
* Neural networks are transforming the basis functions, not the data (TODO: Explain)
* Neural networks can be thought of as a [**locality sensitive hash**](Locality%20Sensitive%20Hashing%20(LSH).md) (see Chollet, here)
* The [curse of dimensionality](High%20Dimensional%20Spaces.md) relates intimately to the extrapolation problem. A lot of engineering goes into making neural nets work very well on specific tasks, and architectural choices can *leak domain-specific information* into your model.
* We often think of there being a single, unified latent space at each layer. However, that is not the case! Each input “toggles”, in an input-specific way, a set of hyperplanes, by virtue of the ReLUs it does or does not activate. Therefore, different populations of input samples reside in different latent spaces at any one layer, defined by the activated set of hyperplanes. So, instead of a single unified latent space, we have latent *spaces* (plural) at each layer.
* Do NNs learn a manifold? Or do they simply learn class boundaries (this is indeed what they are *optimized* to do)? Learning a boundary manifold seems very different from learning intrinsic connection manifolds! Optimizing for *separation* seems like it would give a very different outcome than optimizing for *connection*.
* Neural nets learn what to ignore in the ambient space
* The latent space that a neural net learns is effectively “[stitched together](https://youtu.be/86ib0sfdFtw?t=1213)”. Depending on which cell your input falls into in the input (ambient) space, a different affine transformation will be applied, sending you to a different region of the latent space. So the latent space is stitched together. It is better to think of neural networks as *vectorizing* the input space, much like a vector search engine does using [Locality Sensitive Hashing (LSH)](Locality%20Sensitive%20Hashing%20(LSH).md) (see the sketch after this list)
* What priors encode the information needed for node classification? (see the staircase example at 31:00, [here](https://youtu.be/86ib0sfdFtw?t=1852))
* Data density isn’t as important as the amount of change in a region (think of a derivative), see [here](https://youtu.be/ZaOp1KNhpUQ?t=862). How to think about this when learning probabilities? (TODO: Explain). Where you have a high density of data isn’t necessarily where the underlying function is changing the most; you want data where the function changes the most. Wherever the function changes most is where a linear interpolation will do *worst*. In other words, where there is a large amount of data, there may not be much information!
* Neural networks are parameterized models and they essentially place all of these basis functions in the ambient space. Where those functions are placed is of interest to us. By following the gradient, the neural network will cluster basis functions to where the error is high.
* The further away you get from the convex hull, the greater the uncertainty
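A minimal NumPy sketch of the “latent spaces, plural” / “stitched together” idea above (the network weights here are random and purely illustrative): each input’s ReLU activation pattern selects which hyperplanes are “on”, so different inputs are mapped by different effective affine transformations.
```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 2)), rng.standard_normal(4)  # one hidden layer, 2 -> 4

def activation_pattern(x):
    """Which hyperplanes (ReLUs) does this input activate?"""
    return W @ x + b > 0

def effective_affine(x):
    """The affine map actually applied to x: rows of W zeroed out where the ReLU is off."""
    mask = activation_pattern(x).astype(float)
    return mask[:, None] * W, mask * b

x1, x2 = np.array([1.0, 2.0]), np.array([-2.0, -1.0])
print(activation_pattern(x1), activation_pattern(x2))  # typically different patterns

# Two inputs falling into different cells of the input space are sent through
# different affine maps, i.e. they land in differently-defined latent spaces.
W1, b1 = effective_affine(x1)
assert np.allclose(np.maximum(W @ x1 + b, 0), W1 @ x1 + b1)
```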

TODO: Understand neural networks and basis functions. This is becoming a key theme. See papers and toy example post
### Intuitions
**Introducing Complexity**
Fundamentally, when we use a neural network, our preprocessing and backpropagation steps will not change! What changes is the way we do the *forward pass*. See more [here](https://youtu.be/PaCmpygFfXo?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&t=6429).
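A minimal sketch of this idea (assuming PyTorch; the models and tiny training loop are illustrative, not from the linked lecture): the training loop below never changes, only the module passed in as `model`, i.e. the forward pass, does.
```python
import torch
from torch import nn

def train(model, x, y, steps=100, lr=0.01):
    """Generic training loop: preprocessing and backprop are identical
    regardless of which forward pass (model) we plug in."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)  # only this forward pass differs between models
        loss.backward()              # backprop is unchanged
        opt.step()
    return loss.item()

x, y = torch.randn(64, 10), torch.randn(64, 1)
linear = nn.Linear(10, 1)                                              # simple forward pass
mlp = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))    # more complex forward pass
print(train(linear, x, y), train(mlp, x, y))
```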
**Onehot Vectors**
When we have a one-hot vector multiplied (transformed) via a matrix, we can think of that one-hot vector as “plucking out” (indexing into) the corresponding *row* of the matrix where the one-hot vector is $1$. In other words, one-hot vectors can be thought of as *lookups* into matrices. This is how word embeddings function. Given a specific word, we get its one-hot vector, then pass it into a matrix $W$. The effect of this is to *pull out* a row of $W$. Once the network has been trained, this row of $W$ can be thought of as an *embedding* representation of the original word (associated with our one-hot vector).
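A quick NumPy sketch of the lookup view (vocabulary size and embedding dimension are arbitrary here):
```python
import numpy as np

vocab_size, embed_dim = 5, 3
W = np.random.randn(vocab_size, embed_dim)  # embedding matrix, one row per word

word_idx = 2
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1.0

# Multiplying the one-hot (row) vector by W "plucks out" row `word_idx` of W...
via_matmul = one_hot @ W
# ...which is exactly the same as indexing into the matrix directly.
via_lookup = W[word_idx]

assert np.allclose(via_matmul, via_lookup)
```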
**Knowledge Transfer via Word Embeddings**
Consider the 2003 paper, [A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) (discussed [here](https://youtu.be/TCH_1BHY58I?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&t=300)). The problem it addresses is that as you try to expand the dependencies considered in an NLP modeling task (say, predicting the next word given the preceding words), you run into the curse of dimensionality (and too little data). Here is what I mean:
```
my dog ran after the cat
my dog ran after a cat
```
These are two very similar sentences. However, imagine we were at the fifth word of each sentence, trying to predict the next word. Because `the` and `a` are different, we would end up with different preceding n-grams (contexts):
```
(after, the)
(ran, after, the)
...
(after, a)
(ran, after, a)
...
```
So we see that these n-grams would get their *own counts* (frequencies of occurrence), even though they fundamentally mean very similar things. Is this a problem? Yes! It leads to data sparsity and forces us into higher dimensions than needed. Here we could say that `the` and `a`, each originally given an entire one-hot dimension, could likely be *compressed* into a single dimension.
The paper proposes an approach to *learn an **embedding*** for each word, and if words are close together information can be shared across them. Then, instead of a sentence being a sequence of one-hot vectors, a sentence would be a sequence of embedded feature vectors. And if `the` and `a` had similar embedded feature vectors, the resulting sequence of embedded feature vectors representing each sentence would be incredibly similar!
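A tiny sketch of why this helps (the embedding values below are made up purely for illustration): the two contexts are distinct tuples to a count-based model, but once each word is mapped to an embedding, the two context representations become nearly identical.
```python
import numpy as np

# As discrete tuples, the two contexts share no counts.
context_a = ("ran", "after", "the")
context_b = ("ran", "after", "a")
print(context_a == context_b)  # False: each tuple gets its own counts

# Hypothetical learned embeddings: "the" and "a" end up close together.
emb = {
    "ran":   np.array([0.90, 0.10, 0.00]),
    "after": np.array([0.20, 0.80, 0.10]),
    "the":   np.array([0.10, 0.10, 0.90]),
    "a":     np.array([0.12, 0.08, 0.88]),
}

vec_a = np.concatenate([emb[w] for w in context_a])
vec_b = np.concatenate([emb[w] for w in context_b])
cosine = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(cosine)  # close to 1: the embedded contexts are nearly identical
```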
### Training
[Training Neural Networks](Training%20Neural%20Networks.md)
---
Date: 20211112
Links to: [AI MOC](AI%20MOC.md)
Tags:
References:
*