# LSTM
The main drawback of an LSTM is that it cannot be parallelized across time steps (as a [Transformer](Transformers.md) can). It also still struggles with very long-range dependencies: its gating greatly mitigates the vanishing gradient problem of a vanilla RNN, but does not eliminate it entirely over very long sequences.



The core idea of an LSTM is to add the notion of **gates**. Specifically, its architecture contains three gates.
#### 1. Forget Gate

The forget gate looks at the *previous hidden state*, $h_{t-1}$, as well as the *current input token*, $x_t$, and determines *what part of the cell state, $C_{t-1}$, it should keep* (equivalently, what part of the cell state it should *forget*). Mechanically, $W_f$ is a learned matrix that transforms the concatenation of $h_{t-1}$ and $x_t$; the result is passed through a sigmoid, yielding a vector of entries between 0 and 1. This vector is pointwise multiplied by $C_{t-1}$, determining which parts of $C_{t-1}$ are forgotten.
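In symbols (the standard formulation, with $\sigma$ the sigmoid and $b_f$ a learned bias):

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$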
>**Why is this used?**
> Consider the example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
#### 2. Input Gate (updating the old cell state, $C_{t-1}$)

We then have a sigmoid layer called the **input gate** that takes the concatenation of the previous hidden state, $h_{t-1}$, and the current input, $x_t$, and determines what part of the new information will be included in the cell state update (i.e. the update of the cell state from $C_{t-1} \rightarrow C_t$). Again, this is basically a linear transform (via the matrix $W_i$) followed by a sigmoid, producing a *mask* that determines which parts of the candidate update we should keep.
At the same time, a $\tanh$ layer (a linear transform $W_c$ followed by a $\tanh$ activation) takes in $h_{t-1}$ and $x_t$ and creates a vector of candidate values, $\tilde{C}_t$. These candidate values are pointwise multiplied by the input gate mask that was just created, in order to create the final update vector.
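In symbols (with learned biases $b_i$ and $b_c$):

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right)$$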
We can now actually update our cell state. This consists of two steps, making use of our **forget gate** and **input gate**:
1. Apply the forget gate like a mask (pointwise multiply $f_t$ with the old cell state $C_{t-1}$)
2. Take the resulting vector and pointwise add the input-gated candidate values to it. This yields $C_t$ (summarized in the equation below).
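In symbols, with $\odot$ denoting pointwise multiplication:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$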

>**Why is this used?**
> In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting. This is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
#### 3. Output Gate

The last thing to address is which hidden state we output. The output, $h_t$, is determined as follows (summarized in the equations below):
1. The **output gate**: a sigmoid layer that takes in $h_{t-1}$ and $x_t$ and creates a mask.
2. This mask is then pointwise multiplied by $\tanh(C_t)$, yielding our new hidden state, $h_t$.
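In symbols (with a learned bias $b_o$):

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t)$$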
>**Why is this used?**
> For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
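Putting the three gates together, here is a minimal NumPy sketch of a single LSTM step, following the equations above. The function name `lstm_step`, the separate weight matrices, and the toy shapes are illustrative choices of mine, not a reference implementation (real libraries fuse the four linear transforms into one for efficiency).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step: (x_t, h_{t-1}, C_{t-1}) -> (h_t, C_t)."""
    # Concatenate the previous hidden state and the current input
    z = np.concatenate([h_prev, x_t])

    # 1. Forget gate: which parts of the old cell state to keep
    f_t = sigmoid(W_f @ z + b_f)

    # 2. Input gate and candidate values: what new information to write
    i_t = sigmoid(W_i @ z + b_i)
    C_tilde = np.tanh(W_c @ z + b_c)

    # Cell state update: forget, then add the gated candidates
    C_t = f_t * C_prev + i_t * C_tilde

    # 3. Output gate: which parts of the (squashed) cell state to expose
    o_t = sigmoid(W_o @ z + b_o)
    h_t = o_t * np.tanh(C_t)

    return h_t, C_t

# Toy usage with random weights: hidden size 4, input size 3
rng = np.random.default_rng(0)
H, D = 4, 3
W = lambda: 0.1 * rng.normal(size=(H, H + D))
b = lambda: np.zeros(H)
h_t, C_t = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                     W(), b(), W(), b(), W(), b(), W(), b())
print(h_t.shape, C_t.shape)  # (4,) (4,)
```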
### A Note on the Architecture
It is worth asking how we converged on the above architecture. There are many gates, and it can feel hard to imagine how this design was arrived at. Andrej Karpathy notes in [this lecture](https://youtu.be/yCC09vCHzF8?t=3273) that it is mainly the result of many people experimenting with recurrent architectures for a long time and slowly converging on this one as a reasonable implementation.
### What problem was the LSTM trying to solve?
The problem is that with vanilla RNNs, as the sequence grows in length it becomes harder to capture *long-term dependencies*. Consider a conversation where, in the first few sentences, I tell you I live in Boulder, CO, and then many paragraphs later I ask you where I live. The information about Boulder, CO may be very far back at this point, and the hidden state may no longer be able to represent it effectively.

### What are the big ideas to remember with LSTMs?
The reason that an LSTM is far more effective than a vanilla RNN is that it is effectively the same idea as a ResNet (an additive path that information and gradients can flow through), coupled with the concept of a *forget gate* ([see here](https://youtu.be/yCC09vCHzF8?t=3400)). Because the cell state gives us a superhighway for gradients, the vanishing gradient problem is greatly mitigated in an LSTM, whereas a vanilla RNN suffers from it directly.
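To see why the cell state acts as a gradient superhighway, consider only the direct path through the cell state update (ignoring the indirect dependence of the gates on $C_{t-1}$ through $h_{t-1}$):

$$\frac{\partial C_t}{\partial C_{t-1}} = \mathrm{diag}(f_t)$$

As long as the forget gate stays close to 1, this Jacobian is close to the identity, so gradients can flow back through many time steps largely unattenuated, much like the identity skip connection in a ResNet. In a vanilla RNN, by contrast, every step multiplies the gradient by the recurrent weight matrix and an activation derivative, which tends to shrink (or blow up) the gradient over long sequences.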
---
Date: 20230614
Links to:
Tags:
References:
* [Understanding LSTM Networks -- colah's blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)