# Transformers
### Best overview
See my notebook here [nano-gpt/gpt-dev.ipynb at main · NathanielDake/nano-gpt · GitHub](https://github.com/NathanielDake/nano-gpt/blob/main/gpt-dev.ipynb)

##### Keys, Values and Queries
We compute the dot product between the **query** and the **keys**. We can think of this as an *indexing scheme* into the transformers *memory* of **values**. See more from Yannic here:
[Attention Is All You Need](https://youtu.be/iDulhoQ2pro?t=1390).
We can think of this mechanism as follows:
* The *encoder* says “here are a bunch of things about the source sentence that you may find interesting”. These are the *values*. We can think of these as things like the name, the height, the weight of a person.
* The *keys* are a way to *index* the *values*. So it is really saying: “here are a bunch of things that you may find interesting, and here is how you might *address* these things”. These would be things such as ‘name’, ‘height’, ‘weight’.
* The other part of the network, the *decoder*, builds the *queries*, which we can think of as saying “I would like to know certain *things*.” This portion of the network may say “I would like the name”, so the query would be ‘name’, and it would align with the ‘name’ key, have a high dot product, and effectively address into the correct value, the actual name of the person.
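A tiny numeric sketch of this lookup intuition (the key, value and query vectors here are made up, and this is not code from the notebook above):
```python
import numpy as np

# Toy sketch of the key/value/query intuition (illustrative vectors only).
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

labels = ["name", "height", "weight"]
K = np.array([[1.0, 0.0, 0.0],    # key for 'name'
              [0.0, 1.0, 0.0],    # key for 'height'
              [0.0, 0.0, 1.0]])   # key for 'weight'
V = np.array([[5.0, 5.0],         # value stored under 'name'
              [1.0, 0.0],         # value stored under 'height'
              [0.0, 1.0]])        # value stored under 'weight'

q = np.array([0.95, 0.05, 0.0])   # a query that roughly asks for 'name'

weights = softmax(q @ K.T)        # how strongly the query aligns with each key
print(dict(zip(labels, weights.round(3))))
print(weights @ V)                # output is dominated by the 'name' value
```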
Another way to think of this is as follows:
* We start with our embedding input. Say it is 256 dimensions. We split it into several parts, say 8 parts (in the image below, $h = 8$). Now each of the 8 parts has 32 dimensions. They are all sent into linear layers.

The resulting attention can be formulated as:
$\text{Attention} (Q, K, V) = \text{softmax} \Big( \frac{QK^T}{\sqrt{d_k}} \Big)V$
After we have performed our multiple attentions ($h = 8$ in total), we concatenate them so that we obtain the same size as the input embedding again. The concatenation is then sent through a linear layer, and that is the output of the multihead attention (on the right in the image above).
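A minimal PyTorch sketch of the formula and the split/concatenate scheme above, assuming the 256-dimension / 8-head example (this is not the nano-gpt code):
```python
import torch
import torch.nn.functional as F

# d_model = 256 split across h = 8 heads -> d_k = 32, matching the example above.
B, T, d_model, h = 1, 10, 256, 8
d_k = d_model // h

x = torch.randn(B, T, d_model)
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)
W_o = torch.nn.Linear(d_model, d_model, bias=False)  # final linear after concat

# Project, then reshape so each head gets its own 32-dim slice: (B, h, T, d_k)
q = W_q(x).view(B, T, h, d_k).transpose(1, 2)
k = W_k(x).view(B, T, h, d_k).transpose(1, 2)
v = W_v(x).view(B, T, h, d_k).transpose(1, 2)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed per head
scores = q @ k.transpose(-2, -1) / d_k**0.5          # (B, h, T, T)
pattern = F.softmax(scores, dim=-1)
out = pattern @ v                                     # (B, h, T, d_k)

# Concatenate the heads back to d_model, then the output linear layer
out = out.transpose(1, 2).contiguous().view(B, T, d_model)
out = W_o(out)
print(out.shape)  # torch.Size([1, 10, 256])
```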
#### Self Attention propagates information *between vectors*
> More importantly, **self attention** is the only operation in the whole architecture that **propagates information _between_** vectors. Every other operation in the transformer is applied to each vector in the input sequence without interactions between vectors.
This is very important to remember. Self attention is where the power of a transformer comes from: its ability to share information between positions.
This is the basic intuition behind self-attention. The dot product expresses how related two vectors in the input sequence are, with “related” defined by the learning task, and the output vectors are weighted sums over the whole input sequence, with the weights determined by these dot products.
Another way to think about attention is that it dynamically allocates its weights to *some* values and not others. This dynamic allocation is based on the dot product between the queries and keys.
#### Positional Encoding
Why is the positional encoding block needed at the start of the transformer? Because out of the box the transformer is **permutation equivariant**: if we take a sentence and reorder the words, the transformer produces the same outputs, just reordered to match, because it has no built-in notion of word order. This is obviously not good for sequence-to-sequence tasks. Order of words matters in language!
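A sketch of the sinusoidal positional encoding from the original paper, which is added to the token embeddings so that order becomes visible to the model:
```python
import torch

# Sinusoidal positional encoding: sin on even dimensions, cos on odd dimensions.
def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len).unsqueeze(1).float()       # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()                 # even dimension indices
    angle = pos / (10000 ** (i / d_model))                  # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

pe = positional_encoding(max_len=10, d_model=256)
# x = token_embeddings + pe   # broadcast over the batch dimension
print(pe.shape)  # torch.Size([10, 256])
```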
#### Decoder Block: Masked Multihead Attention
Why is the mask applied in the decoder block? That is *specifically* needed in the original use case of the paper, *sequence to sequence translation*. In that case, you need to ensure that the model isn’t able to *look forward in time*. That is why this is sometimes called *causal self attention*.
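A minimal sketch of the causal mask: future positions get $-\infty$ before the softmax, so they receive exactly zero weight.
```python
import torch
import torch.nn.functional as F

T = 5
scores = torch.randn(T, T)                          # raw attention scores for one head
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float('-inf'))   # block attention to the future
pattern = F.softmax(scores, dim=-1)                 # rows sum to 1 over past tokens only
print(pattern)  # the upper triangle is exactly zero: no information from the future
```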
### These things are hard to tune
- Learning rate schedule, warmup strategy, decay settings are all hard to tune [https://thegradient.pub/transformers-are-graph-neural-networks/](https://thegradient.pub/transformers-are-graph-neural-networks/) [https://twitter.com/chaitjo/status/1229335421806501888?s=12&t=iFfvjsPULVNSCARsnGABLg](https://twitter.com/chaitjo/status/1229335421806501888?s=12&t=iFfvjsPULVNSCARsnGABLg)
### Message Passing
Transformers are basically **message passing** (see more [here](https://youtu.be/XfpMkf4rD6E?t=1538)).
### Attention frees us from Euclidean space!

# Neel Nanda Intuitions
[What is a Transformer? (Transformer Walkthrough Part 1/2) - YouTube](https://www.youtube.com/watch?v=bOYE6E8JrtU&list=PL7m7hLIqA0hoIUPhC26ASCVs_VrqcDpAz&t=848s)

* The **residual stream** is the *central object* of a transformer.
* It is how the model remembers things, moves information between layers for composition, and moves information between positions (i.e. to other tokens).
* The sum of all previous layer outputs of the model is the input to each new layer (see the minimal sketch after this list).
* **Attention**
* moves information from prior positions in the sequence to the current token. This is done for *every* token in parallel using the same parameters.
* This produces an **attention pattern** for each destination token: a probability distribution over prior source tokens (including the current one) weighting how much information to copy from each.
* Fundamental point: Figuring out *which* source tokens to copy info from is a separate circuit from figuring out *how* to copy that information.
* Note that attention is the only bit of a transformer that moves information between positions.
* Made up of $n$ heads - each with their own parameters, own attention pattern, and own information on how to copy things from source to destination.
* The heads act independently and additively: we just add their outputs together and add the result back to the residual stream.
* For every pair of tokens we have a weight, called the attention pattern (from the destination token to the source token), that chooses how much information we copy from that token to the current one.
* The way that we copy information only depends on the learned parameters of the model. It does not depend on the destination token or the source token.
* However, *what information we copy does depend on the source token's residual stream.*
* Note: when we say copy we mean apply a linear map
* **MLP** - Multilayer perceptron
* Standard neural network. Single hidden layer. Linear map -> GELU activation -> linear map
* Middle dimension $d_{mlp} = 4 \times d_{model}$
* The ratios don't really matter that much
* Intuition - once attention has moved the relevant information to a single position in the residual stream, the MLPs can actually do computation, reasoning, look up information, etc
* This is a big open problem in mech interp!
* Underlying intuition: Linear map -> non-linearity -> linear map is the most powerful force in the universe and can approximate arbitrary functions.
* **Unembed**
* Apply a linear map going from final residual stream to a vector of logits. This is the output.
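A schematic sketch of the residual-stream view described in the list above. Shapes and module names here are illustrative (this is not TransformerLens or any real model's code), and layer norms are omitted for brevity:
```python
import torch
import torch.nn as nn

# d_mlp = 4 * d_model, matching the ratio mentioned above.
d_model, d_mlp, n_heads, T, d_vocab = 256, 1024, 8, 10, 5000

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.GELU(),
                                 nn.Linear(d_mlp, d_model))

    def forward(self, resid):
        # Attention reads from the residual stream, moves information between
        # positions, and its output is *added* back to the stream.
        attn_out, _ = self.attn(resid, resid, resid, need_weights=False)
        resid = resid + attn_out
        # The MLP acts on each position independently and is also added back.
        resid = resid + self.mlp(resid)
        return resid

embed = nn.Embedding(d_vocab, d_model)
unembed = nn.Linear(d_model, d_vocab, bias=False)   # final linear map to logits
blocks = nn.Sequential(*[Block() for _ in range(2)])

tokens = torch.randint(0, d_vocab, (1, T))
resid = embed(tokens)            # the residual stream starts as the embeddings
logits = unembed(blocks(resid))  # every layer's output is summed into the stream
print(logits.shape)              # torch.Size([1, 10, 5000])
```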
### General Intuitions
* Shapes are basically *variable types* for tensors
### Attention
One thing to observe about attention is that the attention scores are the dot products of the queries and keys. But we can see (check out the code) that `q`, `k` and `attention_scores` are just a string of 3 einsums. Einsums are just **linear maps**, and the [Composition of Linear Maps](Composition%20of%20Linear%20Maps.md) is just one big linear map. This means that the attention scores are just a [Bilinear map](Bilinear%20map.md) of the inputs, where we form the big matrix $W_Q W_K^T$. But the key idea is that the queries and keys are actually a *distraction*! The model has really learned a low-rank factorized $d_{model} \times d_{model}$ matrix for each head. The same thing happens with the values.
So it turns out that the only thing a head is doing is applying two low-rank factorized $d_{model} \times d_{model}$ matrices: one built from the query and key weights (which sets the attention pattern) and one built from the value and output weights (which determines what gets copied).
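A quick sketch verifying this point: the score computed via `q` and `k` equals a bilinear form in the two residual stream vectors through the single low-rank matrix $W_Q W_K^T$ (weights here are random, purely illustrative):
```python
import torch

d_model, d_head = 256, 32
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)
x_dst = torch.randn(d_model)   # residual stream at the destination position
x_src = torch.randn(d_model)   # residual stream at the source position

# The usual route: project to queries and keys, then take the dot product.
q = x_dst @ W_Q
k = x_src @ W_K
score_via_qk = q @ k

# The same number from one low-rank (rank <= d_head) d_model x d_model matrix.
W_QK = W_Q @ W_K.T
score_via_bilinear = x_dst @ W_QK @ x_src

print(torch.allclose(score_via_qk, score_via_bilinear, rtol=1e-3))  # True
```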
### Attention Lookups are just Ops
[Mat Kelcey : The map interpretation of attention - YouTube](https://youtu.be/7wMQgveLiQ4?t=863)

There are no parameters in our soft lookup. This is just an op - it is a way of **describing compute**. It says: "when you get a query, key and value, this is what you should output." This is just like an operator.
One interesting thing about this operator is that (when you think about how a map works) it is **invariant** to the order of the rows of the keys and values, so long as they are permuted together.
Another interesting note is that the op is not tied to the length of the keys or values: it works for any number of key/value pairs.
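A small sketch of the parameter-free lookup op and its permutation invariance (random numbers, purely illustrative):
```python
import numpy as np

# The soft lookup "op": no learned parameters, just a description of compute.
def soft_lookup(q, K, V):
    scores = q @ K.T
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(7, 4)), rng.normal(size=(7, 3))

perm = rng.permutation(7)                # shuffle the rows of K and V consistently
out = soft_lookup(q, K, V)
out_permuted = soft_lookup(q, K[perm], V[perm])
print(np.allclose(out, out_permuted))    # True: the op ignores row order
# Nothing ties it to 7 rows either; any number of key/value pairs works.
```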
### Encoder & Decoder
In the traditional context of NLP, where we have encoders and decoders, the bottleneck existed where we pass from the encoder to the decoder. All information had to flow through a single layer ([Mat Kelcey : The map interpretation of attention - YouTube](https://youtu.be/7wMQgveLiQ4?t=1158)).
Attention effectively allows us to sidestep this! The network can instead learn a function that allows a shortcut from the encoder to the decoder. So the information flow is more direct to the decoder, rather than having to go through the context. This removes the bottleneck. Now the **context** serves the purpose of *conditioning the function $Q$ to look back in the right place*, rather than containing all possible information needed!

---
Date: 20221010
Links to: [Mechanistic Interpretability](Mechanistic%20Interpretability.md)
Tags: #review
References:
* [Attention Is All You Need - YouTube](https://www.youtube.com/watch?v=iDulhoQ2pro)
* [GitHub - jessevig/bertviz: BertViz: Visualize Attention in NLP Models (BERT, GPT2, BART, etc.)](https://github.com/jessevig/bertviz)
* [Transformers from scratch | peterbloem.nl](https://peterbloem.nl/blog/transformers)
* [Mat Kelcey : The map interpretation of attention - YouTube](https://www.youtube.com/watch?v=7wMQgveLiQ4)