# Residual Connections

The reason for having residual connections in the Transformer is more of a technical necessity than an architectural design choice. Residual connections mainly help mitigate the vanishing gradient problem. During back-propagation, the signal gets multiplied by the derivative of the activation function. In the case of ReLU, this means that in approximately half of the cases the gradient is zero. Without residual connections, a large part of the training signal would get lost during back-propagation.

Residual connections reduce this effect because differentiation distributes over the sum: the gradient of `x + f(x)` contains an identity term in addition to the gradient through `f`, so each residual block also receives a signal that is not affected by the vanishing gradient. The summation operations of the residual connections form a path through the computation graph along which the gradient does not get lost.

Another effect of residual connections is that the information stays local in the Transformer layer stack. The self-attention mechanism allows arbitrary information flow in the network, and thus arbitrary permutation of the input tokens. The residual connections, however, always "remind" the representation of what the original state was. To some extent, the residual connections guarantee that the contextual representations of the input tokens really do represent those tokens.

![](Pasted%20image%2020230130064825.png)

In [Transformers](Transformers.md) this looks like:

![](Pasted%20image%2020230130064855.png)

### Inductive Bias

There is a clear **inductive bias** at play with ResNets: in a deep network, each layer only performs a small transformation, so we should bias each layer toward keeping its input the same and changing it only a little. The inherent bias, in other words, is the **identity transform** (see more [here](https://youtu.be/jltgNGt8Lpg?t=167)). A minimal code sketch of the residual gradient path appears after the references below.

---
Date: 20230130
Links to: [Neural Networks MOC](Neural%20Networks%20MOC.md) [Transformers](Transformers.md)
Tags:
References:
* [neural networks - Why are residual connections needed in transformer architectures? - Cross Validated](https://stats.stackexchange.com/questions/565196/why-are-residual-connections-needed-in-transformer-architectures)
* [Let's build GPT: from scratch, in code, spelled out. - YouTube](https://youtu.be/kCc8FmEb1nY?t=5357)
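
To make the gradient argument concrete, here is a minimal PyTorch sketch of a generic residual block (the `ResidualBlock` class and all sizes are illustrative assumptions, not taken from any specific implementation). The output is `x + f(x)`, so during back-propagation the gradient splits into two terms: one that flows through `f` and can shrink (e.g. ReLU zeroing roughly half the units), and one that flows through the identity path unchanged.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """A generic residual block: output = x + f(x).

    The identity path gives the gradient a route around the sublayer,
    so even if the gradient through f vanishes, the block still passes
    a clean signal back to earlier layers.
    """

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(x)


# Stack a few residual MLP blocks (hypothetical width and depth).
d_model = 64
blocks = nn.Sequential(*[
    ResidualBlock(nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.ReLU(),
        nn.Linear(4 * d_model, d_model),
    ))
    for _ in range(8)
])

x = torch.randn(2, d_model, requires_grad=True)
blocks(x).sum().backward()

# Because d(x + f(x))/dx = I + df/dx, the gradient at the input contains
# an identity term that is never multiplied by ReLU derivatives, so it
# does not vanish as depth grows.
print(x.grad.abs().mean())
```

This also illustrates the identity inductive bias: each block starts close to the identity map and only nudges its input by whatever the sublayer adds.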