Coming from a software engineering background, I find it helpful to think in terms of the following analogies:

* **DNA**: Similar to *disk* or *storage*. This is your genetic code.
* **RNA**: Similar to *RAM*. It is a working copy read from DNA, which is then used to assemble *amino acids* into a chain.
* **Amino Acids**: Similar to *syntax*. They can be thought of as building blocks.
* **Protein**: Similar to a biological *program*. Composed of many amino acids.

So a protein itself is a sequence of amino acids. But this sequence has a *shape* in 3 dimensions. We can refer to this as its *structure*. This structure determines the protein's biological function.

![400](Pasted%20image%2020221205065238.png)

The reason that **AlphaFold2** is such a big deal is that it allows us to predict the *structure* (and, by extension, the *function*) of a protein, given its *sequence*.

![](alphafold2_proteins.gif)

### Amino Acids

There are 20 types of amino acids. They all share the same basic structure:

* An $\alpha$ carbon
* An amino group
* A carboxyl group
* A side chain (the part that differs from one amino acid to the next)

![](Pasted%20image%2020221205065719.png)

However, how this works doesn’t really matter! Just as we do not need to deeply know how circuits work in order to program computers, we can abstract this away.

> The **layers of abstraction** enable us to not focus on the chemistry of amino acids.

So instead we can effectively think of each amino acid as a *letter*:

![](Pasted%20image%2020221205070054.png)

So a protein (a composition of amino acids) could look like: `RHKDE`.

### Proteins

When we talk about the structure of proteins there are four levels that we really think about:

1. **Primary structure**: The sequence of amino acids comprising the protein. We can think of this as a **vector**.
2. **Secondary structure**: Depending on how they are bound together, hydrogen bonds make the chain of amino acids fold into repeating patterns. These can be either alpha helices or pleated sheets.
3. **Tertiary structure**: The 3-dimensional shape of a protein. If you ever hear the term *contact prediction*, it simply means predicting the points at which the protein folds onto itself. This is where most of the focus is in papers today.
4. **Quaternary structure**: Similar to tertiary structure, but describes the case where we are dealing with more than one amino acid chain.

![](Pasted%20image%2020221205070251.png)

For the purposes of AlphaFold2 we will mainly be discussing primary and tertiary structure.

### Vocabulary Size

We can note that the difference between DNA and a protein from a syntax perspective is:

* Amino acids have a vocabulary size of 20
* DNA has a vocabulary size of 4 (`ACTG`, where each element is known as a *nucleotide*)

Now of course these *characters* that we are referencing are just an *abstraction* for molecules. Nature doesn’t really have `ACTG`'s. But this abstraction is helpful because we can then treat DNA as code instead of treating it just as a bunch of molecules.

Note that if we take three nucleotides they will make up a *codon*, e.g. `AUG` (written here in the RNA alphabet, where `U` takes the place of `T`). Every codon encodes a specific amino acid. If we have 3 slots and 4 items to fill them with, that yields $4^3 = 64$ combinations. Yet there are only 20 different amino acids. This just means that multiple codons encode the same amino acid (a form of biological redundancy, similar to humans having two kidneys or lungs).

![](Pasted%20image%2020221206073823.png)

### Bert on Genes

Now given a sequence of amino acids that comprise a protein, such as `RHKDE`, the question we wish to ask is: can we take this *sequence* and turn it into a *structure* by using a neural network that works really well over sequences, such as a [transformer](Transformers.md)?

We can state our problem as:

> Given a *sequence* can we predict the *contact points* using a *transformer*? Which amino acids are paying attention to which other amino acids in other layers?

### Thinking about Proteins

You can think of a protein as a sequence, but in terms of tertiary structure we can think of it as a 3-d shape. And a 3-d shape can be thought of as a *point cloud*.
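To make the point-cloud view and the idea of *contact points* a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the coordinates for `RHKDE` are made up (they are not a real structure), the helper names `contact_map` and `contact_pairs` are hypothetical, and the distance threshold of `5.0` is in arbitrary toy units. It is not how AlphaFold2 works; it only shows what a contact map *is* once you have one 3-d point per amino acid.

```python
import math

# Toy example: one made-up 3-d point per amino acid in the sequence.
# The chain runs along the x axis and then bends back on itself (a "U" shape).
sequence = "RHKDE"
coords = [
    (0.0, 0.0, 0.0),  # R
    (4.0, 0.0, 0.0),  # H
    (8.0, 0.0, 0.0),  # K
    (8.0, 4.0, 0.0),  # D
    (4.0, 4.0, 0.0),  # E
]


def contact_map(points, threshold=5.0, min_separation=2):
    """Return a symmetric 0/1 matrix marking pairs of points within `threshold`.

    Pairs fewer than `min_separation` positions apart in the sequence are
    skipped, because direct chain neighbours are always close to each other.
    Both defaults are assumptions chosen for this toy example.
    """
    n = len(points)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(points[i], points[j]) <= threshold:
                contacts[i][j] = contacts[j][i] = 1
    return contacts


def contact_pairs(seq, contacts):
    """List each contact once as (letter, index, letter, index)."""
    n = len(seq)
    return [
        (seq[i], i, seq[j], j)
        for i in range(n)
        for j in range(i + 1, n)
        if contacts[i][j]
    ]


cm = contact_map(coords)
print(contact_pairs(sequence, cm))  # [('H', 1, 'E', 4)]
```

In this toy fold the only contact is between `H` and `E`: they are three positions apart in the sequence, yet close in space because the chain bends back. Contact prediction is the reverse problem: given only `RHKDE`, guess that matrix.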

---
Date: 20221205
Links to:
Tags: #review

References:
* [Alphafold2 for biology n00bs - YouTube](https://www.youtube.com/watch?v=ZiuNC7Q1pZw&list=PLwcClAaLqrJy4wrVTB43HDlnzW-d20fUt&index=9&t=718s)
* [Introduction to proteins and amino acids (article) | Khan Academy](https://www.khanacademy.org/science/biology/macromolecules/proteins-and-amino-acids/a/introduction-to-proteins-and-amino-acids)
* [Reverse Engineering the source code of the BioNTech/Pfizer SARS-CoV-2 Vaccine - Bert Hubert's writings](https://berthub.eu/articles/posts/reverse-engineering-source-code-of-the-biontech-pfizer-vaccine/)
* [https://arxiv.org/pdf/2006.15222.pdf](https://arxiv.org/pdf/2006.15222.pdf)
* [https://www.biorxiv.org/content/10.1101/2020.12.28.424589v1.full.pdf](https://www.biorxiv.org/content/10.1101/2020.12.28.424589v1.full.pdf)
* [AlphaFold2 @ CASP14: “It feels like one’s child has left home.” « Some Thoughts on a Mysterious Universe](https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/)
* [Fabian Fuchs](https://fabianfuchsml.github.io/alphafold2/)