Parsing - Nate's Notes

# Parsing **Parsing** can be thought of as the process of recognizing the *shape* of a particular input string. In computer science it has to do with a specific component inside of a [Compiler](Compiler.md). A compiler is essentially a *translator* that translates one language (the language of the programmer) into a systems language. > Parsing in the most general sense is anything that pulls structure out of a linear representation. ### Compiler Steps 1. So a compiler must have inputs and outputs. You have an input that is a string. 2. The first thing the compiler must do is **parse** the string. In order to handle the string for the best way of the system you have **lexical analysis** of the string. Say you have a given multiplication:`'50 x 10 = 500'`. Here there is a particular reference to how humans understand information. A computational way to understand this would be to analyze each term. Given a sentence such as: "My dad is coming home" you then parse that into verbs, adverbs, nouns, etc. We can think of this as a **tokenization process**. ![](Screen%20Shot%202021-08-03%20at%206.58.06%20AM.png) It needs to be able to do this because the tokens end up creating some sort of *data representation* (binary): ![](Screen%20Shot%202021-08-03%20at%206.59.32%20AM.png) This data representation, in order to be able to be translated, it needs to be able to be put through syntactical analysis. ### Lexer A **lexer** will go through the input character by character and break up the text into what we call **tokens**. A token is a simple object that has a *type* and (optionally) a *value*. This is necessary for us to identify what everything is. ### Parser The idea of the **parser** is to build up a syntax tree of the program from the *tokens* created by the *lexer*. > The **parser** builds the syntax tree. ##### Example 1 ![](Screen%20Shot%202021-08-03%20at%207.46.37%20AM.png) We can read the above tree as follows: * We start at the root node, the plus binary operation. This lets the program know that the node on the left and right need to be added together. * The program comes to the left node and prepares to add 1 to the right node. However, when it comes to the right node, it sees that it is a multiply operation. This means that more work needs to be done before it can be added to the number 1. * Then it looks at it's left and right node. It sees the left node having a value of 2 and the right node having a value of 3. It will multiply those numbers and get a result of 6. * This will "resolve" the multiply node, at which point the plus node can then add 1 and 6, yielding 7. ##### Example 2 Here by adding parenthesis we change the order quite significantly: ![](Screen%20Shot%202021-08-03%20at%207.50.59%20AM.png) ##### Back to the parser So the **parser** needs to: * Figure out if the tokens match our **language grammar** * If the tokens can be used to generate a tree accordingly For instance, two numbers followed by two plus symbols makes no sense in (most) languages: `123 534 ++`. This could not be parsed. However, a number followed by a plus symbol and then another number is perfectly valid: `123 + 456`. So, before we can ever build a parser we need to define our **grammar rules**. Consider the expression below: ![](Screen%20Shot%202021-08-03%20at%207.55.48%20AM.png) We know that the order of operations should be as follows: ![](Screen%20Shot%202021-08-03%20at%207.56.07%20AM.png) We can give a name to each of the *nodes* : ![](Screen%20Shot%202021-08-03%20at%207.57.13%20AM.png) ![](Screen%20Shot%202021-08-03%20at%207.59.17%20AM.png) Now if we pause for a second, realize our lexer is going to create a list of tokens. The idea is that the parser needs to be able to take these tokens and, following the rules of a predefined grammar (such as PEMDAS), create an **abstract syntax tree** from which we can execute. ### Abstract Syntax Tree vs. Parse Tree Here a parse tree (a **concrete syntax tree**): ![](Screen%20Shot%202021-08-03%20at%208.23.37%20AM.png) We also have seen trees where the operators are nodes. These are **abstract syntax trees**: ![](Screen%20Shot%202021-08-03%20at%208.23.51%20AM.png) We can clearly see below that the abstract syntax tree is just a condensed version of the parse tree: ![](Screen%20Shot%202021-08-03%20at%208.25.02%20AM.png) ![](Screen%20Shot%202021-08-03%20at%208.26.51%20AM.png) When we think about how we can encode an [AST](Abstract%20Syntax%20Trees.md), the simplest structure that we can use would be a json structure. Every node in the structure will have a type, ### Parsing Pipeline ![](Screen%20Shot%202021-08-03%20at%208.36.54%20AM.png) In theory, the purpose of the parser is validation-is a string excepted by a grammar or not? However, in practice the parser also produces the next intermediate representation, the AST ##### Grammar A grammar is a *set of restrictions* on top of the alphabet for a specific language: ![](Screen%20Shot%202021-08-03%20at%208.40.53%20AM.png) What is a **formal grammar**? A grammar is a tuple of 4 elements: ![](Screen%20Shot%202021-08-03%20at%208.43.59%20AM.png) ### EBNF ### Context Free Grammar See [Context-Free-Grammar](Context-Free-Grammar.md). --- Date: 20210803 Links to: [Computer Science MOC](Computer%20Science%20MOC.md) Tags: References: * [Parsing - Computerphile](https://www.youtube.com/watch?v=r6vNthpQtSI) * [Make your own programming language - Lexer](https://www.youtube.com/watch?v=Eythq9848Fg) * [Make your own programming language - Parser](https://www.youtube.com/watch?v=RriZ4q4z9gU) * [Parsing Algorithms - Abstract Syntax Trees](https://www.youtube.com/watch?v=VKM1eLoN-gI&t=490s) * [Formal grammars, context-free](https://www.youtube.com/watch?v=VZ5DJopq5JA&list=PLGNbPb3dQJ_6aPNnlBvXGyNMlDtNTqN5I&index=1) * [Parsing horrible things with python](https://www.youtube.com/watch?v=tCUdeLIj4hE)