# Parsing
**Parsing** can be thought of as the process of recognizing the *shape* of a particular input string. In computer science it has to do with a specific component inside of a [Compiler](Compiler.md). A compiler is essentially a *translator* that translates one language (the language of the programmer) into a systems language.
> Parsing in the most general sense is anything that pulls structure out of a linear representation.
### Compiler Steps
1. So a compiler must have inputs and outputs. You have an input that is a string.
2. The first thing the compiler must do is **parse** the string. In order to handle the string for the best way of the system you have **lexical analysis** of the string. Say you have a given multiplication:`'50 x 10 = 500'`. Here there is a particular reference to how humans understand information. A computational way to understand this would be to analyze each term. Given a sentence such as: "My dad is coming home" you then parse that into verbs, adverbs, nouns, etc. We can think of this as a **tokenization process**.

It needs to be able to do this because the tokens end up creating some sort of *data representation* (binary):

This data representation, in order to be able to be translated, it needs to be able to be put through syntactical analysis.
### Lexer
A **lexer** will go through the input character by character and break up the text into what we call **tokens**. A token is a simple object that has a *type* and (optionally) a *value*. This is necessary for us to identify what everything is.
### Parser
The idea of the **parser** is to build up a syntax tree of the program from the *tokens* created by the *lexer*.
> The **parser** builds the syntax tree.
##### Example 1

We can read the above tree as follows:
* We start at the root node, the plus binary operation. This lets the program know that the node on the left and right need to be added together.
* The program comes to the left node and prepares to add 1 to the right node. However, when it comes to the right node, it sees that it is a multiply operation. This means that more work needs to be done before it can be added to the number 1.
* Then it looks at it's left and right node. It sees the left node having a value of 2 and the right node having a value of 3. It will multiply those numbers and get a result of 6.
* This will "resolve" the multiply node, at which point the plus node can then add 1 and 6, yielding 7.
##### Example 2
Here by adding parenthesis we change the order quite significantly:

##### Back to the parser
So the **parser** needs to:
* Figure out if the tokens match our **language grammar**
* If the tokens can be used to generate a tree accordingly
For instance, two numbers followed by two plus symbols makes no sense in (most) languages: `123 534 ++`. This could not be parsed. However, a number followed by a plus symbol and then another number is perfectly valid: `123 + 456`.
So, before we can ever build a parser we need to define our **grammar rules**. Consider the expression below:

We know that the order of operations should be as follows:

We can give a name to each of the *nodes* :


Now if we pause for a second, realize our lexer is going to create a list of tokens. The idea is that the parser needs to be able to take these tokens and, following the rules of a predefined grammar (such as PEMDAS), create an **abstract syntax tree** from which we can execute.
### Abstract Syntax Tree vs. Parse Tree
Here a parse tree (a **concrete syntax tree**):

We also have seen trees where the operators are nodes. These are **abstract syntax trees**:

We can clearly see below that the abstract syntax tree is just a condensed version of the parse tree:


When we think about how we can encode an [AST](Abstract%20Syntax%20Trees.md), the simplest structure that we can use would be a json structure. Every node in the structure will have a type,
### Parsing Pipeline

In theory, the purpose of the parser is validation-is a string excepted by a grammar or not? However, in practice the parser also produces the next intermediate representation, the AST
##### Grammar
A grammar is a *set of restrictions* on top of the alphabet for a specific language:

What is a **formal grammar**? A grammar is a tuple of 4 elements:

### EBNF
### Context Free Grammar
See [Context-Free-Grammar](Context-Free-Grammar.md).
---
Date: 20210803
Links to: [Computer Science MOC](Computer%20Science%20MOC.md)
Tags:
References:
* [Parsing - Computerphile](https://www.youtube.com/watch?v=r6vNthpQtSI)
* [Make your own programming language - Lexer](https://www.youtube.com/watch?v=Eythq9848Fg)
* [Make your own programming language - Parser](https://www.youtube.com/watch?v=RriZ4q4z9gU)
* [Parsing Algorithms - Abstract Syntax Trees](https://www.youtube.com/watch?v=VKM1eLoN-gI&t=490s)
* [Formal grammars, context-free](https://www.youtube.com/watch?v=VZ5DJopq5JA&list=PLGNbPb3dQJ_6aPNnlBvXGyNMlDtNTqN5I&index=1)
* [Parsing horrible things with python](https://www.youtube.com/watch?v=tCUdeLIj4hE)