# Beam Search
Consider a generative model that is producing a sequence of text - specifically, let's say it is translating a French sentence into an English one. Given the tokens of the French sentence, the model needs to generate the English sentence one token at a time.
There are several ways it could do this. One way would be to try to translate the whole sentence at once. Another would be to generate the first English token, then the second, then the third, and so on - *without* having subsequent token generations depend on the previous ones. Mathematically we can define this as:
$p(y_1 | x) \rightarrow p(y_2 | x) \dots \rightarrow p(y_n | x)$
Where $x$ is our French sentence, $y_1, \dots, y_n$ is our output sentence, and we try to maximize the probability of the output sequence *one token at a time*. So, we pick the token that maximizes $p(y_1 | x)$, then *without conditioning on what we picked for $y_1$*, we pick $y_2$ to maximize $p(y_2 | x)$.
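To make this concrete, here is a minimal sketch of that independent, one-token-at-a-time selection. The `model.token_logits(x, t)` interface is a hypothetical stand-in for whatever call returns the vocabulary scores for position $t$ given only the source sentence:

```python
import torch

def decode_independently(model, x, max_len):
    """Pick each output token by maximizing p(y_t | x) on its own.

    `model.token_logits(x, t)` is a hypothetical method returning the
    vocabulary logits for position t given only the source sentence x.
    """
    output = []
    for t in range(max_len):
        logits = model.token_logits(x, t)         # scores p(y_t | x), ignoring y_1..y_{t-1}
        output.append(int(torch.argmax(logits)))  # greedy pick, no conditioning on earlier picks
    return output
```

Note that each position is chosen in isolation, so nothing stops the output from being globally incoherent even if every individual token looks likely on its own.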
**Beam Search** tries to consider far more information when selecting each output token $y_i$.
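As a rough preview, here is a minimal sketch of the idea, assuming a hypothetical `next_log_probs(x, prefix)` callable that returns `{token_id: log p(y_t | x, prefix)}` for the next position; `beam_size` and `eos_id` are illustrative parameters:

```python
def beam_search(next_log_probs, x, max_len=20, beam_size=3, eos_id=1):
    """Keep the `beam_size` highest-scoring partial translations at each step."""
    beams = [([], 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            # extend every surviving hypothesis by every possible next token
            for token, logp in next_log_probs(x, prefix).items():
                candidates.append((prefix + [token], score + logp))
        # keep only the top `beam_size` hypotheses overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```

Unlike the independent decoder above, each candidate's score accumulates the log-probabilities of all of its tokens, so the search compares whole partial sentences rather than single positions in isolation.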