# Beam Search

Consider a generative model that is producing a sequence of text - specifically, let's say it is translating a French sentence to an English one. Given the tokens of the French sentence, the model then needs to generate the English sentence one token at a time. There are several ways that it could do this. One way would be to try to translate it all at once. Another would be to generate the first English token, then the second, then the third, and so on - *without* having subsequent token generations be conditioned on previous ones. Mathematically we can define this as:

$p(y_1 | x) \rightarrow p(y_2 | x) \rightarrow \dots \rightarrow p(y_n | x)$

Where $x$ is our French sentence, $y_1, \dots, y_n$ is our output sentence, and we try to maximize the probability of the output tokens *one at a time*. So, we pick the token that maximizes $p(y_1 | x)$, then *without conditioning on what we picked for $y_1$*, we pick $y_2$ to maximize $p(y_2 | x)$.

**Beam Search** tries to consider far more information when selecting the $y$s so as to maximize the probability of the output sentence. First, it considers *multiple* values for $y_1$, not simply the one with the maximal probability. Say it considers 3 values (i.e., for the given French sentence $x$, maybe the 3 most probable words to start the English translation are *september*, *she*, *jane* - then beam search will effectively be a graph search that *conditions* subsequent probabilities on starting with each of these three words). The other thing it does (mentioned above) is that it *conditions* on the words chosen so far:

$p(y_1 | x) \rightarrow p(y_2 | y_1, x) \rightarrow \dots \rightarrow p(y_n | y_1, \dots, y_{n-1}, x)$
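
To make the search concrete, here is a minimal sketch in Python. It assumes a hypothetical `next_token_probs(x, prefix)` function that queries the model for the distribution $p(y_t | y_1, \dots, y_{t-1}, x)$, and an `<eos>` token marking the end of a sentence; the beam width $B$ (3 in the example above) controls how many partial translations are kept alive at each step.

```python
import math

def beam_search(x, next_token_probs, beam_width=3, max_len=10, eos="<eos>"):
    """Minimal beam search sketch.

    x                -- the source (e.g., tokenized French) sentence
    next_token_probs -- hypothetical model call: given (x, prefix) it returns
                        a dict mapping each candidate next token to
                        p(y_t | y_1..y_{t-1}, x)
    beam_width       -- number of partial translations kept at each step (B)
    """
    # Each beam entry is (log probability, tokens generated so far).
    # Summing log probabilities instead of multiplying raw probabilities
    # avoids numerical underflow on long sequences.
    beams = [(0.0, [])]
    completed = []

    for _ in range(max_len):
        candidates = []
        for log_p, prefix in beams:
            # Condition on the tokens chosen so far: p(y_t | y_1..y_{t-1}, x)
            for token, p in next_token_probs(x, prefix).items():
                candidates.append((log_p + math.log(p), prefix + [token]))
        # Keep only the B most probable partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for log_p, seq in candidates[:beam_width]:
            if seq[-1] == eos:
                completed.append((log_p, seq))
            else:
                beams.append((log_p, seq))
        if not beams:
            break

    # Fall back to unfinished beams if nothing reached <eos>.
    best = max(completed or beams, key=lambda c: c[0])
    return best[1]
```

Note that with `beam_width=1` this reduces to greedy decoding, i.e., always picking the single most probable next token given the tokens chosen so far.

---
Date: 20230903
Links to:
Tags:
References:
* [C5W3L03 Beam Search - YouTube](https://www.youtube.com/watch?v=RLWuzLLSIgw)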