# Traditional vs Probabilistic Machine Learning
In traditional machine learning we have a set of features, $X$, and a target, $Y$. Let's say we have $n$ specific examples.
We say that there is a function, $f$, relating $X$ and $Y$:
$f: X \rightarrow Y$
But we do not know $f$, so we wish to *learn* an approximation, $\hat{f}$. There are of course many ways to do this. We can prescribe a *functional form*, such as a *linear model*, that has a set of *parameters* we must learn. How do we know which parameter values to pick? We choose the parameters that *minimize a loss function*, $L$ (where the loss is simply a function of the predictions made via $\hat{f}$ and the true targets, $Y$). This process of tuning function parameters in order to minimize a loss is a form of *optimization*.
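To make that concrete, here is a minimal sketch of the traditional loop under some illustrative assumptions (a 1-D linear model, a squared-error loss, plain gradient descent, and made-up toy data, none of which come from the references):

```python
import numpy as np

# Toy data: the "true" f is linear, observed with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=100)
Y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=100)

# Our prescribed functional form f_hat(x) = w*x + b, with parameters w and b.
w, b = 0.0, 0.0
lr = 0.01

for _ in range(2000):
    Y_hat = w * X + b                         # predictions from f_hat
    grad_w = np.mean(2 * (Y_hat - Y) * X)     # d(MSE)/dw
    grad_b = np.mean(2 * (Y_hat - Y))         # d(MSE)/db
    w -= lr * grad_w                          # tune parameters to minimize the loss
    b -= lr * grad_b

print(w, b)   # should land near the true values (2.0, 1.0)
```

The only outputs here are point predictions $\hat{y} = \hat{f}(x)$; nothing about the model expresses uncertainty in those predictions.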
Now at no point in the above formulation did probability (explicitly) enter the picture. However, we can formulate this from a probabilistic perspective! We can do this via [Maximum Likelihood](Maximum%20Likelihood.md), which effectively can be thought of as:
> Choose the model parameter(s) so that the observed data has the highest likelihood.
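Stated a bit more formally (with a generic parameter vector $\theta$, my notation rather than the references'): given observed pairs $(x_i, y_i)$, we pick

$$\hat{\theta} = \underset{\theta}{\arg\max} \prod_{i=1}^{n} p(y_i \mid x_i, \theta) = \underset{\theta}{\arg\max} \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)$$

In practice we work with the log-likelihood (the sum on the right), since a product of many small probabilities underflows and a sum is easier to differentiate.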
But wait, what parameters? Well, if we switch to the probabilistic context, we now say that our target follows some probability distribution, such as:
$y \sim \mathcal{N}(\mu, \sigma)$
So, our model $\hat{f}$ will no longer predict a *point estimate*, but rather a *distribution*! How exactly does that work? It's actually quite straightforward! It can be thought of as follows:
1. Our model $\hat{f}$ will take as input an $x \in X$ and will now output *two* real numbers, $\mu$ and $\sigma$! These two numbers are the parameters of the normal distribution that we said $y$ takes on.
2. We plug these two predicted numbers in and get our normal distribution, which is our $\hat{y}$. At the same time, we know the true $y$. So we can evaluate the *likelihood*: *how likely it was to observe the given point $y$ if the true distribution of $y$ was $\mathcal{N}(\mu, \sigma)$*.
3. We can then iteratively tune our model parameters so that we predict distributions that maximize the likelihood of observing the data we did (see the sketch below)!
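Here is a minimal, dependency-free sketch of that loop (the reference book does this with TensorFlow Probability; here the mean is assumed linear in $x$, a single $\sigma$ is learned for all points, and the toy data and names are my own). Minimizing the average *negative log-likelihood* under $\mathcal{N}(\mu, \sigma)$ is exactly maximizing the likelihood:

```python
import numpy as np

# Toy data: mean is linear in x, noise has standard deviation 0.7.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=200)
Y = 2.0 * X + 1.0 + rng.normal(0, 0.7, size=200)

# f_hat outputs distribution parameters: mu(x) = w*x + b, sigma = exp(log_sigma).
w, b = 0.0, 0.0
log_sigma = 0.0            # learn log(sigma) so sigma stays positive
lr = 0.05

for _ in range(5000):
    mu = w * X + b
    sigma = np.exp(log_sigma)
    resid = Y - mu
    # Negative log-likelihood per point:
    #   nll_i = log(sigma) + 0.5*log(2*pi) + (y_i - mu_i)^2 / (2*sigma^2)
    grad_mu = -resid / sigma**2                           # d(nll_i)/d(mu_i)
    grad_w = np.mean(grad_mu * X)
    grad_b = np.mean(grad_mu)
    grad_log_sigma = np.mean(1.0 - resid**2 / sigma**2)   # d(nll_i)/d(log sigma)
    w -= lr * grad_w
    b -= lr * grad_b
    log_sigma -= lr * grad_log_sigma

print(w, b, np.exp(log_sigma))   # should land near (2.0, 1.0, 0.7)
```

For a new input $x$ the model now returns a whole distribution, $\mathcal{N}(w x + b,\ \sigma)$, rather than a single number, which is exactly the shift from point estimates to distributions described above.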
---
Date: 20230519
Links to: [Machine-Learning](Machine-Learning.md) [Probabilistic Deep Learning](Probabilistic%20Deep%20Learning.md)
Tags:
References:
* [Introduction to Bayesian Linear Regression | by Will Koehrsen | Towards Data Science](https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7)
* Probabilistic Deep Learning with Tensorflow Probability, page 94, 126