# Machine Learning

### Key Idea

> **Key Idea**: If the rules that describe how **inputs** map to **outputs** are complex and full of special cases & exceptions, it is easier to provide **data** or **examples** than to implement those rules.

### Main Components

We can think of ML as generally having four main components:

1. **Learning Algorithm** (e.g. [Backpropagation](Backpropagation.md))
2. **Parameters** (weights, factors, etc.)
3. **Model** (RBMs, deep nets, linear regression, etc.)
4. **Objective Function** (squared error, max likelihood, cross entropy, i.e. a [Loss Function](Loss%20Function.md))

See the [Machine-Learning-Method](Machine-Learning-Method.md). A minimal sketch tying these four components together appears below.
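To make the four components concrete, here is a minimal sketch, assuming a toy 1-D dataset and illustrative names: the *model* is linear regression, the *parameters* are `w` and `b`, the *objective* is squared error, and the *learning algorithm* is plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative only): y is roughly 3x + 1 plus noise.
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 1 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0              # 2. Parameters

def model(x, w, b):          # 3. Model: linear regression
    return w * x + b

def objective(w, b):         # 4. Objective: mean squared error
    return np.mean((model(x, w, b) - y) ** 2)

lr = 0.1
for _ in range(200):         # 1. Learning algorithm: gradient descent
    err = model(x, w, b) - y
    w -= lr * np.mean(2 * err * x)   # d(MSE)/dw
    b -= lr * np.mean(2 * err)       # d(MSE)/db

print(w, b, objective(w, b))  # w ~ 3, b ~ 1, small loss
```

Swapping any one component (e.g. cross entropy for squared error, or a deep net for the linear model) leaves the other three roles unchanged, which is what makes this decomposition useful.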
### History

![](Screen-Shot-2021-11-19-at-64555-AM.png)

### Supervised Learning

* It often makes more sense to predict *probabilities/probability distributions* instead of *discrete labels*. This is a key distinction. It is also *easier to learn*, due to *smoothness*: intuitively, we can't change a discrete label "a tiny bit"; it's all or nothing. But we *can* change a probability "a tiny bit". So, given a training set, instead of learning a function that maps $X$ to $Y$, we are going to learn a function that maps $X$ to a **distribution over $Y$**. Formally, instead of learning:

  $f_{\theta}(x) \approx y$

  we are going to learn:

  $p_{\theta}(y \mid x)$

  (See the first sketch after this list.)
* [Discriminative-vs-Generative](Discriminative-vs-Generative.md)
* Why could realistically "any" function work to ensure that we have a valid probability distribution? After all, don't we want to learn the "right" probabilities? Don't we need a "special" function to learn the right probabilities? Why is it that we can use any function that makes them positive and sum to one? The answer is that your choice of function in machine learning isn't necessarily meant to produce the right answer on its own. You produce the right answer by choosing the right parameters. The function just needs to be *general* enough that there exists a choice of $\theta$ that represents the right answer. So as long as this function doesn't lose too much information, you could probably choose $\theta$ so that it outputs the right answer.
* Why do we choose the exponential function (softmax)? See [here](https://youtu.be/oLc822BT-K4?list=PL_iWQOsE6TfVmKkQHucjPAoRtIJYt8a5A&t=1073). Essentially, the exponential function is especially convenient because it is [one to one and onto](Injective-Surjective-Bijective.md): it maps the real number line onto the entire positive half of the real number line. In a sense it is the least restrictive way to turn any number into a positive number, because any real number will turn into a positive number, and there exists a real number that will produce any desired positive number when passed through the exponential function. ![](Screen-Shot-2021-11-24-at-121155-PM.png) The big idea here is that there isn't anything special about the softmax; we simply need the numbers to be positive and sum to 1. Softmax is just a convenient way of doing this. (See the second sketch after this list.)
* Consider the idea of [Machine Learning and Expressiveness](Machine-Learning-and-Expressiveness.md)
* Remember to consider [Empirical Risk and True Risk](Empirical-Risk-and-True-Risk.md)
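A minimal sketch of the "map $X$ to a distribution over $Y$" idea, assuming a toy linear model; the names here (`p_y_given_x`, `theta`) are illustrative, not from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): 4 input features, 3 possible labels.
num_features, num_classes = 4, 3
theta = rng.normal(size=(num_features, num_classes))  # the parameters

def p_y_given_x(x, theta):
    """Map an input x to a *distribution* over labels, p_theta(y | x)."""
    logits = x @ theta                   # unconstrained real-valued scores
    exp = np.exp(logits - logits.max())  # exponentiate (shifted for stability)
    return exp / exp.sum()               # normalize so probabilities sum to 1

x = rng.normal(size=num_features)
probs = p_y_given_x(x, theta)
print(probs, probs.sum())  # a valid distribution; sums to 1.0

# The objective (cross entropy / negative log likelihood) is smooth in theta,
# unlike a 0/1 loss on a hard label.
y_true = 1
loss = -np.log(probs[y_true])
print(loss)
```

Because `probs` and the negative log likelihood vary smoothly with `theta`, a small parameter change produces a small loss change, which is exactly the smoothness argument in the first bullet above.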
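And a sketch of the "nothing special about softmax" point: any map from real scores to positive numbers summing to 1 yields a valid distribution. The `squared_normalize` alternative below is purely hypothetical, included to contrast with exp-then-normalize:

```python
import numpy as np

scores = np.array([-2.0, 0.5, 3.0])  # arbitrary real-valued model outputs

def softmax(z):
    # exp is one-to-one and onto from R to (0, inf): every real maps to a
    # positive number, and every positive number is reachable.
    e = np.exp(z - z.max())
    return e / e.sum()

def squared_normalize(z):
    # A hypothetical alternative: squares are non-negative and normalizing
    # makes them sum to 1 -- but squaring collapses z and -z, losing sign
    # information, so it is more restrictive than exp.
    s = z ** 2
    return s / s.sum()

print(softmax(scores))            # valid distribution over 3 outcomes
print(squared_normalize(scores))  # also a valid distribution
```

Both outputs are legitimate probability distributions; softmax is preferred because the exponential loses no information about the scores, so some choice of $\theta$ can still reach any desired distribution.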
### Summary

![](Screen%20Shot%202021-12-03%20at%207.25.00%20AM.png)

---
Date: 20211118
Links to: [AI MOC](AI%20MOC.md) [003-Data-Science-MOC](003-Data-Science-MOC.md)
Tags:

References:
* [Solid overview](https://www.youtube.com/watch?v=FHsGHxQYxvc&list=PL_iWQOsE6TfVmKkQHucjPAoRtIJYt8a5A&index=2)