Background - Probability Theory Review
One of the most common tasks in machine learning is to produce a system that can make predictions based on observed data. This is done by building a model that describes the relationship between the observed data and the quantity we would like to predict. One of the common ways to describe this relationship is through a probabilistic model.
A large portion of this course will be devoted to the problem of building the model. In this tutorial, however, we will only focus on the part of making predictions based on a given probabilistic model.
🎰 Basic Concepts
We shall start with a quick recap of the basics of probability theory. As an example, we will take a look at the following random phenomenon:
We take a glass full of juice and pour it on the floor (don’t try this at home) and look at the shape of the resulting spill.
Below is a table and a schematic diagram of some probabilistic concepts which we can define related to this phenomenon:
Name | Usually denoted by | Definition | Example |
---|---|---|---|
A random phenomenon | - | Some process which generates random outcomes | Spilling a glass of juice on the floor and examining the shape of the spill |
A sample | $\omega$ | A single outcome of the process | Some specific spill shape |
Sample space | $\Omega$ | The space of all possible outcomes of the given process | The space of all possible spill shapes |
Random variables (RVs) | $X, Y, \ldots$ | A function which assigns a real number to a given sample | A function $X$ which returns the perimeter of a spill; a function $Y$ which returns the area of a spill |
An event | $A, B, \ldots$ | A collection of samples, i.e., a subset of the sample space $\Omega$. We will often define an event through a condition on random variables. | The collection of all spills with a perimeter smaller than 2: $A = \{\omega : X(\omega) < 2\}$; the collection of all spills with an area larger than 1: $B = \{\omega : Y(\omega) > 1\}$ |
Event space | $\mathcal{F}$ | A space of events | The space of all possible collections of spill shapes |
Probability measure | $P(A)$ | A function which returns the probability of a random sample to be an element of some event | The probability of a spill to have a perimeter smaller than 2: $P(X < 2)$ |
Conditional probability measure | $P(A \mid B)$ | A function which returns the probability of a random sample to be an element of event $A$ given that it is an element of event $B$ | The probability of a spill to have a perimeter smaller than 2, given that it has an area larger than 1: $P(X < 2 \mid Y > 1)$ |
- In the last two rows, we have used $P(X < 2)$ as a shorthand for $P(\{\omega : X(\omega) < 2\})$. This is, in fact, a common shorthand which we will be using from here on.
Note that by definition, any function of a random variable (RV) is also an RV. For example, $X^2$ is itself an RV, as is $g(X)$ in general for any function $g$.
Realizations
The outcome of an RV (random variable) for a given sample is called a realization. We will use upper case letters to denote RVs and the equivalent lower case letters to denote their realizations. I.e., for the RV $X$ we will have the realization $x = X(\omega)$.
Slight abuse of notation: In many cases, we will use the name samples to refer to the realizations (and not to the samples from the sample space).
Random Vectors
We will usually be interested in working with a series of RVs. In such cases, we will pack them together in a vector, called a random vector.
We will use bold letters to denote vectors, and will usually define them as column vectors:

$$\boldsymbol{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}$$

and its realization:

$$\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
Toy Model: The 🍕 Pizza & Pancakes 🥞 Drive-Thru
We would like to help a new local drive-thru business that sells pizzas and pancakes. The owner would like to be able to make some predictions regarding the business.
✍️ Exercise 1.1 - Warm-Up
The random phenomenon we will be interested in is that of a customer making an order. Our sample will be a single customer making an order. Give examples for:
- 2 discrete random variables
- 2 continuous random variables
- 2 events
- 2 possible probability measures.
- A possible probability measure using the union of the 2 events.
- A possible probability measure using the intersection of the 2 events.
- A possible probability measure using the exclusion of the 2 events.
- A possible conditional probability measure.
Make sure that all the probability measures are consistent with one another.
💡 Solution
- 2 discrete random variables
  - The number of pizza slices the customer ordered. 🍕
  - The number of pancakes the customer ordered. 🥞
- 2 continuous random variables
  - The length of the customer's car in meters. 🚗
  - The amount of tip the customer left in dollars. 💵
- 2 events
  - The customer ordered exactly 1 slice of pizza.
  - The customer left more than a $2 tip.
- 2 possible probability measures.
  - The probability that the customer ordered 1 slice.
  - The probability that the customer left more than a $2 tip.
- A possible probability measure using the union of the 2 events.
  - The probability that the customer either ordered 1 slice or left more than a $2 tip.
- An example of a possible probability measure using the intersection of the 2 events.
  - The probability that the customer ordered 1 slice and left more than a $2 tip.
- An example of a possible probability measure using the exclusion of the 2 events.
  - The probability that the customer ordered 1 slice but did not leave more than a $2 tip.
- An example of a possible conditional probability measure.
  - The probability that the customer ordered 1 slice given that he left more than a $2 tip.
📊 Distributions
The following definitions/rules describe the case of a random vector, but they hold just as well for a scalar RV in the degenerate case of a random vector of length 1.
Distribution Functions
The distribution of a random vector describes the probability that its realizations will be equal to some given values, or fall within some range of values. The distribution can be described by one of the following functions:
A Cumulative Distribution Function (CDF)
The CDF of a random vector $\boldsymbol{X}$ is usually denoted by $F_{\boldsymbol{X}}(\boldsymbol{x})$ and is defined as:

$$F_{\boldsymbol{X}}(\boldsymbol{x}) = P(X_1 \leq x_1, X_2 \leq x_2, \ldots, X_n \leq x_n)$$
For a Discrete RV/Random Vector: A Probability Mass Function (the PMF)
The PMF of a discrete random vector is usually denoted by $p_{\boldsymbol{X}}(\boldsymbol{x})$ or $P_{\boldsymbol{X}}(\boldsymbol{x})$ and is defined as:

$$p_{\boldsymbol{X}}(\boldsymbol{x}) = P(\boldsymbol{X} = \boldsymbol{x}) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$$
For a Continuous or Mixed RV: A Probability Density Function (PDF)
The PDF of a continuous random vector is also usually denoted by $f_{\boldsymbol{X}}(\boldsymbol{x})$ or $p_{\boldsymbol{X}}(\boldsymbol{x})$.

In the cases where the CDF of a random vector is differentiable, the PDF is defined as its derivative:

$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{\partial^n}{\partial x_1 \partial x_2 \cdots \partial x_n} F_{\boldsymbol{X}}(\boldsymbol{x})$$

Otherwise, it is defined through the following integral:

$$F_{\boldsymbol{X}}(\boldsymbol{x}) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \cdots \int_{-\infty}^{x_n} f_{\boldsymbol{X}}(\boldsymbol{u}) \, du_n \cdots du_2 \, du_1$$
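As a quick numerical sanity check, here is a minimal Python sketch (using NumPy and SciPy) verifying that integrating the PDF of a scalar normal RV recovers its CDF:

```python
import numpy as np
from scipy import stats

# A scalar RV with a standard normal distribution.
x_rv = stats.norm(loc=0.0, scale=1.0)

x_grid = np.linspace(-4, 4, 1001)
pdf = x_rv.pdf(x_grid)   # f_X(x)
cdf = x_rv.cdf(x_grid)   # F_X(x)

# The CDF is the integral of the PDF (and the PDF is its derivative).
cdf_from_pdf = np.cumsum(pdf) * (x_grid[1] - x_grid[0])
print(np.max(np.abs(cdf - cdf_from_pdf)))  # prints a small discretization error
```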
Conditional Distribution
We will also be interested in the distribution of a random vector conditioned on some other random vector. Similarly to the regular distributions, the conditional distribution is defined using the conditional probability:

CDF:

$$F_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x}|\boldsymbol{y}) = P\left(X_1 \leq x_1, \ldots, X_n \leq x_n \mid \boldsymbol{Y} = \boldsymbol{y}\right)$$

PMF:

$$p_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x}|\boldsymbol{y}) = P\left(\boldsymbol{X} = \boldsymbol{x} \mid \boldsymbol{Y} = \boldsymbol{y}\right)$$
Some Basic Rules
The law of total probability:

$$p_{\boldsymbol{X}}(\boldsymbol{x}) = \sum_{\boldsymbol{y}} p_{\boldsymbol{X},\boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y})$$

In cases such as this, where $p_{\boldsymbol{X}}$ represents the distribution of only a subset of the RVs, we refer to it as the marginal distribution.

The sum over $\boldsymbol{y}$ represents a sum over all the possible values $\boldsymbol{Y}$ can take (without repetitions).

The Conditional Distribution's Definition:

$$p_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x}|\boldsymbol{y}) = \frac{p_{\boldsymbol{X},\boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y})}{p_{\boldsymbol{Y}}(\boldsymbol{y})}$$

Bayes' Theorem:

By using the above relation for both $p_{\boldsymbol{X}|\boldsymbol{Y}}$ and $p_{\boldsymbol{Y}|\boldsymbol{X}}$ we can derive the following useful relationship:

$$p_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x}|\boldsymbol{y}) = \frac{p_{\boldsymbol{Y}|\boldsymbol{X}}(\boldsymbol{y}|\boldsymbol{x}) \, p_{\boldsymbol{X}}(\boldsymbol{x})}{p_{\boldsymbol{Y}}(\boldsymbol{y})}$$
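To make these three rules concrete, here is a minimal Python sketch that checks them on a small, made-up joint PMF (the numbers are arbitrary):

```python
import numpy as np

# A small (hypothetical) joint PMF p_{X,Y}(x, y); rows index x, columns index y.
p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

# Law of total probability: marginals are obtained by summing out the other RV.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Conditional distribution definition: p_{X|Y}(x|y) = p_{X,Y}(x,y) / p_Y(y)
p_x_given_y = p_xy / p_y[None, :]

# Bayes' theorem: p_{X|Y}(x|y) = p_{Y|X}(y|x) * p_X(x) / p_Y(y)
p_y_given_x = p_xy / p_x[:, None]
bayes = p_y_given_x * p_x[:, None] / p_y[None, :]
print(np.allclose(p_x_given_y, bayes))  # True
```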
Back to The 🍕 Pizza & Pancakes 🥞 Drive-Thru
✍️ Exercise 1.2 - Discrete Distributions
We are given the following joint distribution of the number of slices and the number of pancakes customers buy:
🍕 \ 🥞 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 0.15 | 0.2 | 0.05 |
1 | 0.08 | 0.03 | ??? | 0.04 |
2 | 0.02 | 0.03 | 0.04 | 0.01 |
3 | 0.1 | 0.05 | 0.05 | 0.1 |
(We will assume that 3 is the maximal number of pizza slices or pancakes a customer will buy.)
A) What is the missing number in the table?
B) What is the probability that a customer will buy 1 slice of pizza, given that he is not going to buy any pancakes?
C) What is the probability that 2 independent customers will buy 3 pizza slices in total?
💡 Solution
A) The sum over all possible events should always be equal to 1, therefore the missing value must be:

$$1 - 0.95 = 0.05$$

(where $0.95$ is the sum of all the other entries in the table).

B) By definition, the conditional probability is the joint probability divided by the probability of buying no pancakes (the sum of the first column of the table):

$$\frac{0.08}{0 + 0.08 + 0.02 + 0.1} = \frac{0.08}{0.2} = 0.4$$

C) Since the orders of the two customers are independent, the probability of any pair of orders is given by the product of the two probabilities. In order to calculate the probability that they buy a total of 3 pizza slices, we must sum over all the combinations which sum up to 3 slices (customer 1: 0 & customer 2: 3, customer 1: 1 & customer 2: 2, etc.).

We shall start by calculating the marginal distribution of the number of pizza slices a single customer buys, by summing each row of the table: $0.4$, $0.2$, $0.1$ and $0.3$ for $0$, $1$, $2$ and $3$ slices respectively.

Therefore the probability of two customers buying 3 slices in total is:

$$0.4 \cdot 0.3 + 0.2 \cdot 0.1 + 0.1 \cdot 0.2 + 0.3 \cdot 0.4 = 0.28$$
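We can verify these calculations with a short Python sketch using the table above:

```python
import numpy as np

# Joint PMF from the table: rows = number of pizza slices, columns = number of pancakes.
p = np.array([[0.00, 0.15, 0.20, 0.05],
              [0.08, 0.03, 0.05, 0.04],   # 0.05 is the missing value from part A
              [0.02, 0.03, 0.04, 0.01],
              [0.10, 0.05, 0.05, 0.10]])
assert np.isclose(p.sum(), 1.0)

# Part B: P(1 slice | 0 pancakes) = p(1, 0) / P(0 pancakes)
print(p[1, 0] / p[:, 0].sum())  # 0.4

# Part C: marginal over pancakes, then sum over all pairs of independent customers
# whose slice counts add up to 3.
p_slices = p.sum(axis=1)        # [0.4, 0.2, 0.1, 0.3]
total = sum(p_slices[k] * p_slices[3 - k] for k in range(4))
print(total)                    # 0.28
```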
✍️ Exercise 1.3 - Mixed Distributions
As it turns out, there is a correlation between the length of a customer’s car and the number of pizza slices he orders.
A deep survey of this relation has divided the customers into 4 groups according to the number of slices they usually order, and came up with a probabilistic model for the length of the car of each group. According to the model, the distribution of the length of a customer's car, conditioned on the number of slices he orders, has a normal distribution with the following parameters:

We spot a customer with a car length of 4.4m heading towards the drive-thru, and we would like to start preparing his pizzas. Find the conditional distribution of the number of pizza slices he will order given that his car length is 4.4m.
💡 Solution
According to Bayes' theorem, the conditional PMF of the number of slices given the car length is:

$$P(\text{slices} = k \mid \text{length} = 4.4) = \frac{f(\text{length} = 4.4 \mid \text{slices} = k) \, P(\text{slices} = k)}{f(\text{length} = 4.4)}$$

We shall start by calculating the numerator, $f(\text{length} = 4.4 \mid \text{slices} = k) \, P(\text{slices} = k)$, for each of the 4 groups.

Now we can easily calculate the denominator, which is in fact the normalization factor of the numerator:

$$f(\text{length} = 4.4) = \sum_{k} f(\text{length} = 4.4 \mid \text{slices} = k) \, P(\text{slices} = k)$$

Therefore, dividing the numerator by this normalization factor gives the conditional PMF of the number of slices given that the car length is 4.4m.
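The following Python sketch illustrates this calculation. Since the survey's parameters are not listed here, the prior over the number of slices and the per-group means and standard deviations below are placeholder values only, not the ones used in the exercise:

```python
import numpy as np
from scipy import stats

# Placeholder prior p(k slices) and placeholder (mean, std) of the car length
# given k slices -- NOT the values from the exercise.
p_slices = np.array([0.4, 0.2, 0.1, 0.3])
car_len_params = [(3.8, 0.4), (4.1, 0.3),
                  (4.4, 0.3), (4.8, 0.5)]

l_obs = 4.4  # observed car length in meters

# Numerator of Bayes' theorem: f(l | k) * p(k) for each group k.
numerator = np.array([stats.norm(mu, sigma).pdf(l_obs) * p_k
                      for (mu, sigma), p_k in zip(car_len_params, p_slices)])

# The denominator is just the normalization factor of the numerator.
posterior = numerator / numerator.sum()
print(posterior)  # conditional PMF of the number of slices given length = 4.4
```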
🗜️ Expectations
The Expectation Value / Mean

The expectation value, or mean, of a random vector is defined as:

$$\boldsymbol{\mu}_{\boldsymbol{X}} = \mathbb{E}\left[\boldsymbol{X}\right] = \int \boldsymbol{x} \, f_{\boldsymbol{X}}(\boldsymbol{x}) \, d\boldsymbol{x}$$

(with the integral replaced by a sum over all possible values in the discrete case).
Variance
The variance of a (scalar) RV $X$ is defined as:

$$\operatorname{Var}(X) = \sigma_X^2 = \mathbb{E}\left[\left(X - \mathbb{E}\left[X\right]\right)^2\right]$$

where $\sigma_X$ is called the standard deviation of $X$ and is defined as the square root of the variance.
Covariance
For a pair of RVs $X$ and $Y$, the covariance is defined by:

$$\operatorname{Cov}(X, Y) = \mathbb{E}\left[\left(X - \mathbb{E}\left[X\right]\right)\left(Y - \mathbb{E}\left[Y\right]\right)\right]$$
Covariance Matrix
For a random vector $\boldsymbol{X}$, we will define a covariance matrix $\Sigma$, where the $(i, j)$-th element of the matrix is the covariance of $X_i$ and $X_j$:

$$\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)$$

This can also be written as:

$$\Sigma = \mathbb{E}\left[\left(\boldsymbol{X} - \boldsymbol{\mu}_{\boldsymbol{X}}\right)\left(\boldsymbol{X} - \boldsymbol{\mu}_{\boldsymbol{X}}\right)^T\right]$$
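As a small numerical illustration, here is a Python sketch that estimates the mean and covariance matrix of a random vector from samples and compares the result to NumPy's built-in `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples of a 2D random vector (here an arbitrary linear mix of normals).
n = 100_000
z = rng.standard_normal((n, 2))
x = z @ np.array([[1.0, 0.5],
                  [0.0, 1.0]]) + np.array([2.0, -1.0])

# Empirical mean and covariance matrix: Sigma = E[(X - mu)(X - mu)^T]
mu = x.mean(axis=0)
centered = x - mu
sigma = centered.T @ centered / (n - 1)

print(mu)                               # close to [2, -1]
print(sigma)
print(np.allclose(sigma, np.cov(x.T)))  # True
```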
Gaussian Vectors (Multivariate Normal Distribution)
One important distribution of a random vector is the distribution of a Gaussian vector, which is also called a multivariate normal distribution. A random vector is a Gaussian random vector if any linear combination of its components produces an RV with a normal distribution.
The multivariate normal distribution of a Gaussian vector $\boldsymbol{X}$ is defined by its mean $\boldsymbol{\mu}$ and its covariance matrix $\Sigma$:

$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^n \left|\Sigma\right|}} \exp\left(-\frac{1}{2}\left(\boldsymbol{x} - \boldsymbol{\mu}\right)^T \Sigma^{-1} \left(\boldsymbol{x} - \boldsymbol{\mu}\right)\right)$$

where $n$ is the length of the Gaussian vector.
Conditional Distributions
In many cases we would like to calculate the distribution of a subset of the components of a Gaussian vector conditioned on the rest of the vector. We shall split the vector into two parts, denoting the first part by $\boldsymbol{X}_1$ and the second by $\boldsymbol{X}_2$, such that:

$$\boldsymbol{X} = \begin{bmatrix} \boldsymbol{X}_1 \\ \boldsymbol{X}_2 \end{bmatrix}$$

We will divide the mean vector and the covariance matrix in a similar way:

$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$

The distribution of $\boldsymbol{X}_1$ conditioned on $\boldsymbol{X}_2 = \boldsymbol{x}_2$ will also be a multivariate normal distribution, with the following parameters:

$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} \left(\boldsymbol{x}_2 - \boldsymbol{\mu}_2\right)$$

$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$
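Here is a minimal NumPy helper which applies these two formulas to a given mean vector and covariance matrix (the numbers in the usage example are arbitrary):

```python
import numpy as np

def conditional_gaussian(mu, sigma, x2, n1):
    """Parameters of X1 | X2 = x2 for a Gaussian vector split after the first n1 entries."""
    mu1, mu2 = mu[:n1], mu[n1:]
    s11 = sigma[:n1, :n1]
    s12 = sigma[:n1, n1:]
    s21 = sigma[n1:, :n1]
    s22 = sigma[n1:, n1:]

    # mu_{1|2} = mu1 + Sigma12 Sigma22^{-1} (x2 - mu2)
    mu_cond = mu1 + s12 @ np.linalg.solve(s22, x2 - mu2)
    # Sigma_{1|2} = Sigma11 - Sigma12 Sigma22^{-1} Sigma21
    sigma_cond = s11 - s12 @ np.linalg.solve(s22, s21)
    return mu_cond, sigma_cond

# Example with arbitrary (made-up) parameters:
mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
print(conditional_gaussian(mu, sigma, x2=np.array([2.5]), n1=1))
# mean: 1 + 0.6 / 1.0 * (2.5 - 2) = 1.3, variance: 2 - 0.6 * 0.6 / 1.0 = 1.64
```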
✍️ Exercise 1.4 - Conditional Multivariate Normal Distribution
A) We are given the following parameters of a Gaussian vector:
Calculate
B) We are given the following parameters of a Gaussian vector:
Calculate
C) We are given the following parameters of a Gaussian vector:
Calculate
💡 Solution
A) We will define and
And accordingly:
Following the formula from the section above, the mean and covariance of the conditional distribution will be:
B) Similarly
And
C) Similarly
And
Prediction Theory
🔮 The Problem
Let us look at the problem of trying to give an "optimal" prediction for the number of slices a customer will buy given that his car length is 4.4m. In order to do so, we must first define what we mean by an "optimal" prediction. Let us look at a few possible definitions:
- The prediction with the lowest probability to be wrong.
- The prediction with the lowest error on average.
- The prediction with the lowest squared error on average.
As it turns out, each of the above definitions results in a different prediction. There are infinitely many ways to define the optimal prediction, and each definition is preferable in different scenarios. In this course, we will mostly use the first and last definitions.
📉 The Loss and The Risk Functions
It turns out that in most cases, we would like to define the optimal prediction as the prediction which minimizes, on average, some penalty function. Let us write it in a more formal way.
Notations:
- It is common to use the hat symbol to denote a prediction. E.g., $\hat{x}$ is a prediction for the RV $X$.
- The function calculating the penalty is called the loss function. It is a function which assigns a score for how bad a prediction is, given the prediction $\hat{x}$ and the actual realization $x$. We will denote this function by $l\left(\hat{x}, x\right)$.
- The expectation value of the loss function is called the risk function, which we will denote by $R\left(\hat{x}\right) = \mathbb{E}\left[l\left(\hat{x}, X\right)\right]$.

Under this definition, the "optimal" prediction can be written as the solution to the following optimization problem:

$$\hat{x}^* = \underset{\hat{x}}{\arg\min}\ R\left(\hat{x}\right) = \underset{\hat{x}}{\arg\min}\ \mathbb{E}\left[l\left(\hat{x}, X\right)\right]$$

A common notation is to use the star sign ($^*$) to denote the optimal prediction.

Returning to the 3 definitions from above, we can now describe them through their loss functions:
Definition | Loss | Loss Name | Risk Name |
---|---|---|---|
Lowest probability to be wrong | $l\left(\hat{x}, x\right) = I\left\{\hat{x} \neq x\right\}$ | Zero-one loss | Misclassification rate |
Lowest error on average | $l\left(\hat{x}, x\right) = \left\lvert \hat{x} - x \right\rvert$ | $l_1$ (absolute error) | MAE (mean absolute error) |
Lowest squared error on average | $l\left(\hat{x}, x\right) = \left(\hat{x} - x\right)^2$ | $l_2$ (squared error) | MSE (mean squared error) |
🔧 A Note About Optimization Problems
Optimization problems are problems of the following form:

$$\begin{aligned}
\boldsymbol{x}^* = \underset{\boldsymbol{x}}{\arg\min}\quad & f(\boldsymbol{x}) \\
\text{subject to}\quad & g_i(\boldsymbol{x}) \leq 0, \quad i = 1, \ldots, m \\
& h_j(\boldsymbol{x}) = 0, \quad j = 1, \ldots, p
\end{aligned}$$

Where:
- $\boldsymbol{x}$ is either a scalar or a vector.
- $f$ is some arbitrary function, which is usually referred to as the objective function (but people also use many other names).
- The $h_j$s and $g_i$s are optional additional equality and inequality constraints on $\boldsymbol{x}$.
Our goal is to find the value of $\boldsymbol{x}$ which produces the minimal value of $f$ among all $\boldsymbol{x}$ which obey the constraints.
The mathematical field of optimization is very broad and deals with finding various methods for efficiently solving such problems. In this course, we will not dive into this field except for a few simple cases:
- The case where there are no constraints and the solution can be found analytically by differentiating and comparing to zero.
- The case where there are only equality constraints ($h_j$s) and the solution can be found analytically by a technique called Lagrange multipliers, which we will present shortly.
- The case where $\boldsymbol{x}$ is discrete and can only take a small finite set of values. Here we can calculate $f(\boldsymbol{x})$ for all possible values of $\boldsymbol{x}$ and pick the minimal one (see the short sketch after this list).
- The case where there are no constraints and we cannot solve the problem analytically, but we can calculate the gradient of $f$. For this case we will use a method called gradient descent, which we will introduce later on in the course.
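Here is a minimal Python sketch of the third case, brute-force minimization over a small finite set of values (the objective function below is arbitrary):

```python
import numpy as np

# x can only take the values 0, 1, 2 or 3; evaluate the objective at all of
# them and pick the minimizer.
candidates = np.arange(0, 4)
f = lambda x: (x - 1.7) ** 2 + 0.3 * x   # some arbitrary objective function

values = np.array([f(x) for x in candidates])
x_star = candidates[np.argmin(values)]
print(x_star, values.min())              # 2 is the minimizer here
```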
📚 Solutions For Popular Loss Functions
✍️ Exercise 1.5
A) Show that in the discrete case, the optimal predictor for the zero-one loss is the point with the maximal PMF:

$$\hat{x}^* = \underset{x}{\arg\max}\ p_X(x)$$

B) Show that in the continuous case, the optimal predictor for the $l_1$ loss is the median:

$$\hat{x}^* = \operatorname{median}(X), \quad \text{i.e., the value for which } F_X(\hat{x}^*) = \tfrac{1}{2}$$

(In the discrete case the optimal predictor is the point closest to being the median, i.e., the point for which both the probability of being larger and the probability of being smaller than $\hat{x}^*$ are less than or equal to $\tfrac{1}{2}$.)

C) Show that the optimal predictor for the $l_2$ loss is the mean:

$$\hat{x}^* = \mathbb{E}\left[X\right]$$
💡 Solution
A) As defined above, the zero-one loss is given by $l\left(\hat{x}, x\right) = I\left\{\hat{x} \neq x\right\}$, therefore the optimal predictor is:

$$\hat{x}^* = \underset{\hat{x}}{\arg\min}\ \mathbb{E}\left[I\left\{\hat{x} \neq X\right\}\right] = \underset{\hat{x}}{\arg\min} \sum_{x} I\left\{\hat{x} \neq x\right\} p_X(x)$$

The sum in the last line is simply a sum over all the possible values of $X$ except for $\hat{x}$. Therefore we can write it in the following way:

$$\hat{x}^* = \underset{\hat{x}}{\arg\min}\ \left(1 - p_X\left(\hat{x}\right)\right) = \underset{\hat{x}}{\arg\max}\ p_X\left(\hat{x}\right)$$

B) The $l_1$ loss is given by $l\left(\hat{x}, x\right) = \left\lvert \hat{x} - x \right\rvert$. Therefore the optimal predictor is:

$$\hat{x}^* = \underset{\hat{x}}{\arg\min}\ \mathbb{E}\left[\left\lvert \hat{x} - X \right\rvert\right]$$

We can solve this optimization problem by differentiating with respect to $\hat{x}$ and comparing to zero:

$$\frac{d}{d\hat{x}} \mathbb{E}\left[\left\lvert \hat{x} - X \right\rvert\right]
= \frac{d}{d\hat{x}} \left( \int_{-\infty}^{\hat{x}} \left(\hat{x} - x\right) f_X(x)\, dx + \int_{\hat{x}}^{\infty} \left(x - \hat{x}\right) f_X(x)\, dx \right)
= F_X\left(\hat{x}\right) - \left(1 - F_X\left(\hat{x}\right)\right) = 0
\quad \Rightarrow \quad F_X\left(\hat{x}^*\right) = \tfrac{1}{2}$$

C) The $l_2$ loss is given by $l\left(\hat{x}, x\right) = \left(\hat{x} - x\right)^2$. Therefore the optimal predictor is:

$$\hat{x}^* = \underset{\hat{x}}{\arg\min}\ \mathbb{E}\left[\left(\hat{x} - X\right)^2\right]$$

We can solve this optimization problem by differentiating with respect to $\hat{x}$ and comparing to zero:

$$\frac{d}{d\hat{x}} \mathbb{E}\left[\left(\hat{x} - X\right)^2\right] = 2\hat{x} - 2\,\mathbb{E}\left[X\right] = 0
\quad \Rightarrow \quad \hat{x}^* = \mathbb{E}\left[X\right]$$
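We can also verify these three results numerically with a brute-force search over candidate predictions, using a small made-up discrete distribution:

```python
import numpy as np

# A small (made-up) discrete distribution over the values 0..3.
values = np.arange(4)
pmf = np.array([0.4, 0.2, 0.1, 0.3])

def risk(loss, x_hat):
    # R(x_hat) = E[loss(x_hat, X)]
    return sum(loss(x_hat, x) * p for x, p in zip(values, pmf))

# Zero-one loss: minimize the misclassification rate over the possible values.
zero_one = lambda x_hat, x: float(x_hat != x)
print(values[np.argmin([risk(zero_one, v) for v in values])])  # 0 = arg max PMF

# l1 and l2 losses: minimize over a fine grid of candidate predictions.
grid = np.linspace(-1.0, 4.0, 5001)
l1 = lambda x_hat, x: abs(x_hat - x)
l2 = lambda x_hat, x: (x_hat - x) ** 2
print(grid[np.argmin([risk(l1, g) for g in grid])])  # ~1.0 = (closest to) the median
print(grid[np.argmin([risk(l2, g) for g in grid])])  # ~1.3 = the mean
```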
✍️ Exercise 1.6 - Predicting The Number of Pizza Slices
Continuing the task from earlier, we would now like to make a prediction for the number of pizza slices a customer will order given that the length of his car is 4.4m. Calculate the optimal prediction for each of the 3 loss functions.
💡 Solution
We have found earlier that:
According to the results from the previous question, the optimal predictions are:
- Zero-one loss:
- $l_1$ loss:
The value which is closest to being the median is
(i.e., the value for which both the probability of being above it and the probability of being below it are at most $\tfrac{1}{2}$)
- $l_2$ loss:
✍️ Exercise 1.7 Heart Rate and Blinking
New research has found a connection between a person's blinking rate and their heart rate. We will use one RV to denote a person's heart rate and another to denote their blinking rate. According to the research, the distribution of the random vector composed of these two RVs can be roughly modeled using a multivariate normal distribution with the following parameters:
Find the function which gives the optimal prediction for a person's heart rate given their blinking rate. Find the solution for each one of the three losses. Is there a difference between the 3 predictions? Why is that?
💡 Solution
We will use the formula for calculating the conditional multivariate normal distribution. In this case, the first part of the vector is the heart rate (the quantity we want to predict) and the second part is the blinking rate (the observed quantity).
Therefore:
Since the mean, the median, and the point of maximal density are all the same in a Gaussian distribution, all three predictors will be the same and equal to the conditional mean:

$$\hat{x}_1^* = \mu_1 + \frac{\Sigma_{12}}{\Sigma_{22}} \left(x_2 - \mu_2\right)$$
This will always be true for symmetric unimodal distributions such as the normal distribution.
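The following Python sketch illustrates this prediction function. Since the model's parameters are not listed here, the mean vector and covariance matrix below are placeholder values only (heart rate first, blinking rate second):

```python
import numpy as np

# Placeholder parameters (NOT the ones from the research): heart rate first,
# blinking rate second.
mu = np.array([70.0, 15.0])
sigma = np.array([[100.0, 20.0],
                  [20.0, 25.0]])

def predict_heart_rate(blink_rate):
    # Conditional mean of a Gaussian: mu_1 + Sigma_12 / Sigma_22 * (b - mu_2).
    # For a Gaussian this is simultaneously the mean, the median and the mode,
    # so it is the optimal prediction under all three losses.
    return mu[0] + sigma[0, 1] / sigma[1, 1] * (blink_rate - mu[1])

print(predict_heart_rate(20.0))  # 70 + 0.8 * 5 = 74.0
```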
Lagrange Multipliers
The method of Lagrange multipliers is used to solve optimization problems with only equality constraints:

$$\begin{aligned}
\boldsymbol{x}^* = \underset{\boldsymbol{x}}{\arg\min}\quad & f(\boldsymbol{x}) \\
\text{subject to}\quad & h_j(\boldsymbol{x}) = 0, \quad j = 1, \ldots, p
\end{aligned}$$

where $f$ and the $h_j$s must all be differentiable.

We will introduce a new set of variables $\lambda_1, \ldots, \lambda_p$, one for each constraint, and define the following function:

$$\mathcal{L}\left(\boldsymbol{x}, \boldsymbol{\lambda}\right) = f(\boldsymbol{x}) + \sum_{j} \lambda_j h_j(\boldsymbol{x})$$

$\mathcal{L}$ is called the Lagrangian and the $\lambda_j$s are called the Lagrange multipliers.

It can be shown that if $\boldsymbol{x}^*$ is a solution of the original optimization problem, then there exists some set of values $\boldsymbol{\lambda}^*$ for which $\left(\boldsymbol{x}^*, \boldsymbol{\lambda}^*\right)$ is a stationary point of the Lagrangian, i.e.:

$$\nabla_{\boldsymbol{x}} \mathcal{L}\left(\boldsymbol{x}^*, \boldsymbol{\lambda}^*\right) = 0, \qquad \nabla_{\boldsymbol{\lambda}} \mathcal{L}\left(\boldsymbol{x}^*, \boldsymbol{\lambda}^*\right) = 0$$

Therefore we can easily find candidates for $\boldsymbol{x}^*$ by differentiating and solving this last equation.
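Here is a minimal SymPy sketch that applies this recipe to a small made-up problem, minimizing $x^2 + y^2$ subject to $x + y = 1$:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)

# Objective and equality constraint (a made-up example).
f = x**2 + y**2
h = x + y - 1

# Lagrangian and its stationary points.
L = f + lam * h
stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)
print(stationary)  # x = 1/2, y = 1/2, lambda = -1
```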
✍️ Exercise 1.8 Lagrange Multipliers
Solve the following optimization problem:
💡 Solution
We will introduce one Lagrange multiplier and define the following Lagrangian:
By differentiating and comparing to zero we get the following result: