Background - Probability Theory Review
- Goal: produce a system which makes predictions based on observed data.
- Method: build a model of the relationship between the observed data and the target data.
- One common way is to use a probabilistic model.
- Most of the course will focus on building the model.
- In this tutorial we will focus on making predictions given a probabilistic model.
🎰 Basic Concepts
An example of a random phenomenon:

We take a glass full of juice, pour it on the floor (don't try this at home), and look at the shape of the resulting spill.
🎰 Basic Concepts - Cont.

🎰 Basic Concepts - Cont. 2
| Name | Usually denoted by | Definition | Example |
|---|---|---|---|
| A random phenomenon | - | Some process which generates random outcomes | Spilling a glass of juice on the floor and examining the shape of the spill |
| A sample | $\omega$ | A single outcome of the process | Some specific spill shape |
| Sample space | $\Omega$ | The space of all possible outcomes of the given process | The space of all possible spill shapes |
| Random variables (RV) | $X\left(\omega\right),Y\left(\omega\right),\ldots$ | A function which assigns a real number to a given sample | A function which returns the perimeter of a spill: $X\left(\omega\right)$; a function which returns the area of a spill: $Y\left(\omega\right)$ |
| An event | $A,B,\ldots$ | A collection of samples, i.e., a subset of the sample space $\Omega$. We will often define an event through a condition on random variables | The collection of all spills with a perimeter smaller than 2: $A=\left\{\omega:X\left(\omega\right)<2\right\}$; the collection of all spills with an area larger than 1: $B=\left\{\omega:Y\left(\omega\right)>1\right\}$ |
| Event space | $\mathcal{F}$ | A space of events | The space of all possible collections of spill shapes |
| Probability measure | $Pr\left(A\right)$ | A function which returns the probability of a random sample to be an element in some event $A$ | |
| Conditional probability measure | $Pr\left(A\mid B\right)$ | A function which returns the probability of a random sample to be an element in event $A$ given that it is an element in event $B$ | The probability of a spill to have a perimeter smaller than 2, given that it has an area larger than 1: $Pr\left(X<2\mid Y>1\right)$ |
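To make these abstractions concrete, here is a minimal Monte Carlo sketch. It assumes a toy stand-in model (a "spill" modeled as a random rectangle) and illustrative thresholds; none of these specifics come from the text above.

```python
import random

random.seed(0)

def sample_spill():
    """One sample omega: a toy 'spill' modeled as a rectangle with random sides."""
    return (random.uniform(0, 2), random.uniform(0, 2))

def perimeter(spill):  # an RV X(omega): assigns the spill its perimeter
    a, b = spill
    return 2 * (a + b)

def area(spill):       # an RV Y(omega): assigns the spill its area
    a, b = spill
    return a * b

samples = [sample_spill() for _ in range(100_000)]

# A probability measure, estimated by counting: Pr(perimeter < 6)
pr_a = sum(perimeter(s) < 6 for s in samples) / len(samples)

# A conditional probability measure: Pr(perimeter < 6 | area > 1)
in_b = [s for s in samples if area(s) > 1]
pr_a_given_b = sum(perimeter(s) < 6 for s in in_b) / len(in_b)
print(pr_a, pr_a_given_b)
```

Conditioning simply restricts the counting to those samples which fall inside the conditioning event.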
🎰 Basic Concepts - Notations

- Realizations: the outcome of an RV (random variable) for a given sample: $x=X\left(\omega\right)$
- We will use $x$ as a shorthand for $X\left(\omega\right)$.
- Any function of a random variable (RV) is also an RV. For example $X\left(\omega\right)^{2}$ or $X\left(\omega\right)+Y\left(\omega\right)$.
- Slight abuse of notation: we will use the name "samples" to refer to the realizations.
🎰 Basic Concepts - Cont. 4

Random Vectors

A series of RVs packed together in a vector:
$$\boldsymbol{X}\left(\omega\right)=\left[X_{1}\left(\omega\right),X_{2}\left(\omega\right),\ldots,X_{n}\left(\omega\right)\right]^{T}$$
and its realization:
$$\boldsymbol{x}=\left[x_{1},x_{2},\ldots,x_{n}\right]^{T}$$
Toy Model:

The 🍕 Pizza & Pancakes 🥞 Drive-Thru

We would like to help a new local drive-thru business that sells pizzas and pancakes, and which would like to be able to make some predictions regarding its business.
✍️ Exercise 1.1 - Warm-Up

The random phenomenon we will be interested in is that of a customer making an order. Our sample will be a single customer making an order. Give examples of:

- 2 discrete random variables
- 2 continuous random variables
- 2 events
- 2 possible probability measures
- A possible probability measure using the union of the 2 events
- A possible probability measure using the intersection of the 2 events
- A possible probability measure using the exclusion of the 2 events
- A possible conditional probability measure

Make sure that all the probability measures are consistent with one another.
💡 Solution

- 2 discrete random variables:
  - $X\left(\omega\right)$: The number of pizza slices the customer ordered. 🍕
  - $Y\left(\omega\right)$: The number of pancakes the customer ordered. 🥞
- 2 continuous random variables:
  - $L\left(\omega\right)$: The length of the customer's car in meters. 🚗
  - $T\left(\omega\right)$: The amount of tip the customer left in dollars. 💵
- 2 events:
  - The customer ordered exactly 1 slice of pizza: $X=1$.
  - The customer left more than 2 USD tip: $T>2$.
💡 Solution - Cont.

- 2 possible probability measures:
  - The prob. a customer will order 1 slice: $$Pr\left(X=1\right)=0.2$$
  - The prob. a customer will leave more than 2 USD tip: $$Pr\left(T>2\right)=0.5$$
- A possible prob. measure using the union of the 2 events:
  - The prob. of either "order 1 slice" or "tip more than 2 USD": $$Pr\left(X=1\cup T>2\right)=0.6$$
💡 Solution - Cont. 2

- A possible probability measure using the intersection of the 2 events:
  - The prob. of both "order 1 slice" and "tip more than 2 USD": $$Pr\left(X=1\cap T>2\right)=0.1$$
- A possible probability measure using the exclusion of the 2 events:
  - The prob. of "tip more than 2 USD" but not "order 1 slice": $$Pr\left(\left\{T>2\right\}\setminus\left\{X=1\right\}\right)=Pr\left(T>2\right)-Pr\left(X=1\cap T>2\right)=0.4$$
- A possible conditional probability measure:
  - The prob. of "tip more than 2 USD" given "order 1 slice": $$Pr\left(T>2\mid X=1\right)=\frac{0.1}{0.2}=0.5$$
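The consistency of such an assignment can be checked mechanically with inclusion-exclusion and the definition of conditional probability; a small sketch using the values assigned above:

```python
# Probability values assigned in the solution above.
p_a = 0.2          # Pr(X = 1): the customer orders exactly 1 pizza slice
p_b = 0.5          # Pr(T > 2): the customer tips more than 2 USD
p_a_and_b = 0.1    # Pr(X = 1 and T > 2)

p_a_or_b = p_a + p_b - p_a_and_b   # inclusion-exclusion: Pr(A or B)
p_b_minus_a = p_b - p_a_and_b      # Pr(B but not A)
p_b_given_a = p_a_and_b / p_a      # definition: Pr(B | A) = Pr(A and B) / Pr(A)
print(p_a_or_b, p_b_minus_a, p_b_given_a)
```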
📊 Distributions

Distribution Functions

CDF (Cumulative Distribution Function):
$$F_{X}\left(x\right)=Pr\left(X\leq x\right)$$

Distribution Functions - Cont.

PMF (Probability Mass Function) - (Discrete):
$$p_{X}\left(x\right)=Pr\left(X=x\right)$$

PDF (Probability Density Function) - (Continuous):
$$f_{X}\left(x\right)=\frac{d}{dx}F_{X}\left(x\right)$$
Or:
$$F_{X}\left(x\right)=\int_{-\infty}^{x}f_{X}\left(u\right)du$$

Conditional Distribution

CDF:
$$F_{X\mid Y}\left(x\mid y\right)=Pr\left(X\leq x\mid Y=y\right)$$
PMF:
$$p_{X\mid Y}\left(x\mid y\right)=Pr\left(X=x\mid Y=y\right)$$

Note: A scalar RV is simply a random vector of length 1.
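As a small illustration of the relation between a PMF and its CDF, a sketch assuming a hypothetical PMF over the values 0..3:

```python
import numpy as np

# A hypothetical PMF over a discrete RV taking the values 0..3.
values = np.array([0, 1, 2, 3])
pmf = np.array([0.4, 0.2, 0.1, 0.3])
assert np.isclose(pmf.sum(), 1.0)  # a PMF must sum to 1

# The CDF is the running sum of the PMF: F(x) = Pr(X <= x)
cdf = np.cumsum(pmf)
print(cdf)
```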
Some Basic Rules

The law of total probability (marginal probability):
$$p_{X}\left(x\right)=\sum_{y}p_{X,Y}\left(x,y\right)$$
The conditional distribution (definition):
$$p_{X\mid Y}\left(x\mid y\right)=\frac{p_{X,Y}\left(x,y\right)}{p_{Y}\left(y\right)}$$

Some Basic Rules - Cont.

Bayes' theorem:
$$p_{Y\mid X}\left(y\mid x\right)=\frac{p_{X\mid Y}\left(x\mid y\right)p_{Y}\left(y\right)}{p_{X}\left(x\right)}$$
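These rules can be demonstrated on a tiny discrete table; a sketch with a hypothetical joint PMF:

```python
import numpy as np

# A hypothetical joint PMF p(x, y); rows index x, columns index y.
p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

p_x = p_xy.sum(axis=1)            # law of total probability: p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y          # definition: p(x|y) = p(x, y) / p(y), column-wise

# Bayes' theorem recovers p(y|x) from p(x|y), p(y) and p(x):
p_y_given_x = (p_x_given_y * p_y) / p_x[:, None]
print(p_y_given_x)
```

Each column of `p_x_given_y` and each row of `p_y_given_x` is itself a valid PMF (it sums to 1).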
Back to The 🍕 Pizza & Pancakes 🥞 Drive-Thru
✍️ Exercise 1.2 - Discrete Distributions

We are given the following joint distribution of the number of pizza slices 🍕 a customer orders (rows) and the number of pancakes 🥞 (columns):

| 🍕 \ 🥞 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 0.15 | 0.2 | 0.05 |
| 1 | 0.08 | 0.03 | ??? | 0.04 |
| 2 | 0.02 | 0.03 | 0.04 | 0.01 |
| 3 | 0.1 | 0.05 | 0.05 | 0.1 |

A) What is the missing number in the table?

B) What is the probability that a customer orders 1 slice of pizza, given that they ordered 0 pancakes?

C) What is the probability that 2 independent customers will buy 3 pizza slices in total?
💡 Solution

A) The sum of all possible events should always be 1; therefore the missing value must be equal to:
$$1-\left(0+0.15+\ldots+0.1\right)=1-0.95=0.05$$

| 🍕 \ 🥞 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 0.15 | 0.2 | 0.05 |
| 1 | 0.08 | 0.03 | 0.05 | 0.04 |
| 2 | 0.02 | 0.03 | 0.04 | 0.01 |
| 3 | 0.1 | 0.05 | 0.05 | 0.1 |
💡 Solution

B) By definition of the conditional distribution:
$$Pr\left(X=1\mid Y=0\right)=\frac{p_{X,Y}\left(1,0\right)}{p_{Y}\left(0\right)}=\frac{0.08}{0+0.08+0.02+0.1}=\frac{0.08}{0.2}=0.4$$
💡 Solution

C)
- The two customers are independent, so the probability of a specific pair of orders is the product of the two probabilities.
- We must sum over all the combinations which give 3 slices in total: $\left(0,3\right),\left(1,2\right),\left(2,1\right),\left(3,0\right)$.
💡 Solution - Cont.

We shall start by calculating the marginal distribution of the number of pizza slices (the row sums):
$$p_{X}\left(0\right)=0.4,\quad p_{X}\left(1\right)=0.2,\quad p_{X}\left(2\right)=0.1,\quad p_{X}\left(3\right)=0.3$$

💡 Solution - Cont. 2

Therefore the probability of two customers buying 3 slices in total is:
$$\sum_{k=0}^{3}p_{X}\left(k\right)p_{X}\left(3-k\right)=0.4\cdot0.3+0.2\cdot0.1+0.1\cdot0.2+0.3\cdot0.4=0.28$$
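The three answers can be checked numerically; a sketch assuming the rows of the table index the number of pizza slices $X$ and the columns the number of pancakes $Y$:

```python
import numpy as np

# Joint PMF from the exercise; 0.0 at [1, 2] is a placeholder for the missing entry.
p = np.array([[0.0,  0.15, 0.2,  0.05],
              [0.08, 0.03, 0.0,  0.04],
              [0.02, 0.03, 0.04, 0.01],
              [0.1,  0.05, 0.05, 0.1]])

missing = 1.0 - p.sum()              # A) all entries must sum to 1
p[1, 2] = missing

p_y0 = p[:, 0].sum()                 # marginal Pr(Y = 0)
b = p[1, 0] / p_y0                   # B) Pr(X = 1 | Y = 0)

p_x = p.sum(axis=1)                  # marginal PMF of the number of pizza slices
c = sum(p_x[k] * p_x[3 - k] for k in range(4))  # C) Pr(X1 + X2 = 3), independent
print(missing, b, c)
```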
✍️ Exercise 1.3 - Mixed Distributions

- There is a correlation between the length of a car and the number of pizza slices a customer orders.
- The length of the car, $L$, conditioned on the number of slices, $X$, has a normal distribution:

Find the conditional distribution of the number of pizza slices given a car length of 4.4 m, $p_{X\mid L}\left(x\mid 4.4\right)$.
💡 Solution

According to Bayes' theorem:
$$p_{X\mid L}\left(x\mid l\right)=\frac{f_{L\mid X}\left(l\mid x\right)p_{X}\left(x\right)}{f_{L}\left(l\right)}$$
We shall start by calculating the numerator, $f_{L\mid X}\left(l\mid x\right)p_{X}\left(x\right)$:

💡 Solution - Cont.

💡 Solution - Cont. 2

Now we can easily calculate the denominator (the normalization factor):
$$f_{L}\left(l\right)=\sum_{x}f_{L\mid X}\left(l\mid x\right)p_{X}\left(x\right)$$
Therefore:
📏 Expectations

The expectation value / mean:
$$\mathbb{E}\left[X\right]=\sum_{x}x\,p_{X}\left(x\right)\quad\text{(discrete)}\qquad\mathbb{E}\left[X\right]=\int x\,f_{X}\left(x\right)dx\quad\text{(continuous)}$$
Variance:
$$\mathrm{Var}\left(X\right)=\mathbb{E}\left[\left(X-\mathbb{E}\left[X\right]\right)^{2}\right]$$
$\sigma=\sqrt{\mathrm{Var}\left(X\right)}$ is the standard deviation.

📏 Expectations - Cont.

Covariance

For a pair of RVs, the covariance is defined by:
$$\mathrm{Cov}\left(X,Y\right)=\mathbb{E}\left[\left(X-\mathbb{E}\left[X\right]\right)\left(Y-\mathbb{E}\left[Y\right]\right)\right]$$

Covariance Matrix

For a random vector $\boldsymbol{X}$:
$$\Sigma_{ij}=\mathrm{Cov}\left(X_{i},X_{j}\right)$$
This can also be written as:
$$\Sigma=\mathbb{E}\left[\left(\boldsymbol{X}-\mathbb{E}\left[\boldsymbol{X}\right]\right)\left(\boldsymbol{X}-\mathbb{E}\left[\boldsymbol{X}\right]\right)^{T}\right]$$
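These definitions can be checked against empirical moments; a sketch drawing correlated samples (the generating model below is a hypothetical choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw correlated samples: y depends on x by construction, so Cov(x, y) = 0.5.
n = 200_000
x = rng.normal(0.0, 1.0, n)
y = 0.5 * x + rng.normal(0.0, 1.0, n)

mean_x = x.mean()                                  # E[x]
var_x = ((x - mean_x) ** 2).mean()                 # Var(x) = E[(x - E[x])^2]
cov_xy = ((x - mean_x) * (y - y.mean())).mean()    # Cov(x, y), per the definition

# np.cov stacks the variables and returns the full covariance matrix Sigma
sigma = np.cov(np.stack([x, y]), bias=True)
print(sigma)
```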
Gaussian Vectors
(Multivariate Normal Distribution)
A random vector for which any linear combination of its components has a normal distribution.
Gaussian Vectors
(Multivariate Normal Distribution) - Cont.

Conditional Distributions

If we split a Gaussian vector into:
$$\boldsymbol{X}=\begin{bmatrix}\boldsymbol{X}_{1}\\\boldsymbol{X}_{2}\end{bmatrix}$$
and split its mean and covariance matrix accordingly:
$$\boldsymbol{\mu}=\begin{bmatrix}\boldsymbol{\mu}_{1}\\\boldsymbol{\mu}_{2}\end{bmatrix},\quad\Sigma=\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{bmatrix}$$
then the distribution of $\boldsymbol{X}_{1}\mid\boldsymbol{X}_{2}=\boldsymbol{x}_{2}$ will also be a multivariate normal distribution.

Gaussian Vectors
(Multivariate Normal Distribution) - Cont. 2

Conditional Distributions - Cont.

The conditional distribution will have the following parameters:
$$\boldsymbol{\mu}_{1\mid2}=\boldsymbol{\mu}_{1}+\Sigma_{12}\Sigma_{22}^{-1}\left(\boldsymbol{x}_{2}-\boldsymbol{\mu}_{2}\right)$$
$$\Sigma_{1\mid2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
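These formulas translate directly into code; a minimal NumPy sketch (the example numbers at the bottom are hypothetical):

```python
import numpy as np

def conditional_gaussian(mu, sigma, idx1, idx2, x2):
    """Parameters of X1 | X2 = x2 for a Gaussian vector, per the formulas above."""
    mu1, mu2 = mu[idx1], mu[idx2]
    s11 = sigma[np.ix_(idx1, idx1)]
    s12 = sigma[np.ix_(idx1, idx2)]
    s21 = sigma[np.ix_(idx2, idx1)]
    s22 = sigma[np.ix_(idx2, idx2)]
    gain = s12 @ np.linalg.inv(s22)
    mu_cond = mu1 + gain @ (x2 - mu2)        # mu_{1|2}
    sigma_cond = s11 - gain @ s21            # Sigma_{1|2}
    return mu_cond, sigma_cond

# A hypothetical 2D example:
mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
m, s = conditional_gaussian(mu, sigma, [0], [1], np.array([3.0]))
print(m, s)  # mu_{1|2} = 1 + 0.8 * (3 - 2) = 1.8 ; Sigma_{1|2} = 2 - 0.64 = 1.36
```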
โ๏ธ Exercise 1.4
Conditional Multivariate Normal Distribution
A) We are given the following parameters of a Gaussian vector:
Calculate
๐ก Solution
Therefore:
โ๏ธ Exercise 1.4
Conditional Multivariate Normal Distribution
B) We are given the following parameters of a Gaussian vector:
Calculate
๐ก Solution
Therefore:
โ๏ธ Exercise 1.4
Conditional Multivariate Normal Distribution
C) We are given the following parameters of a Gaussian vector:
Calculate
๐ก Solution
Therefore:
Prediction Theory
🔮 The Problem

Give an "optimal" prediction for the number of slices a customer will buy, given that their car length is 4.4 m.

We must first define what we mean by an "optimal" prediction.
A few optional definitions:
- The prediction with the lowest probability to be wrong.
- The prediction with the lowest error on average.
- The prediction with the lowest squared error on average.
- Each of the above definitions results in a different prediction.
- There is an infinite number of ways to define the optimal prediction.
- Each definition is preferable in different scenarios.
📉 The Loss and The Risk Functions

In most cases we are searching for the prediction which minimizes some penalty function on average.

Let's write this in a more formal way.

📉 The Loss and The Risk Functions - Cont.

Notations:

- $\hat{y}$ - a prediction for $y$.
- Loss - a function measuring how bad a prediction is: $l\left(\hat{y},y\right)$.
- Risk - the expectation value of the loss function: $R\left(\hat{y}\right)=\mathbb{E}\left[l\left(\hat{y},y\right)\right]$.

The "optimal" prediction is the solution to the following optimization problem:
$$\hat{y}^{*}=\underset{\hat{y}}{\arg\min}\;\mathbb{E}\left[l\left(\hat{y},y\right)\right]$$
📉 The Loss and The Risk Functions - Cont. 2

Back to the 3 examples from before:

| Definition | Loss | Loss name | Risk name |
|---|---|---|---|
| Lowest probability to be wrong | $l\left(\hat{y},y\right)=I\left\{\hat{y}\neq y\right\}$ | Zero-one loss | Misclassification rate |
| Lowest error on average | $l\left(\hat{y},y\right)=\left\lvert\hat{y}-y\right\rvert$ | $\ell_1$ loss | MAE (mean absolute error) |
| Lowest squared error on average | $l\left(\hat{y},y\right)=\left(\hat{y}-y\right)^{2}$ | $\ell_2$ loss | MSE (mean squared error) |
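The correspondence between each loss and its optimal predictor can be checked by brute-force minimization of the risk over all candidate predictions; a sketch with a hypothetical PMF:

```python
import numpy as np

# A skewed hypothetical PMF over y in {0, 1, 2, 3}.
y_vals = np.array([0, 1, 2, 3])
pmf = np.array([0.4, 0.2, 0.1, 0.3])

def risk(loss, y_hat):
    """Risk = expectation of the loss under the PMF."""
    return sum(p * loss(y_hat, y) for p, y in zip(pmf, y_vals))

def best(loss):
    """Brute force: the candidate prediction with the minimal risk."""
    return min(y_vals, key=lambda y_hat: risk(loss, y_hat))

mode_pred = best(lambda a, b: float(a != b))   # zero-one loss -> the mode (max PMF)
median_pred = best(lambda a, b: abs(a - b))    # l1 loss -> (closest to) the median
mean_pred = best(lambda a, b: (a - b) ** 2)    # l2 loss -> closest to the mean
print(mode_pred, median_pred, mean_pred)
```

Here the mode is 0, while the median (the first value where the CDF reaches 0.5) and the value closest to the mean (1.3) are both 1, so the three losses lead to different predictions.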
🔧 A Note About Optimization Problems

Optimization problems are problems of the following form:
$$\begin{aligned}\hat{x}=\underset{x}{\arg\min}\quad&f\left(x\right)\\\text{subject to}\quad&g_{i}\left(x\right)\leq0,\quad i=1,\ldots,m\\&h_{j}\left(x\right)=0,\quad j=1,\ldots,p\end{aligned}$$

- $x$ is either a scalar or a vector.
- $f$ is an arbitrary function called the objective.
- The $g_{i}$s and $h_{j}$s are optional additional inequality and equality constraints.

Goal: to find the $\hat{x}$ which produces the minimal value of $f$ among all $x$s which obey the constraints.
๐ง A Note About Optimization Problems - Cont.
In this course, we will care about 4 cases:
- No constraints and analytically solvable by differentiating and comparing to zero (for simple functions).
- With equality constraints and analytically solvable using Lagrange Multipliers.
- Brute force, i.e. checking all possible solutions (for discrete cases with a small set of values).
- Using a method called gradient descent, which we will present later in this course.
📉 Solutions For Popular Loss Functions

✍️ Exercise 1.5

A) Show that in the discrete case, the optimal predictor for the zero-one loss is the point with the maximal PMF:
$$\hat{y}^{*}=\underset{y}{\arg\max}\;p\left(y\right)$$

💡 Solution

A) The zero-one loss is given by $l\left(\hat{y},y\right)=I\left\{\hat{y}\neq y\right\}$, therefore:
$$\hat{y}^{*}=\underset{\hat{y}}{\arg\min}\;\mathbb{E}\left[I\left\{\hat{y}\neq y\right\}\right]=\underset{\hat{y}}{\arg\min}\;\left(1-p\left(\hat{y}\right)\right)=\underset{\hat{y}}{\arg\max}\;p\left(\hat{y}\right)$$
📉 Solutions For Popular Loss Functions

✍️ Exercise 1.5

B) Show that in the continuous case, the optimal predictor for the $\ell_1$ loss is the median:
$$F\left(\hat{y}^{*}\right)=\frac{1}{2}$$
(in the discrete case it is the point closest to being the median)

💡 Solution

B) The $\ell_1$ loss is given by $l\left(\hat{y},y\right)=\left\lvert\hat{y}-y\right\rvert$. Therefore:
$$\hat{y}^{*}=\underset{\hat{y}}{\arg\min}\;\mathbb{E}\left[\left\lvert\hat{y}-y\right\rvert\right]$$
We will solve this by differentiating and comparing to zero:

💡 Solution - Cont.

$$\frac{d}{d\hat{y}}\mathbb{E}\left[\left\lvert\hat{y}-y\right\rvert\right]=\mathbb{E}\left[\mathrm{sign}\left(\hat{y}-y\right)\right]=Pr\left(y<\hat{y}\right)-Pr\left(y>\hat{y}\right)=2F\left(\hat{y}\right)-1=0\quad\Rightarrow\quad F\left(\hat{y}^{*}\right)=\frac{1}{2}$$
📉 Solutions For Popular Loss Functions

✍️ Exercise 1.5

C) Show that the optimal predictor for the $\ell_2$ loss is the mean:
$$\hat{y}^{*}=\mathbb{E}\left[y\right]$$

💡 Solution

C) The $\ell_2$ loss is given by $l\left(\hat{y},y\right)=\left(\hat{y}-y\right)^{2}$. Therefore:
$$\hat{y}^{*}=\underset{\hat{y}}{\arg\min}\;\mathbb{E}\left[\left(\hat{y}-y\right)^{2}\right]$$
We will solve this by differentiating and comparing to zero:

💡 Solution - Cont.

$$\frac{d}{d\hat{y}}\mathbb{E}\left[\left(\hat{y}-y\right)^{2}\right]=2\hat{y}-2\mathbb{E}\left[y\right]=0\quad\Rightarrow\quad\hat{y}^{*}=\mathbb{E}\left[y\right]$$
✍️ Exercise 1.6 - Predicting the Number of Pizza Slices

We would now like to make a prediction for the number of pizza slices given a car length of 4.4 m.

Calculate the optimal prediction for each of the 3 loss functions.

💡 Solution

We have found earlier that:

Therefore:

Zero-one loss:

$\ell_1$ loss:

The value closest to being the median is

$\ell_2$ loss:
✍️ Exercise 1.7 - Heart Rate and Blinking

New research has found a connection between a person's blinking rate and their heart rate.

- $X_1$: A person's heart rate.
- $X_2$: A person's blinking rate.
- $\boldsymbol{X}=\left[X_1,X_2\right]^{T}$ has a multivariate normal distribution with:

Find the function which predicts $X_1$ given $X_2$.

💡 Solution

Using the formula for the conditional multivariate normal distribution:

Therefore:

💡 Solution - Cont.

The mean, the median and the point of maximal density are all the same point. Therefore all three predictors will give the same prediction:

This will always be true for symmetric distributions such as the normal distribution.
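A sketch of such a predictor, using the conditional-mean formula for a 2D Gaussian; the numbers below are hypothetical stand-ins, not the exercise's actual parameters:

```python
import numpy as np

# Hypothetical parameters: mu = (E[heart rate], E[blinking rate]) and a 2x2 covariance.
mu = np.array([70.0, 17.0])
sigma = np.array([[100.0, 30.0],
                  [30.0,  25.0]])

def predict_heart_rate(blink_rate):
    """x1_hat(x2) = mu_1 + sigma_12 / sigma_22 * (x2 - mu_2): the conditional mean,
    which for a Gaussian is also the conditional median and mode."""
    return mu[0] + sigma[0, 1] / sigma[1, 1] * (blink_rate - mu[1])

print(predict_heart_rate(17.0))  # at x2 = mu_2 the prediction is just mu_1
```

The predictor is a linear (affine) function of the observation, which is a general property of jointly Gaussian variables.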
Lagrange Multipliers

A method for solving optimization problems with only equality constraints:
$$\underset{x}{\arg\min}\;f\left(x\right)\quad\text{s.t.}\quad h_{j}\left(x\right)=0,\;j=1,\ldots,p$$
$f$ and the $h_{j}$s must all be differentiable.

Lagrange Multipliers - Cont.

We will define:

- $\lambda_{j}$s: one Lagrange multiplier for every constraint.
- The Lagrangian:
$$\mathcal{L}\left(x,\boldsymbol{\lambda}\right)=f\left(x\right)+\sum_{j}\lambda_{j}h_{j}\left(x\right)$$

If $x^{*}$ is a solution of the original optimization problem, then there exists some set of values $\lambda_{j}$s such that:
$$\nabla_{x}\mathcal{L}\left(x^{*},\boldsymbol{\lambda}\right)=0,\quad\nabla_{\boldsymbol{\lambda}}\mathcal{L}\left(x^{*},\boldsymbol{\lambda}\right)=0$$
✍️ Exercise 1.8 - Lagrange Multipliers

Solve the following optimization problem:

💡 Solution

We will introduce one Lagrange multiplier $\lambda$ and define the following Lagrangian:

By differentiating and comparing to zero we get the following result:
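To illustrate the mechanics of the method, here is a stand-in problem (minimize $f\left(x,y\right)=x^{2}+y^{2}$ subject to $x+y=1$); it is chosen for illustration and is not necessarily the exercise's problem:

```python
import numpy as np

# Lagrangian: L = x^2 + y^2 + lam * (x + y - 1)
# Stationarity: dL/dx = 2x + lam = 0, dL/dy = 2y + lam = 0, dL/dlam = x + y - 1 = 0
# These three conditions form a linear system A @ [x, y, lam] = b:
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)  # the minimum is at x = y = 1/2, with lam = -1
```

For this quadratic objective the stationarity conditions are linear, so solving them exactly recovers the constrained minimum.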