Tutorial 1
Probability Theory and Predictions

Background - Probability Theory Review


  • Goal: produce a system which makes predictions based on observed data.

  • Method: build a model of the relationship between the observed data and the target data.

  • One common way is to use a probabilistic model.


  • Most of the course: we will focus on building the model.

  • In this tutorial: we will focus on making predictions given a probabilistic model.

🎰 Basic Concepts


An example of a random phenomenon:



We take a glass full of juice and pour it on the floor (don't try this at home) and look at the shape of the resulting spill.

🎰 Basic Concepts - Cont.

| Name | Usually denoted by | Definition | Example |
|------|--------------------|------------|---------|
| A random phenomenon | - | Some process which generates random outcomes | Spilling a glass of juice on the floor and examining the shape of the spill |
| A sample | $\omega$ | A single outcome of the process | Some specific spill shape |
| Sample space | $\Omega$ | The space of all possible outcomes of the given process | The space of all possible spill shapes |
| Random variables (RVs) | $X\left(\omega\right),Y\left(\omega\right),\ldots$ | A function which assigns a real number to a given sample | A function which returns the perimeter of a spill: $X\left(\omega\right)$. A function which returns the area of a spill: $Y\left(\omega\right)$ |
| An event | $A,B,\ldots$ | A collection of samples, i.e., a subset of the sample space $\Omega$. We would often define an event through a condition on random variables | The collection of all spills with a perimeter smaller than 2: $A=\left\{\omega:X\left(\omega\right)<2\right\}$. The collection of all spills with an area larger than 1: $B=\left\{\omega:Y\left(\omega\right)>1\right\}$ |
| Event space | $\mathcal{F}$ | A space of events | The space of all possible collections of spill shapes |
| Probability measure | $Pr\left(\cdot\right)$ | A function which returns the probability of a random sample to be an element in some event | $Pr\left(A\right)=Pr\left(X<2\right)$ |
| Conditional probability measure | $Pr\left(\cdot\middle\vert\cdot\right)$ | A function which returns the probability of a random sample to be an element in event $A$ given that it is an element in event $B$ | The probability of a spill to have a perimeter smaller than 2, given that it has an area larger than 1: $Pr\left(X<2\middle\vert Y>1\right)$ |
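
To make the table concrete in code, here is a minimal simulation sketch (our addition, not part of the original tutorial). Since a juice spill is hard to simulate, a roll of two dice stands in for the random phenomenon; the RVs, events, and estimated probabilities map directly onto the rows above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# The random phenomenon: rolling two dice. Each sample omega is one pair of rolls.
n_samples = 100_000
omega = rng.integers(1, 7, size=(n_samples, 2))

# Random variables: functions which assign a real number to each sample.
x = omega.sum(axis=1)    # X(omega): the sum of the two dice
y = omega.max(axis=1)    # Y(omega): the maximum of the two dice

# Events: subsets of the sample space, defined via conditions on the RVs.
event_a = x >= 10        # A = {omega : X(omega) >= 10}
event_b = y == 6         # B = {omega : Y(omega) = 6}

# The probability measure, estimated as the fraction of samples in each event.
print(f"Pr(A)   ~= {event_a.mean():.3f}")            # exact value: 6/36
print(f"Pr(B)   ~= {event_b.mean():.3f}")            # exact value: 11/36
print(f"Pr(A|B) ~= {event_a[event_b].mean():.3f}")   # exact value: 5/11
```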

🎰 Basic Concepts - Notations

  • Realizations: The outcome of an RV (random variable) for a given sample: $x=X\left(\omega\right)$.

  • We will use $Pr\left(X=x\right)$ as a shorthand for $Pr\left(\left\{\omega:X\left(\omega\right)=x\right\}\right)$.

  • Any function of a random variable (RV) is also an RV.
    For example $X^2\left(\omega\right)$ or $X\left(\omega\right)+Y\left(\omega\right)$.

Slight abuse of notation: We will use the name samples to refer to the realizations.

🎰 Basic Concepts - Cont. 2

Random Vectors

A series of RVs packed together in a vector:

$$\boldsymbol{X}=\begin{bmatrix}X_1\\X_2\\\vdots\\X_n\end{bmatrix}$$

and its realization:

$$\boldsymbol{x}=\begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix}$$

Toy Model:
The 🍕 Pizza & Pancakes 🥞 Drive-Thru


We would like to help a new local drive-thru business that sells pizzas and pancakes, and which would like to be able to make some predictions regarding its business.

โœ๏ธ Exercise 1.1 - Warm-Up

The random phenomenon we will be interested in is that of a customer making an order. Our sample will a single customer making an order. Give examples for:

  • 2 discrete random variables
  • 2 continuous random variables
  • 2 events
  • 2 possible probability measures.
  • A possible probability measure using the union of the 2 events.
  • A possible probability measure using the intersection of the 2 events.
  • A possible probability measure using the exclusion of the 2 events.
  • A possible conditional probability measure.

Make sure that all the probability measures are consistent with one another.

💡 Solution
  • 2 discrete random variables

    • $X\left(\omega\right)$: The number of pizza slices the customer ordered. 🍕
    • $Y\left(\omega\right)$: The number of pancakes the customer ordered. 🥞

  • 2 continuous random variables

    • $L\left(\omega\right)$: The length of the customer's car in meters. 🚗
    • $T\left(\omega\right)$: The amount of tip the customer left in dollars. 💵

  • 2 events

    • The customer ordered exactly 1 slice of pizza: $X=1$.
    • The customer left more than a 2 USD tip: $T>2$.

💡 Solution - Cont.
  • 2 possible probability measures.

    • The prob. a customer will order 1 slice: $$Pr\left(X=1\right)=0.2$$
    • The prob. a customer will leave more than a 2 USD tip: $$Pr\left(T>2\right)=0.5$$

  • A possible probability measure using the union of the 2 events.

    • The prob. of either "will order 1 slice" or "tip more than 2 USD": $$Pr\left(X=1 \cup T>2\right)=0.6$$

💡 Solution - Cont. 2
  • A possible probability measure using the intersection of the 2 events.

    • The prob. of "order 1 slice" and "tip more than 2 USD": $$Pr\left(X=1 \cap T>2\right)=0.1$$

  • A possible probability measure using the exclusion of the 2 events.

    • The prob. of "tip more than 2 USD" but not "order 1 slice": $$Pr\left(T>2 \cap X\neq1\right)=Pr\left(T>2\right)-Pr\left(X=1 \cap T>2\right)=0.5-0.1=0.4$$

  • A possible conditional probability measure.

    • The prob. of "tip more than 2 USD" given "order 1 slice": $$Pr\left(T>2\middle\vert X=1\right)=\frac{Pr\left(X=1 \cap T>2\right)}{Pr\left(X=1\right)}=\frac{0.1}{0.2}=0.5$$

📊 Distributions

Distribution Functions

CDF (Cumulative Distribution Function)

$$F_X\left(x\right)=Pr\left(X\leq x\right)$$

Distribution Functions - Cont.

PMF (Probability Mass Function) - (Discrete)

$$p_X\left(x\right)=Pr\left(X=x\right)$$

PDF (Probability Density Function) - (Continuous)

$$f_X\left(x\right)=\frac{d}{dx}F_X\left(x\right)$$

Or:

$$Pr\left(a\leq X\leq b\right)=\int_a^b f_X\left(x\right)dx$$

Conditional Distribution

CDF: $F_{X\mid A}\left(x\right)=Pr\left(X\leq x\middle\vert A\right)$
PMF: $p_{X\mid A}\left(x\right)=Pr\left(X=x\middle\vert A\right)$
PDF: $f_{X\mid A}\left(x\right)=\frac{d}{dx}F_{X\mid A}\left(x\right)$

Note: A scalar RV is simply a random vector of length 1.
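
As a quick numeric sanity check of these definitions (our addition), the sketch below uses `scipy.stats` to verify that a discrete CDF is the running sum of the PMF, and that a continuous PDF matches the numerical derivative of its CDF:

```python
from scipy import stats

# Discrete RV: for a binomial X, the CDF is the running sum of the PMF.
x_discrete = stats.binom(n=3, p=0.4)
print(x_discrete.cdf(2))                               # Pr(X <= 2)
print(sum(x_discrete.pmf(k) for k in range(3)))        # pmf(0)+pmf(1)+pmf(2), same value

# Continuous RV: for a normal X, the PDF is the derivative of the CDF.
x_continuous = stats.norm(loc=0.0, scale=1.0)
eps = 1e-6                                             # finite-difference derivative of F
print((x_continuous.cdf(0.5 + eps) - x_continuous.cdf(0.5 - eps)) / (2 * eps))
print(x_continuous.pdf(0.5))                           # matches the derivative above
```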

Some Basic Rules

The Law of Total Probability (Marginal Probability)

$$p_X\left(x\right)=\sum_y p_{X,Y}\left(x,y\right)$$

The Conditional Distribution (Definition):

$$p_{Y\mid X}\left(y\middle\vert x\right)=\frac{p_{X,Y}\left(x,y\right)}{p_X\left(x\right)}$$

Some Basic Rules - Cont.

Bayes' Theorem:

$$p_{Y\mid X}\left(y\middle\vert x\right)=\frac{p_{X\mid Y}\left(x\middle\vert y\right)p_Y\left(y\right)}{p_X\left(x\right)}$$

Back to The 🍕 Pizza & Pancakes 🥞 Drive-Thru

✏️ Exercise 1.2 - Discrete Distributions

We are given the following joint distribution:

| 🍕 \ 🥞 | 0 | 1 | 2 | 3 |
|---------|---|---|---|---|
| 0 | 0    | 0.15 | 0.2  | 0.05 |
| 1 | 0.08 | 0.03 | ???  | 0.04 |
| 2 | 0.02 | 0.03 | 0.04 | 0.01 |
| 3 | 0.1  | 0.05 | 0.05 | 0.1  |

A) What is the missing number in the table?

B) What is the probability of 1 slice of pizza, given that 0 pancakes were ordered?

C) What is the probability that 2 independent customers will buy a total of 3 pizza slices?

💡 Solution

A) The sum over all possible outcomes should always be 1, therefore the missing value must be equal to 1 minus the sum of all the other entries: $1-0.95=0.05$.

| 🍕 \ 🥞 | 0 | 1 | 2 | 3 |
|---------|---|---|---|---|
| 0 | 0    | 0.15 | 0.2  | 0.05 |
| 1 | 0.08 | 0.03 | 0.05 | 0.04 |
| 2 | 0.02 | 0.03 | 0.04 | 0.01 |
| 3 | 0.1  | 0.05 | 0.05 | 0.1  |

B) What is the probability of 1 slice of pizza, given that 0 pancakes were ordered?

💡 Solution

B) By definition:

$$Pr\left(X=1\middle\vert Y=0\right)=\frac{Pr\left(X=1,Y=0\right)}{Pr\left(Y=0\right)}=\frac{0.08}{0+0.08+0.02+0.1}=\frac{0.08}{0.2}=0.4$$

C) What is the probability that 2 independent customers will buy a total of 3 pizza slices?

💡 Solution

C)

  • The two customers are independent -> the probability of a pair of orders is the product of the two probabilities.
  • We must sum over all the combinations which add up to 3 slices: (0 & 3, 1 & 2, 2 & 1, 3 & 0).

💡 Solution - Cont.

We shall start by calculating the marginal distribution of the number of pizza slices:

$$Pr\left(X=x\right)=\sum_{y=0}^{3}Pr\left(X=x,Y=y\right)$$

| $x$ | 0 | 1 | 2 | 3 |
|-----|---|---|---|---|
| $Pr\left(X=x\right)$ | 0.4 | 0.2 | 0.1 | 0.3 |

💡 Solution - Cont. 2

Therefore the probability of two customers buying a total of 3 slices is:

$$\sum_{k=0}^{3}Pr\left(X=k\right)Pr\left(X=3-k\right)=0.4\cdot0.3+0.2\cdot0.1+0.1\cdot0.2+0.3\cdot0.4=0.28$$
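
The three answers are easy to verify numerically. A short NumPy check (our addition) that recomputes parts A, B, and C straight from the joint table:

```python
import numpy as np

# Joint PMF p_{X,Y}: rows = pizza slices X (0-3), columns = pancakes Y (0-3).
p_xy = np.array([
    [0.00, 0.15, 0.20, 0.05],
    [0.08, 0.03, 0.05, 0.04],
    [0.02, 0.03, 0.04, 0.01],
    [0.10, 0.05, 0.05, 0.10],
])
assert np.isclose(p_xy.sum(), 1.0)          # (A) the full table sums to 1

# (B) conditional PMF: Pr(X=1 | Y=0) = Pr(X=1, Y=0) / Pr(Y=0)
p_y0 = p_xy[:, 0].sum()                     # marginal Pr(Y=0) = 0.2
print(p_xy[1, 0] / p_y0)                    # Pr(X=1 | Y=0) = 0.4

# (C) marginal of X, then sum over all splits of 3 slices between 2 customers
p_x = p_xy.sum(axis=1)                      # [0.4, 0.2, 0.1, 0.3]
print(sum(p_x[k] * p_x[3 - k] for k in range(4)))   # = 0.28
```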

โœ๏ธ Exercise 1.3 - Mixed Distributions

  • There is a correlation between the length of a car and the number of pizza slices.

  • The length the car, , conditioned on the number of slices , has a normal distribution:

Find the conditional distribution of the number of pizza slices given a car length of 4.4m:

💡 Solution

According to Bayes' theorem:

$$Pr\left(X=x\middle\vert L=4.4\right)=\frac{f_{L\mid X=x}\left(4.4\right)Pr\left(X=x\right)}{f_L\left(4.4\right)}$$

We shall start by calculating the numerator, $f_{L\mid X=x}\left(4.4\right)Pr\left(X=x\right)$, for each value of $x$.

💡 Solution - Cont.

Now we can easily calculate the denominator (the normalization factor) using the law of total probability:

$$f_L\left(4.4\right)=\sum_{x=0}^{3}f_{L\mid X=x}\left(4.4\right)Pr\left(X=x\right)$$

Therefore:

$$Pr\left(X=x\middle\vert L=4.4\right)=\frac{f_{L\mid X=x}\left(4.4\right)Pr\left(X=x\right)}{\sum_{x'=0}^{3}f_{L\mid X=x'}\left(4.4\right)Pr\left(X=x'\right)}$$
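
The same Bayes computation can be scripted. Since the exercise's Gaussian parameters are not reproduced above, the sketch below (our addition) uses placeholder means and a placeholder standard deviation purely to illustrate the mechanics; only the marginal PMF comes from Exercise 1.2:

```python
import numpy as np
from scipy import stats

p_x = np.array([0.4, 0.2, 0.1, 0.3])        # marginal PMF of pizza slices (Ex. 1.2)

# Hypothetical conditional model: L | X = x is normal. These means and this
# standard deviation are placeholders, not the tutorial's actual values.
mu = np.array([3.8, 4.2, 4.6, 5.0])         # assumed mean car length per slice count
sigma = 0.5                                  # assumed common standard deviation

l_obs = 4.4
likelihood = stats.norm.pdf(l_obs, loc=mu, scale=sigma)   # f_{L|X=x}(4.4)
numerator = likelihood * p_x                               # f_{L|X=x}(4.4) * Pr(X=x)
posterior = numerator / numerator.sum()                    # normalize by f_L(4.4)
print(posterior)                             # Pr(X=x | L=4.4) for x = 0, 1, 2, 3
```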

๐Ÿ—œ๏ธ Expectations

The expectation value\Mean


Variance


is the standard deviation

๐Ÿ—œ๏ธ Expectations - Cont.

Covariance

For a pair of RV, the covariance is defined by:


Covariance Matrix

For a random vector:

This can also be written as:
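
A quick numerical check of the two covariance formulas (our addition): the definition-based estimate agrees with NumPy's built-in `np.cov` on simulated data with a known covariance:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated 2D data with a known linear mixing, so the true covariance is known.
a = np.array([[1.0, 0.5],
              [0.0, 1.0]])
x = rng.standard_normal((100_000, 2)) @ a    # true covariance: a.T @ a

mu = x.mean(axis=0)                           # estimated mean vector
# Covariance matrix straight from the definition E[(X - mu)(X - mu)^T]:
sigma_manual = (x - mu).T @ (x - mu) / len(x)
print(sigma_manual)
print(np.cov(x, rowvar=False))                # built-in estimate (1/(n-1) scaling)
print(a.T @ a)                                # the true covariance, for comparison
```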

Gaussian Vectors
(Multivariate Normal Distribution)


A random vector where any linear combination of its components has a normal distribution. Its PDF is given by:

$$f_{\boldsymbol{X}}\left(\boldsymbol{x}\right)=\frac{1}{\sqrt{\left(2\pi\right)^n\left|\Sigma\right|}}\exp\left(-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)^T\Sigma^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)\right)$$

Gaussian Vectors
(Multivariate Normal Distribution) - Cont.

Conditional Distributions

If we split a Gaussian vector into two parts:

$$\boldsymbol{X}=\begin{bmatrix}\boldsymbol{X}_1\\\boldsymbol{X}_2\end{bmatrix}$$

and split its mean and covariance matrix accordingly:

$$\boldsymbol{\mu}=\begin{bmatrix}\boldsymbol{\mu}_1\\\boldsymbol{\mu}_2\end{bmatrix},\qquad\Sigma=\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{bmatrix}$$

then the distribution of $\boldsymbol{X}_1$ given $\boldsymbol{X}_2=\boldsymbol{x}_2$ will also be a multivariate normal distribution.

Gaussian Vectors
(Multivariate Normal Distribution) - Cont. 2

Conditional Distributions - Cont.

The conditional distribution will have the following parameters:

$$\boldsymbol{\mu}_{1\mid2}=\boldsymbol{\mu}_1+\Sigma_{12}\Sigma_{22}^{-1}\left(\boldsymbol{x}_2-\boldsymbol{\mu}_2\right)$$

$$\Sigma_{1\mid2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$


โœ๏ธ Exercise 1.4
Conditional Multivariate Normal Distribution

A) We are given the following parameters of a Gaussian vector:

Calculate

๐Ÿ’ก Solution



Therefore:

โœ๏ธ Exercise 1.4
Conditional Multivariate Normal Distribution

B) We are given the following parameters of a Gaussian vector:

Calculate

๐Ÿ’ก Solution



Therefore:

โœ๏ธ Exercise 1.4
Conditional Multivariate Normal Distribution

C) We are given the following parameters of a Gaussian vector:

Calculate

๐Ÿ’ก Solution



Therefore:
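
Parts A-C all follow the same two conditional formulas. Since the exercise's concrete numbers are not reproduced above, here is a minimal NumPy sketch of those formulas (our addition) applied to made-up parameters:

```python
import numpy as np

# Illustrative parameters only; these are not the values from parts A-C.
mu = np.array([1.0, 2.0])                     # [mu_1, mu_2]
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                # [[S_11, S_12], [S_21, S_22]]

x2 = 3.0                                      # the observed value of X_2
# mu_{1|2} = mu_1 + S_12 * S_22^{-1} * (x_2 - mu_2)
mu_cond = mu[0] + sigma[0, 1] / sigma[1, 1] * (x2 - mu[1])
# S_{1|2} = S_11 - S_12 * S_22^{-1} * S_21
var_cond = sigma[0, 0] - sigma[0, 1] * sigma[1, 0] / sigma[1, 1]
print(mu_cond, var_cond)                      # X_1 | X_2=3 ~ N(1.6, 1.64)
```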

Prediction Theory

🔮 The Problem

Give an "optimal" prediction for the number of slices a customer will buy given that their car length is 4.4m.


We must first define what we mean by an "optimal" prediction.


A few possible definitions:

  1. The prediction with the lowest probability of being wrong.
  2. The prediction with the lowest error on average.
  3. The prediction with the lowest squared error on average.


  • Each of the above definitions results in a different prediction.
  • There is an infinite number of ways to define the optimal prediction.
  • Each definition is preferable in different scenarios.

📉 The Loss and The Risk Functions


In most cases we are searching for the prediction which minimizes some penalty function, on average.


Let's write it in a more formal way.

📉 The Loss and The Risk Functions - Cont.

Notations:

  • $\hat{x}$ - a prediction for $x$.

  • Loss - a function which measures how bad a prediction is: $l\left(\hat{x},x\right)$

  • Risk - the expectation value of the loss function: $R\left(\hat{x}\right)=\mathbb{E}\left[l\left(\hat{x},X\right)\right]$


The "optimal" prediction is the solution to the following optimization problem:

$$\hat{x}^*=\underset{\hat{x}}{\arg\min}\ R\left(\hat{x}\right)=\underset{\hat{x}}{\arg\min}\ \mathbb{E}\left[l\left(\hat{x},X\right)\right]$$

📉 The Loss and The Risk Functions - Cont. 2


Back to the 3 examples from before:


| Definition | Loss | Loss Name | Risk Name |
|------------|------|-----------|-----------|
| Lowest probability to be wrong | $l\left(\hat{x},x\right)=I\left\{\hat{x}\neq x\right\}$ | Zero-one loss | Misclassification rate |
| Lowest error on average | $l\left(\hat{x},x\right)=\left\vert\hat{x}-x\right\vert$ | $\ell_1$ loss | MAE (mean absolute error) |
| Lowest squared error on average | $l\left(\hat{x},x\right)=\left(\hat{x}-x\right)^2$ | $\ell_2$ loss | MSE (mean squared error) |

🔧 A Note About Optimization Problems


Optimization problems are problems of the following form:

$$\begin{aligned}
x^*=\underset{x}{\arg\min}\quad&f\left(x\right)\\
\text{subject to}\quad&g_i\left(x\right)\leq0,\quad i=1,\ldots,m\\
&h_j\left(x\right)=0,\quad j=1,\ldots,p
\end{aligned}$$

  • $x$ is either a scalar or a vector.
  • $f\left(x\right)$ is an arbitrary function called the objective.
  • The $h_j$s and $g_i$s are optional additional equality and inequality constraints.


Goal: to find the $x^*$ which produces the minimal value of $f\left(x\right)$ among all $x$s which obey the constraints.

🔧 A Note About Optimization Problems - Cont.

In this course, we will care about 4 cases:


  • No constraints and analytically solvable by differentiating and comparing to zero (for simple functions).


  • With equality constraints and analytically solvable using Lagrange Multipliers.


  • Brute force, i.e. checking all possible solutions (for discrete cases with a small set of values).


  • Using a method called gradient descent, which we will present later in this course.



โœ๏ธ Exercise 1.5

A) Show that in the discrete case, the optimal predictor for the zero-one loss is the point with the maximal PMF:


๐Ÿ’ก Solution

A) The zero one-loss is given by , therefore:



โœ๏ธ Exercise 1.5

B) Show that in the continuous case, the optimal predictor for the loss is the median:



(in the discrete case it is the point closest to being the median)

๐Ÿ’ก Solution

B) The loss is given by . Therefore:



We will solve this by differentiating according to zero

๐Ÿ’ก Solution - Cont.



โœ๏ธ Exercise 1.5

C) Show that the optimal predictor for the loss is the mean:


๐Ÿ’ก Solution

C) The loss is given by . Therefore:



We will solve this by differentiating according to zero

๐Ÿ’ก Solution - Cont.
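
These three results also line up with the "brute force" option from the optimization note above: for a discrete distribution we can simply scan candidate predictions and pick the one with the smallest risk. A minimal sketch of that check (our addition), reusing the pizza-slice marginal from Exercise 1.2 as the PMF:

```python
import numpy as np

values = np.arange(4)                        # support of X: 0, 1, 2, 3
p_x = np.array([0.4, 0.2, 0.1, 0.3])         # the pizza-slice marginal PMF

def risk(x_hat, loss):
    """Risk R(x_hat) = E[l(x_hat, X)] under the PMF p_x."""
    return sum(p * loss(x_hat, x) for p, x in zip(p_x, values))

losses = {
    "zero-one": lambda a, b: float(not np.isclose(a, b)),
    "l1":       lambda a, b: abs(a - b),
    "l2":       lambda a, b: (a - b) ** 2,
}

candidates = np.linspace(0.0, 3.0, 301)      # brute-force search grid
for name, loss in losses.items():
    best = min(candidates, key=lambda c: risk(c, loss))
    print(f"{name:8s}: optimal prediction = {best:.2f}")
# zero-one -> 0.00 (argmax of the PMF), l1 -> 1.00 (median), l2 -> 1.30 (mean)
```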

โœ๏ธ Exercise 1.6 - Predicting The Number of Pizza Slices


We would now like to make a prediction for the number of pizza slices given a car length of 4.4m.


Calculate the optimal prediction for each of the 3 loss functions.

๐Ÿ’ก Solution

We have found earlier that:

Therefore:

Zero-one loss:

loss:

The value closest to the median is

loss:

โœ๏ธ Exercise 1.7 Heart Rate and Blinking

A new research has found a connection between the blinking rate of a person and his heart rate.

  • : A personโ€™s heart rate.
  • : A personโ€™s blinking rate.
  • has a multivariate normal distribution with:


Find the function which predicts given .

๐Ÿ’ก Solution

Using the formula for the conditional multivariate normal distribution:


Therefore:

๐Ÿ’ก Solution - Cont.

The mean, the median and the point of the maximal distribution are all the same point. Therefore all the three predictors will give the same prediction:




This will always be true for symmetrical distributions such as the normal distribution.

Lagrange Multipliers

A method for solving optimization problems with only equality constraints:

$$\begin{aligned}
x^*=\underset{x}{\arg\min}\quad&f\left(x\right)\\
\text{subject to}\quad&h_j\left(x\right)=0,\quad j=1,\ldots,p
\end{aligned}$$

$f$ and the $h_j$s must all be differentiable.


Lagrange Multipliers - Cont.


We will define:

  • $\lambda_j$s: one Lagrange multiplier for every constraint.
  • The Lagrangian:

$$\mathcal{L}\left(x,\boldsymbol{\lambda}\right)=f\left(x\right)+\sum_j\lambda_j h_j\left(x\right)$$


If $x^*$ is a solution of the original optimization problem, then there exists some set of values $\lambda_j^*$s such that $\left(x^*,\boldsymbol{\lambda}^*\right)$ is a stationary point of the Lagrangian:

$$\nabla_x\mathcal{L}\left(x^*,\boldsymbol{\lambda}^*\right)=0,\qquad\nabla_{\boldsymbol{\lambda}}\mathcal{L}\left(x^*,\boldsymbol{\lambda}^*\right)=0$$

โœ๏ธ Exercise 1.8 Lagrange Multipliers


Solve the following optimization problem:


๐Ÿ’ก Solution

We will introduce one Lagrange multiplier and define the following Lagrangian:


By differentiating and comparing to zero we get the following result:
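
As a stand-in example of the method (our addition; this is not the exercise's own problem, which is not shown above), consider minimizing $f\left(x,y\right)=x^2+y^2$ subject to $x+y=1$, solved symbolically with SymPy:

```python
import sympy as sp

# Stand-in problem: minimize f(x, y) = x^2 + y^2 subject to h(x, y) = x + y - 1 = 0.
x, y, lam = sp.symbols("x y lambda")
f = x**2 + y**2
h = x + y - 1

# Lagrangian: L(x, y, lambda) = f + lambda * h
lagrangian = f + lam * h

# Differentiate with respect to every variable and solve the stationarity system.
stationary = sp.solve(
    [sp.diff(lagrangian, v) for v in (x, y, lam)], [x, y, lam], dict=True
)
print(stationary)    # [{x: 1/2, y: 1/2, lambda: -1}]
```

Differentiating by hand gives the same system: $2x+\lambda=0$, $2y+\lambda=0$, $x+y-1=0$, whose solution is $x=y=\frac{1}{2}$, $\lambda=-1$.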