046195

Tutorial 3

Parametric Probability Density Estimation

Hands-on

🚖 Reminder: The NYC Taxi Dataset

The first 10 out of 100k taxi rides in NYC.

| | passenger_count | trip_distance | payment_type | fare_amount | tip_amount | pickup_easting | pickup_northing | dropoff_easting | dropoff_northing | duration | day_of_week | day_of_month | time_of_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2.768065 | 2 | 9.5 | 0.00 | 586.996941 | 4512.979705 | 588.155118 | 4515.180889 | 11.516667 | 3 | 13 | 12.801944 |
| 1 | 1 | 3.218680 | 2 | 10.0 | 0.00 | 587.151523 | 4512.923924 | 584.850489 | 4512.632082 | 12.666667 | 6 | 16 | 20.961389 |
| 2 | 1 | 2.574944 | 1 | 7.0 | 2.49 | 587.005357 | 4513.359700 | 585.434188 | 4513.174964 | 5.516667 | 0 | 31 | 20.412778 |
| 3 | 1 | 0.965604 | 1 | 7.5 | 1.65 | 586.648975 | 4511.729212 | 586.671530 | 4512.554065 | 9.883333 | 1 | 25 | 13.031389 |
| 4 | 1 | 2.462290 | 1 | 7.5 | 1.66 | 586.967178 | 4511.894301 | 585.262474 | 4511.755477 | 8.683333 | 2 | 5 | 7.703333 |
| 5 | 5 | 1.561060 | 1 | 7.5 | 2.20 | 585.926415 | 4512.880385 | 585.168973 | 4511.540103 | 9.433333 | 3 | 20 | 20.667222 |
| 6 | 1 | 2.574944 | 1 | 8.0 | 1.00 | 586.731409 | 4515.084445 | 588.710175 | 4514.209184 | 7.950000 | 5 | 8 | 23.841944 |
| 7 | 1 | 0.804670 | 2 | 5.0 | 0.00 | 585.344614 | 4509.712541 | 585.843967 | 4509.545089 | 4.950000 | 5 | 29 | 15.831389 |
| 8 | 1 | 3.653202 | 1 | 10.0 | 1.10 | 585.422062 | 4509.477536 | 583.671081 | 4507.735573 | 11.066667 | 5 | 8 | 2.098333 |
| 9 | 6 | 1.625433 | 1 | 5.5 | 1.36 | 587.875433 | 4514.931073 | 587.701248 | 4513.709691 | 4.216667 | 3 | 13 | 21.783056 |

❓️ Same Problem: Estimating the Distribution of Trip Duration

We would like to estimate the distribution of the ride durations and represent it as a CDF or a PDF.

💡 Method I: Normal Distribution + MLE

  • In this case we will try to use a normal distribution as our parametric model.
  • The model parameters are its mean value $\mu$ and standard deviation $\sigma$.

Assumptions and notations:

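  • $N$ - The number of sample points in the dataset.
  • $\boldsymbol{\theta}=[\mu,\sigma]^\top$ - The vector of parameters.
  • $p_\text{normal}(x_i;\boldsymbol{\theta})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right),\ i=1,\dots,N$ - our model, where $x_i$ is the duration of the $i$-th ride.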

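The negative log-likelihood function for the normal distribution model, with mean $\mu$, standard deviation $\sigma$, parameter vector $\boldsymbol{\theta}=[\mu,\sigma]^\top$ and $N$ observed durations $x_i$, is then:

$$\ell_\text{normal}(\boldsymbol{\theta};\{x_i\})=-\sum_{i=1}^N\log p_\text{normal}(x_i;\boldsymbol{\theta})=\frac{N}{2}\log\left(2\pi\sigma^2\right)+\frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2$$

Under the MLE approach, the optimal parameters for the model are given by

$$\boldsymbol{\theta}^*=\underset{\boldsymbol{\theta}}{\arg\min}\ \ell_\text{normal}(\boldsymbol{\theta};\{x_i\})$$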

In the special case of MLE and a normal distribution, the optimization problem can be solved analytically. Sadly, this will not be true in the general case, and we will have to resort to numerical solutions.

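We will find the solution to this optimization problem by setting the derivatives of the log-likelihood function with respect to the mean $\mu$ and the standard deviation $\sigma$ to zero. For $N$ observed durations $x_i$ this gives the sample mean and the (biased) sample standard deviation:

$$\mu^*=\frac{1}{N}\sum_{i=1}^N x_i,\qquad \sigma^*=\sqrt{\frac{1}{N}\sum_{i=1}^N\left(x_i-\mu^*\right)^2}$$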

Which results in our case in:

We will plot the estimated PDF on top of the histogram.

[Figure: the estimated normal PDF plotted on top of the histogram of ride durations.]
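The closed-form normal MLE can be sketched in a few lines of NumPy. This is a minimal illustration, not the tutorial's own code: the `durations` array is a synthetic stand-in (an assumption, since the real dataset file is not part of this page), and the helper names `fit_normal_mle` and `normal_pdf` are hypothetical.

```python
import numpy as np


def fit_normal_mle(x):
    """Closed-form MLE for a normal model: sample mean and biased std."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = np.sqrt(np.mean((x - mu) ** 2))  # divides by N, not N - 1
    return mu, sigma


def normal_pdf(x, mu, sigma):
    """PDF of the normal distribution evaluated at x."""
    x = np.asarray(x, dtype=float)
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)


# Synthetic stand-in for the "duration" column (NOT the real taxi data).
rng = np.random.default_rng(0)
durations = rng.gamma(shape=2.5, scale=4.5, size=100_000)

mu_hat, sigma_hat = fit_normal_mle(durations)

# Evaluate the fitted PDF on a grid, ready to be drawn over a histogram.
grid = np.linspace(0, 60, 601)
pdf_on_grid = normal_pdf(grid, mu_hat, sigma_hat)
print(f"mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

To reproduce a plot like the one above, `pdf_on_grid` can be drawn over a density-normalized histogram of `durations` with any plotting library.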

It seems that the normal distribution gives a very rough approximation of the real distribution. In some cases this would be good enough as a first order approximation, but in this case we would like to do better.

One very disturbing fact, for example, is that the model assigns a nonzero probability to negative ride durations, which is obviously not realistic.
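To get a feeling for how much probability mass a fitted normal model places on negative durations, one can evaluate its CDF at zero. The sketch below uses only the standard library; the values of `mu` and `sigma` are hypothetical placeholders chosen for illustration and are not the estimates obtained in this tutorial.

```python
from math import erf, sqrt


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, written with the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


# Hypothetical parameter values, for illustration only.
mu, sigma = 11.0, 7.0

# Probability that the model assigns to a negative ride duration.
p_negative = normal_cdf(0.0, mu, sigma)
print(f"P(duration < 0) = {p_negative:.4f}")
```

The smaller the ratio $\mu/\sigma$, the larger this probability becomes, which is why the issue is noticeable for data concentrated near zero.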

Let us try to propose a better model in order to get a better approximation.

💡 Method II : Rayleigh Distribution + MLE

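The Rayleigh distribution describes the distribution of the magnitude of a 2D Gaussian vector with zero mean, equal variances and no correlation between its two components. In other words, if $\boldsymbol{Z}$ has the following distribution:

$$\boldsymbol{Z}\sim N\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}\sigma^2&0\\0&\sigma^2\end{bmatrix}\right)$$

Then $\lVert\boldsymbol{Z}\rVert_2$ has a Rayleigh distribution.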

The PDF of the Rayleigh distribution is given by:

$$p_\text{Rayleigh}(z;\sigma)=\frac{z}{\sigma^2}\exp\left(-\frac{z^2}{2\sigma^2}\right),\quad z\geq 0$$

Notice that here the distribution is only defined for non-negative values. The Rayleigh distribution has only one parameter, $\sigma$, which is called the scale parameter. Unlike in the case of the normal distribution, here $\sigma$ is not equal to the standard deviation.

For consistency we will denote the 1D vector of parameters: $\boldsymbol{\theta}=[\sigma]$
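The definition can be checked with a quick simulation: draw a 2D Gaussian vector with uncorrelated, zero-mean, equal-variance components and look at its magnitude. The sketch below compares the empirical mean and standard deviation of the magnitude with the known Rayleigh values $\sigma\sqrt{\pi/2}$ and $\sigma\sqrt{(4-\pi)/2}$, which also shows that the scale parameter is not the standard deviation. The value of `sigma` is arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0  # scale parameter; arbitrary illustrative value

# 2D Gaussian vectors with zero mean and independent components of std sigma.
z = rng.normal(loc=0.0, scale=sigma, size=(200_000, 2))
r = np.linalg.norm(z, axis=1)  # magnitudes, Rayleigh distributed

# Known moments of the Rayleigh distribution with scale sigma.
theory_mean = sigma * np.sqrt(np.pi / 2)
theory_std = sigma * np.sqrt((4 - np.pi) / 2)

print("mean:", r.mean(), "theory:", theory_mean)
print("std: ", r.std(), "theory:", theory_std)
```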

We will give a short motivation for preferring the Rayleigh distribution.

Motivation For Using Rayleigh Distribution

We started with the assumption that the duration of a taxi ride is normally distributed. Let us instead assume that the quantity which is normally distributed is the 2D distance vector $\boldsymbol{Z}$ between the pickup location and the drop-off location.

In other words, we are assuming that the random variable $\boldsymbol{Z}=[Z_x,Z_y]^\top$ is a 2D Gaussian vector. For simplicity, we will also assume that the $x$ and $y$ components of $\boldsymbol{Z}$ are uncorrelated with equal variance and zero mean, i.e. we assume that:

$$\boldsymbol{Z}\sim N\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}\sigma_Z^2&0\\0&\sigma_Z^2\end{bmatrix}\right)$$

In addition, let us also assume that the taxi's speed, $v$, is constant. Therefore the relation between the ride duration $T$ and the distance vector is:

$$T=\frac{\lVert\boldsymbol{Z}\rVert_2}{v}$$

In this case $T$ will have a Rayleigh distribution with a scale parameter $\sigma_T=\frac{\sigma_Z}{v}$.

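The model in this case, for $N$ observed durations $x_i$ and a scale parameter $\sigma$, will be:

$$p_\text{Rayleigh}(x_i;\boldsymbol{\theta})=\frac{x_i}{\sigma^2}\exp\left(-\frac{x_i^2}{2\sigma^2}\right),\quad i=1,\dots,N$$

The negative log-likelihood function will be:

$$\ell_\text{Rayleigh}(\boldsymbol{\theta};\{x_i\})=-\sum_{i=1}^N\log p_\text{Rayleigh}(x_i;\boldsymbol{\theta})=-\sum_{i=1}^N\log x_i+2N\log\sigma+\frac{1}{2\sigma^2}\sum_{i=1}^N x_i^2$$

Our optimization problem will now be:

$$\sigma^*=\underset{\sigma}{\arg\min}\ \ell_\text{Rayleigh}(\boldsymbol{\theta};\{x_i\})$$

This optimization problem can be solved analytically by setting the derivative with respect to $\sigma$ to zero. The solution will be:

$$\sigma^*=\sqrt{\frac{1}{2N}\sum_{i=1}^N x_i^2}$$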

Which results in:

[Figure: the estimated Rayleigh PDF plotted on top of the histogram of ride durations.]
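The closed-form Rayleigh estimate, the square root of half the mean of the squared samples, can be sketched as follows. The data here is synthetic (drawn from a Rayleigh distribution with a hypothetical scale `true_sigma`), and the helper names `fit_rayleigh_mle` and `rayleigh_nll` are illustrative assumptions, not part of the tutorial's code.

```python
import numpy as np


def fit_rayleigh_mle(x):
    """Closed-form MLE of the Rayleigh scale: sqrt(sum(x^2) / (2N))."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.sum(x ** 2) / (2 * x.size))


def rayleigh_nll(x, sigma):
    """Negative log-likelihood of a Rayleigh model with scale sigma."""
    x = np.asarray(x, dtype=float)
    return (-np.sum(np.log(x))
            + 2 * x.size * np.log(sigma)
            + np.sum(x ** 2) / (2 * sigma ** 2))


# Synthetic durations with a hypothetical true scale (NOT the taxi data).
rng = np.random.default_rng(1)
true_sigma = 9.0
durations = rng.rayleigh(scale=true_sigma, size=50_000)

sigma_hat = fit_rayleigh_mle(durations)

# Sanity check: the NLL at the estimate is lower than at nearby scales.
nearby = [rayleigh_nll(durations, s)
          for s in (0.95 * sigma_hat, sigma_hat, 1.05 * sigma_hat)]
print(f"sigma_hat = {sigma_hat:.3f}")
```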

Judging by the similarity to the histogram, the Rayleigh distribution does a slightly better job at approximating the distribution and solves the negative values problem.

Let us try one more model.

💡 Method III: Generalized Gamma Distribution + MLE

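The Rayleigh distribution is a special case of a more general family of distributions called the Generalized Gamma distribution. The PDF of the Generalized Gamma distribution is given by the following expression:

$$p_\text{gengamma}(z;\sigma,a,c)=\frac{c\,z^{ca-1}\exp\left(-(z/\sigma)^c\right)}{\sigma^{ca}\,\Gamma(a)},\quad z>0$$

(here $\Gamma$ is the gamma function)

This model has 3 parameters: $\boldsymbol{\theta}=[\sigma,a,c]^\top$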

For $c=2$ and $a=1$ we get the Rayleigh distribution (where $\sigma_{gamma}=\sqrt{2}\,\sigma_{rayleigh}$).
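This special case is easy to verify numerically. The sketch below assumes SciPy's parameterization of `scipy.stats.gengamma` (shape parameters `a` and `c`, plus `scale`) and compares its PDF for `a = 1`, `c = 2` with the Rayleigh PDF. The value of `sigma_rayleigh` is arbitrary.

```python
import numpy as np
from scipy import stats

sigma_rayleigh = 3.0  # arbitrary illustrative scale
x = np.linspace(0.01, 20.0, 500)

pdf_rayleigh = stats.rayleigh.pdf(x, scale=sigma_rayleigh)

# Generalized Gamma with a = 1, c = 2 and scale = sqrt(2) * sigma_rayleigh.
pdf_gengamma = stats.gengamma.pdf(x, 1.0, 2.0,
                                  scale=np.sqrt(2.0) * sigma_rayleigh)

max_abs_diff = np.max(np.abs(pdf_rayleigh - pdf_gengamma))
print("largest difference between the two PDFs:", max_abs_diff)
```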

Unlike the case of the normal and Rayleigh distributions, here we will not be able to find a simple analytic solution for the optimal MLE parameters. However, we can use numerical methods to find them. In practice we will use SciPy's model for the Generalized Gamma distribution (`scipy.stats.gengamma`) to find the optimal parameters. You will use a similar method in your homework assignments.

By using SciPy’s numerical solver we get the following result:

[Figure: the estimated Generalized Gamma PDF plotted on top of the histogram of ride durations.]
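A numerical fit of this kind might look like the sketch below, which uses the generic `fit` method of `scipy.stats.gengamma`. It is an illustration under stated assumptions rather than the tutorial's actual code: the data is synthetic, the parameters used to generate it are hypothetical, and fixing the location to zero with `floc=0` is a modeling choice made here so that only $\sigma$, $a$ and $c$ are estimated.

```python
import numpy as np
from scipy import stats

# Synthetic durations from a Generalized Gamma distribution with
# hypothetical parameters (NOT the real taxi data).
rng = np.random.default_rng(2)
durations = stats.gengamma.rvs(2.0, 1.2, scale=5.0, size=5_000,
                               random_state=rng)

# Numerical MLE; the location is fixed to 0 so only a, c and scale are fitted.
a_hat, c_hat, loc_hat, scale_hat = stats.gengamma.fit(durations, floc=0)

# Negative log-likelihood of the data under the fitted model.
nll_fit = -np.sum(stats.gengamma.logpdf(durations, a_hat, c_hat,
                                        loc=loc_hat, scale=scale_hat))
print(f"a = {a_hat:.3f}, c = {c_hat:.3f}, scale = {scale_hat:.3f}")
```

The three parameters are only weakly identifiable from data, so the fitted values can differ noticeably from the generating ones even when the fitted PDF matches the histogram well.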

The Generalized Gamma distribution results in a PDF whose shape is much more similar to that of the histogram.

Attributions