Hands-on
🚖 Reminder: The NYC Taxi Dataset
The first 10 out of 100k taxi rides in NYC.
| | passenger_count | trip_distance | payment_type | fare_amount | tip_amount | pickup_easting | pickup_northing | dropoff_easting | dropoff_northing | duration | day_of_week | day_of_month | time_of_day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 2.768065 | 2 | 9.5 | 0.00 | 586.996941 | 4512.979705 | 588.155118 | 4515.180889 | 11.516667 | 3 | 13 | 12.801944 |
1 | 1 | 3.218680 | 2 | 10.0 | 0.00 | 587.151523 | 4512.923924 | 584.850489 | 4512.632082 | 12.666667 | 6 | 16 | 20.961389 |
2 | 1 | 2.574944 | 1 | 7.0 | 2.49 | 587.005357 | 4513.359700 | 585.434188 | 4513.174964 | 5.516667 | 0 | 31 | 20.412778 |
3 | 1 | 0.965604 | 1 | 7.5 | 1.65 | 586.648975 | 4511.729212 | 586.671530 | 4512.554065 | 9.883333 | 1 | 25 | 13.031389 |
4 | 1 | 2.462290 | 1 | 7.5 | 1.66 | 586.967178 | 4511.894301 | 585.262474 | 4511.755477 | 8.683333 | 2 | 5 | 7.703333 |
5 | 5 | 1.561060 | 1 | 7.5 | 2.20 | 585.926415 | 4512.880385 | 585.168973 | 4511.540103 | 9.433333 | 3 | 20 | 20.667222 |
6 | 1 | 2.574944 | 1 | 8.0 | 1.00 | 586.731409 | 4515.084445 | 588.710175 | 4514.209184 | 7.950000 | 5 | 8 | 23.841944 |
7 | 1 | 0.804670 | 2 | 5.0 | 0.00 | 585.344614 | 4509.712541 | 585.843967 | 4509.545089 | 4.950000 | 5 | 29 | 15.831389 |
8 | 1 | 3.653202 | 1 | 10.0 | 1.10 | 585.422062 | 4509.477536 | 583.671081 | 4507.735573 | 11.066667 | 5 | 8 | 2.098333 |
9 | 6 | 1.625433 | 1 | 5.5 | 1.36 | 587.875433 | 4514.931073 | 587.701248 | 4513.709691 | 4.216667 | 3 | 13 | 21.783056 |
❓️ Same Problem: Estimating the Distribution of Trip Duration
We would like to estimate the distribution of the ride durations and represent it as a CDF or a PDF.
💡 Method I: Normal Distribution + MLE
- In this case we will try to use a normal distribution as our parametric model.
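- The model parameters are its mean value $\mu$ and standard deviation $\sigma$.
Assumptions and notations:
- $N$ - The number of sample points in the dataset.
- $\boldsymbol{\theta}=[\mu,\sigma]^\top$ - The vector of parameters.
- $p_\text{normal}(t;\boldsymbol{\theta})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)$ - Our model, where $t$ is a ride duration.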
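The negative log-likelihood function for the normal distribution model (with mean $\mu$ and standard deviation $\sigma$) is then, with $t_i$ denoting the duration of the $i$-th ride:
$$l_\text{normal}(\boldsymbol{\theta};\{t_i\})=-\sum_{i=1}^N\log p_\text{normal}(t_i;\boldsymbol{\theta})=\frac{N}{2}\log\left(2\pi\sigma^2\right)+\frac{1}{2\sigma^2}\sum_{i=1}^N(t_i-\mu)^2$$
Under the MLE approach, the optimal parameters for the model are given by:
$$\boldsymbol{\theta}^*=\underset{\boldsymbol{\theta}}{\arg\min}\ l_\text{normal}(\boldsymbol{\theta};\{t_i\})$$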
In the special case of MLE and a normal distribution, the optimization problem can be solved analytically. Sadly, this will not be true in the general case, and we will have to resort to numerical solutions.
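We will find the solution to this optimization problem by setting the derivatives of the negative log-likelihood function with respect to the mean $\mu$ and the standard deviation $\sigma$ to zero. This gives the sample mean and the (biased) sample standard deviation of the durations $t_i$:
$$\mu^*=\frac{1}{N}\sum_{i=1}^N t_i,\qquad \sigma^*=\sqrt{\frac{1}{N}\sum_{i=1}^N(t_i-\mu^*)^2}$$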
Which results in our case in:
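As a rough illustration of the closed-form solution, the sketch below evaluates the sample mean and the biased sample standard deviation on the 10 durations shown in the table at the top of this section. This is only a stand-in for the full 100k-ride dataset, so the numbers it prints will differ from the full-data result; the function name `normal_mle` is made up for this example.

```python
import numpy as np

# The `duration` column of the 10 rides shown in the table above
# (a small stand-in for the full dataset).
durations = np.array([11.516667, 12.666667, 5.516667, 9.883333, 8.683333,
                      9.433333, 7.950000, 4.950000, 11.066667, 4.216667])


def normal_mle(t):
    """Closed-form MLE for a normal model: sample mean and biased sample std."""
    mu = t.mean()
    sigma = np.sqrt(((t - mu) ** 2).mean())  # divides by N, not N - 1
    return mu, sigma


mu_hat, sigma_hat = normal_mle(durations)
print(f"mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

Note that the MLE of $\sigma$ divides by $N$, which matches NumPy's default `np.std(..., ddof=0)`.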
We will plot the estimated PDF on top of the histogram.
It seems that the normal distribution gives a very rough approximation of the real distribution. In some cases this would be good enough as a first order approximation, but in this case we would like to do better.
One particularly disturbing fact, for example, is that the model assigns a non-zero probability to negative ride durations, which is obviously not realistic.
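To make the negative-duration issue concrete, the probability mass that a fitted normal model places below zero is its CDF evaluated at 0. The sketch below computes it with `scipy.stats.norm`, fitting the model to the 10 durations from the table at the top of this section (a small stand-in for the full dataset, so the value is only indicative).

```python
import numpy as np
from scipy.stats import norm

# The `duration` column of the 10 rides shown in the table above.
durations = np.array([11.516667, 12.666667, 5.516667, 9.883333, 8.683333,
                      9.433333, 7.950000, 4.950000, 11.066667, 4.216667])

# MLE of the normal model: sample mean and biased (ddof=0) standard deviation.
mu_hat = durations.mean()
sigma_hat = durations.std()

# Probability that the fitted normal model assigns to negative durations.
p_negative = norm.cdf(0.0, loc=mu_hat, scale=sigma_hat)
print(f"P(duration < 0) under the fitted normal model: {p_negative:.2e}")
```

The probability is small but strictly positive, for any finite mean and positive standard deviation.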
Let us try to propose a better model in order to get a better approximation.
💡 Method II: Rayleigh Distribution + MLE
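The Rayleigh distribution describes the distribution of the magnitude of a 2D Gaussian vector with zero mean, equal variances, and no correlation between its two components. In other words, if $\boldsymbol{Z}$ has the following distribution:
$$\boldsymbol{Z}\sim N\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}\sigma^2&0\\0&\sigma^2\end{bmatrix}\right)$$
Then $\lVert\boldsymbol{Z}\rVert_2$ has a Rayleigh distribution.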
The PDF of the Rayleigh distribution is given by:
$$p_\text{Rayleigh}(z;\sigma)=\frac{z}{\sigma^2}\exp\left(-\frac{z^2}{2\sigma^2}\right),\quad z\geq0$$
Notice that here the distribution is only defined for non-negative values. The Rayleigh distribution has only one parameter, $\sigma$, which is called the scale parameter. Unlike in the case of the normal distribution, here $\sigma$ is not equal to the standard deviation.
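To see that the scale parameter is not the standard deviation, recall that a Rayleigh distribution with scale $\sigma$ has mean $\sigma\sqrt{\pi/2}$ and standard deviation $\sigma\sqrt{2-\pi/2}$. The sketch below compares these formulas with `scipy.stats.rayleigh`; the value `sigma = 2.0` is arbitrary and chosen only for illustration.

```python
import numpy as np
from scipy.stats import rayleigh

sigma = 2.0  # arbitrary scale parameter, for illustration only

# Known closed-form moments of the Rayleigh distribution.
mean_formula = sigma * np.sqrt(np.pi / 2)
std_formula = sigma * np.sqrt(2 - np.pi / 2)

dist = rayleigh(scale=sigma)  # loc defaults to 0
print(f"scale = {sigma}, mean = {dist.mean():.4f}, std = {dist.std():.4f}")
```

The standard deviation comes out at roughly 0.655 times the scale parameter, so the two should not be confused.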
For consistency we will denote the 1D vector of parameters: $\boldsymbol{\theta}=[\sigma]$
We will give a short motivation for preferring the Rayleigh distribution.
Motivation For Using Rayleigh Distribution
We started with the assumption that the duration of a taxi ride is normally distributed. Let us instead assume that the quantity which is normally distributed is the 2D distance vector, $\boldsymbol{D}$, between the pickup location and the drop-off location.
In other words, we are assuming that the random variable $\boldsymbol{D}$ is a 2D Gaussian vector. For simplicity, we will also assume that the $x$ and $y$ components of $\boldsymbol{D}$ are uncorrelated with equal variance and zero mean, i.e. we assume that:
$$\boldsymbol{D}\sim N\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}\sigma_D^2&0\\0&\sigma_D^2\end{bmatrix}\right)$$
In addition, let us also assume that the taxi's speed, $v$, is constant. Therefore the relation between the ride duration $T$ and the distance vector is:
$$T=\frac{\lVert\boldsymbol{D}\rVert_2}{v}$$
In this case $T$ will have a Rayleigh distribution with a scale parameter $\sigma=\frac{\sigma_D}{v}$.
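This argument can be checked with a small simulation: draw 2D Gaussian distance vectors, divide their magnitudes by a constant speed, and compare the empirical mean of the resulting durations with the mean of a Rayleigh distribution whose scale is the component standard deviation divided by the speed. The values of `sigma_d`, `v`, and the sample size are arbitrary assumptions for the sketch and are not taken from the taxi data.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma_d = 3.0  # assumed std of each distance component (illustrative)
v = 0.5        # assumed constant speed (illustrative)
n = 200_000    # number of simulated rides

# Zero-mean, uncorrelated, equal-variance 2D Gaussian distance vectors.
d = rng.normal(loc=0.0, scale=sigma_d, size=(n, 2))

# Ride duration = travelled distance / constant speed.
t = np.linalg.norm(d, axis=1) / v

# A Rayleigh variable with scale s has mean s * sqrt(pi / 2).
expected_scale = sigma_d / v
expected_mean = expected_scale * np.sqrt(np.pi / 2)
empirical_mean = t.mean()
print(f"empirical mean = {empirical_mean:.3f}, Rayleigh mean = {expected_mean:.3f}")
```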
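The model in this case will be:
$$p_\text{rayleigh}(t;\boldsymbol{\theta})=\frac{t}{\sigma^2}\exp\left(-\frac{t^2}{2\sigma^2}\right),\quad t\geq0$$
The negative log-likelihood function will be (with $t_i$ denoting the duration of the $i$-th ride and $N$ the number of rides):
$$l_\text{rayleigh}(\boldsymbol{\theta};\{t_i\})=-\sum_{i=1}^N\log p_\text{rayleigh}(t_i;\boldsymbol{\theta})=-\sum_{i=1}^N\log t_i+2N\log\sigma+\frac{1}{2\sigma^2}\sum_{i=1}^N t_i^2$$
Our optimization problem will now be:
$$\sigma^*=\underset{\sigma}{\arg\min}\ l_\text{rayleigh}(\boldsymbol{\theta};\{t_i\})$$
This optimization problem can be solved analytically by setting the derivative with respect to $\sigma$ to zero. The solution will be:
$$\sigma^*=\sqrt{\frac{1}{2N}\sum_{i=1}^N t_i^2}$$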
Which results in:
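As a sketch of how the Rayleigh MLE can be evaluated, the code below computes the closed-form estimate, the square root of the mean squared duration divided by 2, on the 10 durations from the table at the top of this section. It then cross-checks the result against `scipy.stats.rayleigh.fit` with the location fixed at 0. The 10 rides are only a stand-in for the full dataset, so the printed value is indicative at best.

```python
import numpy as np
from scipy.stats import rayleigh

# The `duration` column of the 10 rides shown in the table above.
durations = np.array([11.516667, 12.666667, 5.516667, 9.883333, 8.683333,
                      9.433333, 7.950000, 4.950000, 11.066667, 4.216667])

# Closed-form MLE of the Rayleigh scale parameter.
sigma_hat = np.sqrt((durations ** 2).sum() / (2 * len(durations)))

# Cross-check with SciPy; floc=0 keeps the distribution anchored at zero,
# so only the scale parameter is estimated.
loc_fit, scale_fit = rayleigh.fit(durations, floc=0)

print(f"closed form: {sigma_hat:.4f}, scipy fit: {scale_fit:.4f}")
```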
Judging by the similarity to the histogram, the Rayleigh distribution does a slightly better job at approximating the distribution and solves the negative values problem.
Let us try one more model.
💡 Method III: Generalized Gamma Distribution + MLE
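The Rayleigh distribution is a special case of a more general family of distributions called the Generalized Gamma distribution. The PDF of the Generalized Gamma distribution, written here in the parametrization used by SciPy's `gengamma` with zero location, is given by the following expression:
$$p_\text{gengamma}(z;\sigma,a,c)=\frac{c\,z^{ca-1}\exp\left(-\left(z/\sigma\right)^c\right)}{\sigma^{ca}\,\Gamma(a)},\quad z\geq0$$
(here $\Gamma$ is the gamma function)
This model has 3 parameters:
- $\sigma$ - The scale parameter.
- $a$ and $c$ - The two shape parameters.
For $c=2$ and $a=1$ we get the Rayleigh distribution (where $\sigma_{gamma}=\sqrt{2}\sigma_{rayleigh}$).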
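The special-case relation can be checked numerically. Assuming SciPy's parametrization of `gengamma` (shape parameters `a` and `c`, plus `scale`), setting `a = 1`, `c = 2`, and a scale of $\sqrt{2}$ times the Rayleigh scale should reproduce the Rayleigh PDF. The value of `sigma_rayleigh` and the evaluation grid are arbitrary choices for this sketch.

```python
import numpy as np
from scipy.stats import gengamma, rayleigh

sigma_rayleigh = 1.7              # arbitrary Rayleigh scale (illustrative)
z = np.linspace(0.01, 10.0, 200)  # evaluation grid

pdf_rayleigh = rayleigh.pdf(z, scale=sigma_rayleigh)

# Generalized Gamma with a=1, c=2 and scale = sqrt(2) * sigma_rayleigh.
pdf_gengamma = gengamma.pdf(z, 1.0, 2.0, scale=np.sqrt(2) * sigma_rayleigh)

max_abs_diff = np.max(np.abs(pdf_rayleigh - pdf_gengamma))
print(f"max |difference| between the two PDFs: {max_abs_diff:.2e}")
```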
Unlike the case of the normal and Rayleigh distributions, here we will not be able to find a simple analytic solution for the optimal MLE parameters. However, we can use numerical methods for finding the optimal parameters. In practice we will use SciPy's model for the Generalized Gamma distribution to find the optimal parameters. You will use a similar method in your homework assignments.
By using SciPy’s numerical solver we get the following result:
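A numerical MLE fit of this kind might look roughly like the sketch below, which uses `scipy.stats.gengamma.fit`. Two assumptions are made here that the text does not state: the location is fixed at 0 via `floc=0`, and the data are synthetic samples drawn from a Generalized Gamma distribution with made-up parameters, standing in for the ride durations.

```python
import numpy as np
from scipy.stats import gengamma

rng = np.random.default_rng(0)

# Synthetic stand-in for the ride durations (parameters are made up).
true_a, true_c, true_scale = 2.0, 1.5, 5.0
samples = gengamma.rvs(true_a, true_c, scale=true_scale,
                       size=2000, random_state=rng)

# Numerical MLE; floc=0 fixes the location so that only a, c and the
# scale are optimized. fit returns (a, c, loc, scale).
a_hat, c_hat, loc_hat, scale_hat = gengamma.fit(samples, floc=0)

# Negative log-likelihood at the fitted parameters.
nll_hat = -gengamma.logpdf(samples, a_hat, c_hat,
                           loc=loc_hat, scale=scale_hat).sum()

print(f"a = {a_hat:.3f}, c = {c_hat:.3f}, scale = {scale_hat:.3f}")
print(f"negative log-likelihood = {nll_hat:.1f}")
```

Different combinations of `a`, `c`, and the scale can give very similar PDFs, so the fitted parameters may differ noticeably from the ones used to generate the samples even when the fitted PDF matches the data well.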
The Generalized Gamma distribution results in a PDF whose shape is much closer to that of the histogram.