Background
Mathematical Notations
- Scalar: Non-bold lower case letter. Example
- Column vector: Bold lower case letter. Example
- Matrix: Upper case letter. Example
- A set of scalars/vectors: Curly brackets. Example or
- An element of a vector, a matrix or a set: Subscript index. Example or
- The value in an iterative process: Superscript index in parentheses. Example
Probability Notations
- A single random sample:
- A random variable (RV): (usually shorten to . Note that capital letters are also used for matrices)
- A realization of an RV:
- The probability of an event
- Probability Mass Function (PMF): (for a discrete RV)
- Cumulative Distribution Function (CDF):
- Probability Density Function (PDF): define by
Common notation to describe a dataset:
- - The number of samples in a dataset.
- The dimension of the data.
- - A vector of measured values related to a single sample.
- - The set of measurements in a dataset.
- - The matrix of measurements .
(Note that is also used as the a random variable generating ).
- - The -th element of the -th measurement (which is the -th element of the matrix ).
- - The set of labels in a dataset.
- - the vector of labels .
- - The -th label (which is the -th element of the vector ).
Optimization Problems
Optimization problems have the following form:
Where:
- is either a scalar or a vector.
- is some arbitrary function (usually referred to as the objective function)
- The s and s are optional additional equality and inequality constraints on .
The optimal solution of an optimization problem is commonly denoted by “”: for example:
Some method to solve optimization problems
With out constraints:
- Analytic solution
- Gradient descent / Stochastic gradient descent
With only equality constraints:
- Lagrangian multipliers
With general constraints + convex:
- Convex optimization methods
Lagrange Multipliers
An optimization problem with equality constraints can be solved by adding a set of variables , one for each constraint, and define the lagrangian:
If is a solution of the original optimization problem, then there exist some set of values s such that:
Therefore candidates for can be found by differentiating and solving the above equation.
Prediction Theory
The problem of coming up with a mapping which is optimal under some criteria, based on the joint distribution .
- Classification: is discrete.
- Regression: is continues.
The Common Criteria - Minimal Expected Loss
Minimizing the risk/cost function , which is defined as defined as the expectation value of some loss function :
If can be any function than:
If is a parametric function than:
Common Loss Function and Their Optimal Predictors
Loss Name | Risk Name | Loss Function | Optimal Predictor |
---|---|---|---|
Zero-One loss | Misclassification rate | ||
L1 | Mean absolute error |
Median: |
|
L2 | Mean squared error (MSE) |
Zero-one loss is usually used for classification problems and L1 & L2 for regression problems.
Distribution Estimation
Non-Parametric Estimation
Estimating PMF
For discrete RV the PMF can be estimated by:
The Empirical Cumulative Distribution Function (the ECDF)
The ECDF is an estimation of the CDF and is given by:
The ECDF results in a non-continuous CDF which is a sum of step functions.
Histogram
The histogram is an estimation of the PDF and is given by:
- , - The left and right edges of the ’s bin.
A common rule of thumb for selecting the bins is to divide the range of values into equal bins.
Kernel Density Estimation (KDE)
KDE is an estimation of the PDF and is given by:
- - the smoothing Parzen window.
Two commonly used Parzen windows are:
- A Gaussian:
- A rectangular function
A rule of thumb for selecting the bandwidth for the Gaussian window is:
Parametric Estimation
In parametric estimation we assume that has a known form up to some parameters . For example, a Gaussian with unknown mean and STD (in this case ).
Estimating the parameters as random variables (the Bayesian point of view)
Maximum A Posteriori (MAP) Estimation
We have an assumption on the a priori distribution of and we would like to find the most probable given the data in the dataset. I.e. we would like to solve the following optimization problem:
Estimating the parameter as fixed unknowns (the Frequentist point of view)
Maximum Likelihood Estimation (MLE)
Here we have no prior assumption on and we would like to find the most likely given the data.
- The likelyhood function is defined as:
- The log-likelyhood is define as
We would there for want to solve the following optimization problem:
Supervised Learning Principles
Making Predictions for an Unknown Distribution Using a Dataset. There are 3 approaches for using the dataset to solve the problem:
- The generative approach: Use the dataset to infer the joint distribution . Then use the joint distribution to find the optimal predictor . In classification problems we will usually decompose into
- The type I discriminative approach: Use the data to infer the conditional distribution . Then use the conditional distribution to find the optimal predictor .
- The type II discriminative approach: Use the data to directly find by using some direct method or by solving the optimization problem through replacing the risk’s expectation value with an empirical equivalent.
For example, empirical risk minimization (ERM):
Overfitting
When looking for the optimal function , or in the range of all function, we are most likely find a solution which is extremely optimal for the specific dataset but is far from optimal for any other dataset drawn from the same distribution. We would like to find a solution which generalizes well, i.e. will produce reasonable results for any dataset drawn from the data’s (usually unknown) distribution.
Ways to Reduce Overfitting
Restricting the Search Space
We can limit the space in which we look for a solution. The smaller space might not contain the optimal solution, but will hopefully contain a “good enough” approximation of it. One way to do so is by using a parametric family of functions.
Regularization
Add a penalty term to the optimization problem to encourage it to converge to a “simpler” solution. Common penalty terms are:
- Tikhonov Regularization: adding the norm of the parameters vector to the optimization. Encourages the algorithm to select a solution with smaller parameters.
- Regularization: adding the norm of the parameters vector to the optimization. Encourages the algorithm to select a solution with smaller and sparser set of parameters.
The Bias vs. Variance Trade Off
Since the data is randomly generated the resulting prediction function, which is calculated from the data is random as well.
- Bias: The difference between the mean (over different datasets) of the prediction and the optimal one.
- Variance: The variance of the prediction (over different datasets).
When pushing/limiting the algorithm to a specific subset of solutions, the result will usually by a worse approximation of the optimal solution, and will increase the bias of the solution. On the other hand, it will make the algorithm less sensitive to the specific value of each single data point and therefore will reduce the variance. This is known as the bias vs. variance trade off.
Pre-Processing and Using Features
When selecting a model we can always change the input to the model from the measurements themselves to any set of functions of the measurements, , called features.
The vector of measurements will be replace by the vector of features:
And the measurements matrix will be replaced by the features matrix:
An example: we take to be a linear function of the features and use the powers of up to the -th order as our features.
Normalizing
Replacing the measurements with a version of them with zero mean and STD of 1:
This is important when the data comes in arbitrary scale (or units) and the algorithm is sensitive to distances between points.
Dimensionality reduction
One way of making the model “simpler” is to use a smaller but more meaningful set of input features. One way of doing so is by using the PCA algorithm.
Principal Component Analysis (PCA)
A linear projection of the data on to the most significant direction in the data space.
The sample-covariance matrix:
( is a positive semi-definite (PSD) matrix)
The eigen-decomposition of :
Where is the unitary matrix with the eigen-vectors, , of on each column and is the a diagonal matrix with the eigen-values on the diagonal sorted from large to small.
The PCA projection is performed by projecting a measurements vector onto the first columns of (the eiven-vector which correspond to the largest eigen-values).
Parameters and Hyper-Parameters
Usually an algorithm will have some parameters which can be optimally selected using an analytic or numeric method, and some parameter which will have to be selected by running a grid search or a by trail and error. These parameters are refer to to as hyper-parameters.
Validation Set
Evaluation of the hyper-parameters is usually done on a validation set different then the train and the test sets.
Cross Validation
In stead of using a designated validation-set we can divide the train set into subsets and repeat the evaluation times, each time using a different subset as the validation set and average the results.
Supervised Learning Algorithms
Linear Least Squares (LLS, also known as ordinary least squares-OLS)
- Solves: Regression problems with MSE risk
- Approach: Discriminative type II
- Model:
- Parameters:
Optimization problem:
Has a closed form solution:
Ridge Regression - LLS with Tikhonov Regularization ()
Optimization problem:
- Hyper-Parameters:
Has a closed form solution:
- Very naive model. CAn be useful in very simple problems or when very good features are used.
Least Absolute Shrinkage and Selection Operator (LASSO) - LLS with Regularization
Optimization problem:
- Hyper-Parameters:
Has an efficient numerical solution.
K-Nearest Neighbors (K-NN)
- Solves: Classification and regression problems
- Approach: Discriminative type II
- Non-Parametric Model
- Hyper-Parameters:
finds the euclidean nearest neighbors to form the training set, and return the mean value for regression, or the majority vote value for classification of the the labels of the neighbors.
- Requires amount of data which is exponential in the dimension.
Linear Discriminant Analysis (LDA)
- Solves: Classification problems
- Approach: Generative
- Model:
- Parameters: and
Computing the model’s parameters:
For misclassification rate:
Which in the binary case reduces to:
Where:
- Produces good results when the data distribution of each class can be well approximated by a Gaussian (mainly when it is concentrated in and oval like area)
Quadric Discriminant Analysis (QDA)
Similar to LDA but with a different covariance matrix for each class.
- Solves: Classification problems
- Approach: Generative
- Model:
- Parameters: and
Computing the model’s parameters is similar to LDA but with:
For misclassification rate:
Which in the binary case reduces to:
Where:
Linear Logistic Regression
- Solves: Classification problems
- Approach: Discriminative type I
-
Model:
- Parameters: . Without loss of generality we can set
- Hyper-Parameters: the optimization solver parameters.
Optimization problem (MLE):
$$
- Solved using gradient decent
in the binary case reduces to:
For misclassification rate:
Multi Layer Perceptron (MLP)
- Solves:
- Regression - By using a discriminative type II approach
- Classification - By using a discriminative type I approach
- Model: A neural network of fully connected layers.
- Parameters: The weights of the fully connected layers
- Hyper-Parameters: The number of layers and the width of each layer + the optimization solver parameters.
-
Solved using gradient decent (using back-propagation).
- Usually has a lot of parameters and tends to overfit.
Convolutional Neural Network (CNN)
- Solves:
- Regression - By using a discriminative type II approach
- Classification - By using a discriminative type I approach
- Model: A neural network of convolutional layers and fully connected layers.
- Parameters: The weights of the layers
- Hyper-Parameters: The number of layers and the parameters of each layer.
-
Solved using gradient decent (using back-propagation).
- Works very well when the data has some spatial structure.
Support Vector Machine (SVM)
Hard SVM
- Solves: Binary Classification problems.
- Approach: Discriminative type II.
- Model:
- Parameters:
-
Solved using convex optimization methods.
- Works very well when the data is linearly separable.
Soft SVM
- Solves: Binary Classification problems.
- Approach: Discriminative type II.
- Model:
- Adds slack variables in the optimization process to deal with data which is not linearly separable.
- Parameters:
- Hyper-parameters: The weighting of the slack variables.
-
Solved using convex optimization methods.
- Works very well when the data is close to being linearly separable.
Decision Trees
- Solves: Regression or classification
- Discriminative type II
- Model a decision tree
- Parameters: the decision on brach branch
- Hyper-parameters: The number of branches / stopping criteria and the metric.
- Tree is constructed top down in a greedy approach by selecting at each branch the decision in the following way:
- For each possible decision i calculate the resulting sub-datasets .
- Evaluate the selected impurity metric on each sub dataset:
- Evaluate the weighted impurity of the decision as
- Select the decision with the lowest weighted impurity
Popular metrics
We define the empirical probability of class in a dataset as .
Some popular impurity metrics are:
- Misclassification:
- Gini:
- Entropy:
Other subjects which are missing out
- Perceptron
- Bagging
- Boosting
- K-Means
- Kernel methods