Course Topics Summary
Clarification: this is only a brief outline of most of the material taught in the course. The material for the exam includes all the topics which were taught (even if they do not appear in this summary).
Background
Mathematical Notations
- Scalar: $x$ (lowercase, italic)
- Column vector: $\boldsymbol{x}$ (lowercase, bold)
- Matrix: $\boldsymbol{X}$ (uppercase, bold)
- A set of scalars/vectors: $\{x_i\}$ or $\{\boldsymbol{x}_i\}$
- An element of a vector, a matrix or a set: $x_i$ or $X_{i,j}$
- The value in an iterative process: $x^{(t)}$ (the value at iteration $t$)
Probability Notations
- A single random sample: $\omega$
- A random variable (RV): $X\left(\omega\right)$ (usually shortened to $X$. Note that capital letters are also used for matrices)
- A realization of an RV: $x=X\left(\omega\right)$
- The probability of an event: $Pr\left(A\right)$
- Probability Mass Function (PMF): $p_X\left(x\right)=Pr\left(X=x\right)$ (for a discrete RV)
- Cumulative Distribution Function (CDF): $F_X\left(x\right)=Pr\left(X\leq x\right)$
- Probability Density Function (PDF): $f_X\left(x\right)$, defined by $F_X\left(x\right)=\int_{-\infty}^{x}f_X\left(u\right)du$
Common notation to describe a dataset:
- $N$ - The number of samples in a dataset.
- $d$ - The dimension of the data.
- $\boldsymbol{x}_i$ - A vector of measured values related to a single sample.
- $\{\boldsymbol{x}_i\}$ - The set of measurements in a dataset.
- $\boldsymbol{X}$ - The matrix of measurements $\left[\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots,\boldsymbol{x}_N\right]^T$.
- $x_{i,j}$ - The $j$-th element of the $i$-th measurement (which is the $(i,j)$-th element of the matrix $\boldsymbol{X}$).
- $\{y_i\}$ - The set of labels in a dataset.
- $\boldsymbol{y}$ - The vector of labels $\left[y_1,y_2,\ldots,y_N\right]^T$.
- $y_i$ - The $i$-th label (which is the $i$-th element of the vector $\boldsymbol{y}$).
Optimization Problems
$$
\boldsymbol{x}^* = \underset{\boldsymbol{x}}{\arg\min}\ f\left(\boldsymbol{x}\right) \quad \text{subject to} \quad g_i\left(\boldsymbol{x}\right)=0,\ \ h_j\left(\boldsymbol{x}\right)\leq 0
$$
Where:
- $\boldsymbol{x}$ is a scalar or a vector.
- $f\left(\boldsymbol{x}\right)$ is the objective function.
- The $g_i$s and $h_j$s are equality and inequality constraints.
Some methods for solving optimization problems
Without constraints:
- Analytic solution
- Gradient descent / Stochastic gradient descent
With only equality constraints:
- Lagrangian multipliers
With general constraints + convex:
- Convex optimization methods
Lagrange Multipliers
We add a set of variables $\{\lambda_i\}$ and define the Lagrangian:
$$
\mathcal{L}\left(\boldsymbol{x},\{\lambda_i\}\right) = f\left(\boldsymbol{x}\right) + \sum_i \lambda_i g_i\left(\boldsymbol{x}\right)
$$
If $\boldsymbol{x}^*$ is a solution of the original optimization problem, then there exists some set of values $\lambda_i$s such that:
$$
\nabla_{\boldsymbol{x}}\mathcal{L}\left(\boldsymbol{x}^*,\{\lambda_i\}\right)=0 \quad \text{and} \quad g_i\left(\boldsymbol{x}^*\right)=0\ \ \forall i
$$
Therefore candidates for $\boldsymbol{x}^*$ can be found by differentiating the Lagrangian and solving the above equations.
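As a quick illustration (a worked example added here, not taken from the course material): minimize $f\left(x_1,x_2\right)=x_1^2+x_2^2$ subject to the single equality constraint $g\left(x_1,x_2\right)=x_1+x_2-1=0$.
$$
\mathcal{L}=x_1^2+x_2^2+\lambda\left(x_1+x_2-1\right)
\quad\Rightarrow\quad
\begin{cases}
2x_1+\lambda=0\\
2x_2+\lambda=0\\
x_1+x_2-1=0
\end{cases}
\quad\Rightarrow\quad
x_1=x_2=\tfrac{1}{2},\ \lambda=-1
$$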
Prediction Theory
Finding a mapping $\hat{y}=h\left(\boldsymbol{x}\right)$ which is optimal under some criterion, based on the joint distribution $p\left(\boldsymbol{x},y\right)$.
- Classification: $y$ is discrete.
- Regression: $y$ is continuous.
The Common Criteria - Minimal Expected Loss
Minimizing a risk/cost function $R\left(h\right)=\mathbb{E}\left[\ell\left(h\left(\boldsymbol{x}\right),y\right)\right]$, where $\ell$ is a loss function:
If $h$ can be any function, then:
$$
h^*=\underset{h}{\arg\min}\ \mathbb{E}\left[\ell\left(h\left(\boldsymbol{x}\right),y\right)\right]
$$
If $h$ is a parametric function $h\left(\boldsymbol{x};\boldsymbol{\theta}\right)$, then:
$$
\boldsymbol{\theta}^*=\underset{\boldsymbol{\theta}}{\arg\min}\ \mathbb{E}\left[\ell\left(h\left(\boldsymbol{x};\boldsymbol{\theta}\right),y\right)\right]
$$
Common Loss Function and Their Optimal Predictors
| Loss Name | Risk Name | Loss Function | Optimal Predictor |
|---|---|---|---|
| Zero-one loss | Misclassification rate | $\ell\left(\hat{y},y\right)=I\left\{\hat{y}\neq y\right\}$ | $\hat{y}=\underset{y}{\arg\max}\ p\left(y\mid\boldsymbol{x}\right)$ |
| L1 | Mean absolute error (MAE) | $\ell\left(\hat{y},y\right)=\lvert\hat{y}-y\rvert$ | Median: $\hat{y}$ s.t. $Pr\left(y<\hat{y}\mid\boldsymbol{x}\right)=0.5$ |
| L2 | Mean squared error (MSE) | $\ell\left(\hat{y},y\right)=\left(\hat{y}-y\right)^2$ | $\hat{y}=\mathbb{E}\left[y\mid\boldsymbol{x}\right]$ |
Zero-one loss is usually used for classification problems and L1 & L2 for regression problems.
Distribution Estimation
Non-Parametric Estimation
Estimating the PMF
For a discrete RV, the PMF can be estimated by:
$$
\hat{p}_X\left(x\right)=\frac{1}{N}\sum_{i=1}^{N}I\left\{x_i=x\right\}
$$
The Empirical Cumulative Distribution Function (the ECDF)
The ECDF is an estimation of the CDF and is given by:
$$
\hat{F}_X\left(x\right)=\frac{1}{N}\sum_{i=1}^{N}I\left\{x_i\leq x\right\}
$$
The ECDF results in a non-continuous CDF which is a sum of step functions.
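A minimal NumPy sketch of evaluating the ECDF on a grid of points (the function name and the example values are illustrative, not from the course):

```python
import numpy as np

def ecdf(samples, x):
    """Empirical CDF: the fraction of samples which are smaller than or equal to each x."""
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.mean(samples[:, None] <= x[None, :], axis=0)

samples = np.array([1.2, 0.7, 3.1, 2.4, 0.9])
print(ecdf(samples, np.array([0.0, 1.0, 2.5, 4.0])))  # -> [0.  0.4 0.8 1. ]
```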
Histogram
The histogram is an estimation of the PDF and is given by:
$$
h_X\left(x\right)=\frac{n_k}{N\cdot\left(b_{k+1}-b_k\right)}\quad\text{for } b_k\leq x<b_{k+1}
$$
- $b_k$, $b_{k+1}$ - The left and right edges of the $k$-th bin.
- $n_k$ - The number of samples falling inside the $k$-th bin.
A common rule of thumb for selecting the bins is to divide the range of values into $\sqrt{N}$ equal bins.
Kernel Density Estimation (KDE)
KDE is an estimation of the PDF and is given by:
$$
\hat{f}_X\left(x\right)=\frac{1}{N}\sum_{i=1}^{N}\phi_h\left(x-x_i\right)
$$
- $\phi_h$ - the smoothing Parzen window with bandwidth $h$.
Two commonly used Parzen windows are:
- A Gaussian: $\phi_h\left(x\right)=\frac{1}{\sqrt{2\pi}h}e^{-\frac{x^2}{2h^2}}$
- A rectangular function: $\phi_h\left(x\right)=\frac{1}{h}I\left\{\lvert x\rvert\leq\frac{h}{2}\right\}$
A rule of thumb (Silverman's rule) for selecting the bandwidth for the Gaussian window is $h=1.06\,\hat{\sigma}N^{-1/5}$, where $\hat{\sigma}$ is the samples' STD. A sketch follows below.
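A minimal NumPy sketch of a Gaussian-window KDE with the rule-of-thumb bandwidth (the function name and the example are illustrative):

```python
import numpy as np

def kde_gaussian(samples, x_grid, h=None):
    """KDE with a Gaussian Parzen window of bandwidth h, evaluated on x_grid."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    if h is None:
        h = 1.06 * samples.std() * n ** (-1 / 5)   # rule-of-thumb bandwidth
    diffs = (x_grid[None, :] - samples[:, None]) / h
    windows = np.exp(-0.5 * diffs ** 2) / (np.sqrt(2 * np.pi) * h)
    return windows.mean(axis=0)                    # average a window centered at each sample

samples = np.random.randn(500)
x_grid = np.linspace(-4, 4, 9)
print(kde_gaussian(samples, x_grid))  # should roughly follow the standard normal PDF
```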
Parametric Estimation
We assume that $p\left(\boldsymbol{x}\right)$ has a known form up to some parameters $\boldsymbol{\theta}$.
For example, a Gaussian with an unknown mean $\mu$ and STD $\sigma$.
Estimating the parameters as random variables (the Bayesian point of view)
Maximum A Posteriori (MAP) Estimation
We assume $\boldsymbol{\theta}$ has an a priori distribution $p\left(\boldsymbol{\theta}\right)$.
We would like to find the most probable $\boldsymbol{\theta}$ given the data:
$$
\hat{\boldsymbol{\theta}}=\underset{\boldsymbol{\theta}}{\arg\max}\ p\left(\boldsymbol{\theta}\mid\left\{x_i\right\}\right)=\underset{\boldsymbol{\theta}}{\arg\max}\ p\left(\left\{x_i\right\}\mid\boldsymbol{\theta}\right)p\left(\boldsymbol{\theta}\right)
$$
Estimating the parameter as fixed unknowns (the Frequentist point of view)
Maximum Likelihood Estimation (MLE)
No prior assumption on $\boldsymbol{\theta}$.
We would like to find the $\boldsymbol{\theta}$ which makes the observed data most likely (see the numeric sketch below).
- The likelihood function: $\mathcal{L}\left(\boldsymbol{\theta};\left\{x_i\right\}\right)=p\left(\left\{x_i\right\};\boldsymbol{\theta}\right)$ (for i.i.d. samples, $\prod_i p\left(x_i;\boldsymbol{\theta}\right)$)
- The log-likelihood function: $l\left(\boldsymbol{\theta};\left\{x_i\right\}\right)=\log\mathcal{L}\left(\boldsymbol{\theta};\left\{x_i\right\}\right)$, and the estimator is $\hat{\boldsymbol{\theta}}=\underset{\boldsymbol{\theta}}{\arg\max}\ l\left(\boldsymbol{\theta};\left\{x_i\right\}\right)$
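A small numeric sketch (added here, not from the course) of the MLE for a Gaussian model with unknown mean and STD, for which maximizing the log-likelihood has a closed-form answer:

```python
import numpy as np

# Model: p(x; mu, sigma) is a Gaussian. Maximizing the log-likelihood
# sum_i log N(x_i; mu, sigma) over mu and sigma gives the sample mean
# and the (biased) sample STD.
samples = np.random.normal(loc=2.0, scale=0.5, size=1000)

mu_hat = samples.mean()
sigma_hat = np.sqrt(((samples - mu_hat) ** 2).mean())

print(mu_hat, sigma_hat)  # should be close to 2.0 and 0.5
```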
Supervised Learning Principles
3 approaches for using the dataset to solve the problem:
- The generative approach: dataset -> $\hat{p}\left(\boldsymbol{x},y\right)$ -> $h\left(\boldsymbol{x}\right)$.
- The type I discriminative approach: dataset -> $\hat{p}\left(y\mid\boldsymbol{x}\right)$ -> $h\left(\boldsymbol{x}\right)$.
- The type II discriminative approach: dataset -> $h\left(\boldsymbol{x}\right)$.
Overfitting
The optimal function $h^*$ (or the optimal parameters $\boldsymbol{\theta}^*$) over the dataset is different from the optimal one in general; therefore the optimal solution over the dataset will not necessarily generalize well to new data.
Ways to Reduce Overfitting
Restricting the Search Space
Limit the space in which we look for a solution and hope to find a “good enough” solution.
Usually done by using a parametric family of functions.
Regularization
Add a penalty term to the optimization problem to encourage it to converge to a “simpler” solution.
Common penalty terms are:
- Tikhonov regularization: an $L2$ norm penalty term, $\lambda\left\|\boldsymbol{\theta}\right\|_2^2$. Encourages solutions with smaller parameters.
- $L1$ regularization: an $L1$ norm penalty term, $\lambda\left\|\boldsymbol{\theta}\right\|_1$. Encourages solutions with smaller and sparser parameters.
Bagging
Not covered in this tutorial
The Bias vs. Variance Trade Off
The dataset is random, therefore the learned model is random and has a:
- Bias: The difference between the mean (over different datasets) of the prediction and the optimal one.
- Variance: The variance of the prediction (over different datasets).
By using regularization or limiting the model to a small space the bias will increase but the variance will usually decrease. This is known as the bias vs. variance trade off.
Pre-Processing and Using Features
We can replace the measurements with any set of functions of them: $\boldsymbol{x}\rightarrow\boldsymbol{\phi}\left(\boldsymbol{x}\right)$
The vector of measurements will be replaced by the vector of features: $\boldsymbol{\phi}_i=\boldsymbol{\phi}\left(\boldsymbol{x}_i\right)$
And the measurements matrix will be replaced by the features matrix: $\boldsymbol{\Phi}=\left[\boldsymbol{\phi}_1,\boldsymbol{\phi}_2,\ldots,\boldsymbol{\phi}_N\right]^T$
Normalizing
Replacing the measurements with a version of them with zero mean and STD of 1:
$$
\tilde{x}_{i,j}=\frac{x_{i,j}-\bar{x}_j}{\sigma_j}\quad\text{where}\quad\bar{x}_j=\frac{1}{N}\sum_i x_{i,j},\ \ \sigma_j=\sqrt{\frac{1}{N}\sum_i\left(x_{i,j}-\bar{x}_j\right)^2}
$$
Important when the data comes in arbitrary scale (or units).
Dimensionality reduction
One way of making the model "simpler" is to use a smaller number of more meaningful features. One way of doing so is by using the PCA algorithm.
Principal Component Analysis (PCA)
Finds the significant directions to project the data on (see the sketch below).
- The sample-covariance matrix: $\boldsymbol{C}=\frac{1}{N}\sum_i\left(\boldsymbol{x}_i-\bar{\boldsymbol{x}}\right)\left(\boldsymbol{x}_i-\bar{\boldsymbol{x}}\right)^T$ (a positive semi-definite (PSD) matrix)
- The eigen-decomposition of $\boldsymbol{C}$: $\boldsymbol{C}=\boldsymbol{V}\boldsymbol{\Lambda}\boldsymbol{V}^T$.
$\boldsymbol{V}$ has the eigenvectors, $\boldsymbol{v}_j$, of $\boldsymbol{C}$ on each column.
$\boldsymbol{\Lambda}$ is a diagonal matrix of the sorted eigenvalues.
- The PCA features: the projections onto the eigenvectors with the largest eigenvalues, $\phi_j=\boldsymbol{v}_j^T\left(\boldsymbol{x}-\bar{\boldsymbol{x}}\right)$.
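A minimal NumPy sketch of PCA through the eigen-decomposition of the sample covariance (the function name and the synthetic data are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (an N x d matrix) onto the k leading principal directions."""
    X_centered = X - X.mean(axis=0)
    C = X_centered.T @ X_centered / X.shape[0]   # sample covariance matrix (d x d), PSD
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # re-sort in descending order
    V = eigvecs[:, order[:k]]                    # leading eigenvectors as columns
    return X_centered @ V                        # the PCA features (N x k)

X = np.random.randn(200, 5) * np.array([3.0, 2.0, 1.0, 0.1, 0.1])
print(pca(X, 2).shape)  # (200, 2)
```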
Parameters and Hyper-Parameters
- Parameters: can be optimally selected using an analytic or numeric method.
- Hyper-parameters: need to be selected by running a grid search or by trial and error.
Validation Set
A dataset, different from the train and test sets, which is used to evaluate and select the best hyper-parameters.
Cross Validation
Instead of using a single validation set for selecting the hyper-parameters we can do the following (see the sketch below):
- Divide the train set into $K$ subsets (folds).
- For a given set of hyper-parameters, fit and evaluate the model $K$ times, using a different subset as the validation set each time.
- Evaluate the score of the model as the average of the $K$ evaluations.
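A minimal sketch of K-fold cross validation, assuming hypothetical `fit(X, y)` and `evaluate(model, X, y)` callables supplied by the user (these are placeholders, not an API from the course):

```python
import numpy as np

def k_fold_cv(X, y, fit, evaluate, k=5):
    """Average validation score over k folds. fit and evaluate are user-supplied callables."""
    indices = np.random.permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])                  # fit on the other k-1 folds
        scores.append(evaluate(model, X[val_idx], y[val_idx]))   # score on the held-out fold
    return np.mean(scores)
```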
Supervised Learning Algorithms
Linear Least Squares
(LLS, also known as ordinary least squares-OLS)
- Solves: Regression problems with MSE risk
- Approach: Discriminative type II
- Model: $h\left(\boldsymbol{x};\boldsymbol{\theta}\right)=\boldsymbol{\theta}^T\boldsymbol{x}$
- Parameters: $\boldsymbol{\theta}$
Optimization problem:
$$
\hat{\boldsymbol{\theta}}=\underset{\boldsymbol{\theta}}{\arg\min}\ \frac{1}{N}\left\|\boldsymbol{X}\boldsymbol{\theta}-\boldsymbol{y}\right\|_2^2
$$
Has a closed form solution: $\hat{\boldsymbol{\theta}}=\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}$
Ridge Regression - LLS with Tikhonov ($L2$) Regularization
Optimization problem:
$$
\hat{\boldsymbol{\theta}}=\underset{\boldsymbol{\theta}}{\arg\min}\ \frac{1}{N}\left\|\boldsymbol{X}\boldsymbol{\theta}-\boldsymbol{y}\right\|_2^2+\lambda\left\|\boldsymbol{\theta}\right\|_2^2
$$
- Hyper-parameters: $\lambda$
Has a closed form solution: $\hat{\boldsymbol{\theta}}=\left(\boldsymbol{X}^T\boldsymbol{X}+\lambda N\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}$
- A very naive model. Can be useful in very simple problems or when very good features are used. (A sketch of both closed-form solutions follows below.)
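A minimal NumPy sketch of the two closed-form solutions above (the synthetic data is illustrative):

```python
import numpy as np

def lls(X, y):
    """Ordinary least squares: theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Ridge regression: theta = (X^T X + N*lam*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
print(lls(X, y))             # should be close to [1, -2, 0.5]
print(ridge(X, y, lam=0.1))  # slightly shrunk towards zero
```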
Least Absolute Shrinkage and Selection Operator (LASSO) - LLS with $L1$ Regularization
Optimization problem:
$$
\hat{\boldsymbol{\theta}}=\underset{\boldsymbol{\theta}}{\arg\min}\ \frac{1}{N}\left\|\boldsymbol{X}\boldsymbol{\theta}-\boldsymbol{y}\right\|_2^2+\lambda\left\|\boldsymbol{\theta}\right\|_1
$$
- Hyper-parameters: $\lambda$
Has an efficient numerical solution.
K-Nearest Neighbors (K-NN)
- Solves: Classification and regression problems
- Approach: Discriminative type II
- Non-Parametric Model
- Hyper-parameters: $K$
Finds the $K$ Euclidean nearest neighbors from the training set, and returns the mean of the neighbors' labels for regression, or their majority vote for classification (see the sketch below).
- Requires an amount of data which is exponential in the dimension.
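A minimal NumPy sketch of a K-NN prediction for a single query point (the function name is illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5, classification=True):
    """Predict the label of x from its k Euclidean nearest neighbors in the training set."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = y_train[np.argsort(dists)[:k]]
    if classification:
        values, counts = np.unique(neighbors, return_counts=True)
        return values[np.argmax(counts)]   # majority vote
    return neighbors.mean()                # mean value for regression
```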
Linear Discriminant Analysis (LDA)
- Solves: Classification problems
- Approach: Generative
- Model: $p\left(\boldsymbol{x}\mid y=k\right)=\frac{1}{\sqrt{\left(2\pi\right)^d\left|\boldsymbol{\Sigma}\right|}}e^{-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_k\right)}$ (a Gaussian per class with a shared covariance matrix)
- Parameters: $p\left(y=k\right)$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$
Computing the model's parameters (where $N_k$ is the number of samples with label $k$):
$$
\hat{p}\left(y=k\right)=\frac{N_k}{N},\quad\hat{\boldsymbol{\mu}}_k=\frac{1}{N_k}\sum_{i:y_i=k}\boldsymbol{x}_i,\quad\hat{\boldsymbol{\Sigma}}=\frac{1}{N}\sum_i\left(\boldsymbol{x}_i-\hat{\boldsymbol{\mu}}_{y_i}\right)\left(\boldsymbol{x}_i-\hat{\boldsymbol{\mu}}_{y_i}\right)^T
$$
For misclassification rate:
$$
\hat{y}=\underset{k}{\arg\max}\ p\left(\boldsymbol{x}\mid y=k\right)p\left(y=k\right)
$$
Which in the binary case reduces to the linear rule $\hat{y}=1$ if $\boldsymbol{a}^T\boldsymbol{x}+b>0$ (and $0$ otherwise),
Where:
$$
\boldsymbol{a}=\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0\right),\quad b=\frac{1}{2}\left(\boldsymbol{\mu}_0^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_0-\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1\right)+\log\frac{p\left(y=1\right)}{p\left(y=0\right)}
$$
- Produces good results when the classes are distributed in localized groups. (A sketch of the parameter estimation follows below.)
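A minimal NumPy sketch of estimating the LDA parameters and predicting with the arg-max rule (the function names are illustrative):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate the class priors, the class means and the shared covariance matrix."""
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    centered = X - np.stack([means[k] for k in y])   # subtract each sample's class mean
    cov = centered.T @ centered / len(y)             # shared covariance matrix
    return priors, means, cov

def lda_predict(X, priors, means, cov):
    """Pick the class maximizing log p(x|y=k) + log p(y=k)."""
    cov_inv = np.linalg.inv(cov)
    classes = list(priors)
    scores = np.stack([
        -0.5 * np.sum((X - means[k]) @ cov_inv * (X - means[k]), axis=1) + np.log(priors[k])
        for k in classes
    ], axis=1)
    return np.array(classes)[np.argmax(scores, axis=1)]
```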
Quadratic Discriminant Analysis (QDA)
Similar to LDA but with a different covariance matrix for each class.
- Solves: Classification problems
- Approach: Generative
- Model: $p\left(\boldsymbol{x}\mid y=k\right)=\frac{1}{\sqrt{\left(2\pi\right)^d\left|\boldsymbol{\Sigma}_k\right|}}e^{-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}_k^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_k\right)}$ (a Gaussian per class, each with its own covariance matrix)
- Parameters: $p\left(y=k\right)$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$
Computing the model's parameters is similar to LDA but with:
$$
\hat{\boldsymbol{\Sigma}}_k=\frac{1}{N_k}\sum_{i:y_i=k}\left(\boldsymbol{x}_i-\hat{\boldsymbol{\mu}}_k\right)\left(\boldsymbol{x}_i-\hat{\boldsymbol{\mu}}_k\right)^T
$$
For misclassification rate:
$$
\hat{y}=\underset{k}{\arg\max}\ p\left(\boldsymbol{x}\mid y=k\right)p\left(y=k\right)
$$
Which in the binary case reduces to the quadratic rule $\hat{y}=1$ if $\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x}+\boldsymbol{a}^T\boldsymbol{x}+b>0$ (and $0$ otherwise),
Where:
$$
\boldsymbol{A}=\frac{1}{2}\left(\boldsymbol{\Sigma}_0^{-1}-\boldsymbol{\Sigma}_1^{-1}\right),\quad\boldsymbol{a}=\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1-\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0,\quad b=\frac{1}{2}\left(\boldsymbol{\mu}_0^T\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0-\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1\right)+\log\frac{p\left(y=1\right)\sqrt{\left|\boldsymbol{\Sigma}_0\right|}}{p\left(y=0\right)\sqrt{\left|\boldsymbol{\Sigma}_1\right|}}
$$
Linear Logistic Regression
- Solves: Classification problems
- Approach: Discriminative type I
- Model: $p\left(y=k\mid\boldsymbol{x};\left\{\boldsymbol{\theta}_k\right\}\right)=\frac{e^{\boldsymbol{\theta}_k^T\boldsymbol{x}}}{\sum_{k'}e^{\boldsymbol{\theta}_{k'}^T\boldsymbol{x}}}$ (the softmax function)
- Parameters: the $\boldsymbol{\theta}_k$s. Without loss of generality we can set $\boldsymbol{\theta}_0=\boldsymbol{0}$.
- Hyper-parameters: the optimization solver parameters.
Optimization problem (MLE):
$$
\left\{\hat{\boldsymbol{\theta}}_k\right\}=\underset{\left\{\boldsymbol{\theta}_k\right\}}{\arg\min}\ -\sum_{i}\log p\left(y_i\mid\boldsymbol{x}_i;\left\{\boldsymbol{\theta}_k\right\}\right)
$$
Solved using gradient descent.
In the binary case the model reduces to: $p\left(y=1\mid\boldsymbol{x};\boldsymbol{\theta}\right)=\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}=\sigma\left(\boldsymbol{\theta}^T\boldsymbol{x}\right)$
For misclassification rate: $\hat{y}=\underset{k}{\arg\max}\ p\left(y=k\mid\boldsymbol{x}\right)$ (in the binary case, $\hat{y}=1$ when $\sigma\left(\boldsymbol{\theta}^T\boldsymbol{x}\right)>0.5$). A gradient-descent sketch for the binary case follows below.
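A minimal NumPy sketch of binary logistic regression fitted with gradient descent on the negative log-likelihood (the synthetic data and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Fit theta for p(y=1|x) = sigmoid(theta^T x) by gradient descent, with y in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of the mean negative log-likelihood
        theta -= lr * grad
    return theta

X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)
theta = logistic_regression_gd(X, y)
print(np.mean((sigmoid(X @ theta) > 0.5) == y))  # training accuracy, should be close to 1
```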
Multi Layer Perceptron (MLP)
- Solves:
- Regression - By using a discriminative type II approach
- Classification - By using a discriminative type I approach
- Model: A neural network of fully connected layers.
- Parameters: The weights of the fully connected layers
- Hyper-Parameters: The number of layers and the width of each layer + the optimization solver parameters.
- Solved using gradient descent (using back-propagation).
- Usually has a lot of parameters and tends to overfit.
Convolutional Neural Network (CNN)
- Solves:
- Regression - By using a discriminative type II approach
- Classification - By using a discriminative type I approach
- Model: A neural network of convolutional layers and fully connected layers.
- Parameters: The weights of the layers
- Hyper-Parameters: The number of layers and the parameters of each layer.
- Solved using gradient descent (using back-propagation).
Works very well when the data has some spatial structure.
Support Vector Machine (SVM)
Hard SVM
- Solves: Binary Classification problems.
- Approach: Discriminative type II.
- Model: $h\left(\boldsymbol{x}\right)=\text{sign}\left(\boldsymbol{w}^T\boldsymbol{x}+b\right)$, where the margin is maximized subject to all samples being classified correctly.
- Parameters: $\boldsymbol{w}$, $b$
- Solved using convex optimization methods.
- Works very well when the data is linearly separable.
Soft SVM
- Solves: Binary Classification problems.
- Approach: Discriminative type II.
- Model: $h\left(\boldsymbol{x}\right)=\text{sign}\left(\boldsymbol{w}^T\boldsymbol{x}+b\right)$
- Adds slack variables in the optimization process to deal with data which is not linearly separable.
- Parameters: $\boldsymbol{w}$, $b$ (and the slack variables $\xi_i$)
- Hyper-parameters: the weighting of the slack variables.
- Solved using convex optimization methods.
Works very well when the data is close to being linearly separable.
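A minimal NumPy sketch of soft SVM written in its equivalent unconstrained hinge-loss form and solved by sub-gradient descent; this is a didactic sketch, not the constrained QP formulation usually handed to a convex-optimization solver (labels are assumed to be in {-1, +1}, and the constant c weights the slack/hinge term):

```python
import numpy as np

def soft_svm_subgradient(X, y, c=1.0, lr=0.01, n_iters=2000):
    """Minimize 0.5*||w||^2 + c * sum_i max(0, 1 - y_i*(w^T x_i + b)) by sub-gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        violators = margins < 1                            # samples inside the margin or misclassified
        w -= lr * (w - c * (y[violators] @ X[violators]))  # sub-gradient w.r.t. w
        b -= lr * (-c * y[violators].sum())                # sub-gradient w.r.t. b
    return w, b  # predict with sign(w^T x + b)
```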
Decision Trees
- Solves: Regression or classification
- Approach: Discriminative type II
- Model: a decision tree
- Parameters: the decision at each branch.
- Hyper-parameters: The number of branches / stopping criteria and the metric.
Construction is done using a top-down greedy approach over the branches:
- For each possible decision, calculate the resulting sub-datasets $\left\{\mathcal{D}_q\right\}$.
- Evaluate the selected impurity metric $Q_q$ on each sub-dataset.
- Evaluate the weighted impurity of the decision as $\sum_q\frac{\left|\mathcal{D}_q\right|}{\left|\mathcal{D}\right|}Q_q$.
- Select the decision with the lowest weighted impurity.
Popular metrics
We define the empirical probability of class $k$ in a dataset as $p_k$ (the fraction of samples with label $k$).
Some popular impurity metrics are (see the sketch after this list):
- Misclassification: $Q=1-\underset{k}{\max}\ p_k$
- Gini: $Q=\sum_k p_k\left(1-p_k\right)$
- Entropy: $Q=-\sum_k p_k\log\left(p_k\right)$
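A minimal NumPy sketch of the three impurity metrics computed from a vector of labels (here entropy uses log base 2; a different base only rescales it):

```python
import numpy as np

def impurity(labels, metric="gini"):
    """Impurity of a (sub-)dataset, computed from the empirical class probabilities p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    if metric == "misclassification":
        return 1.0 - p.max()
    if metric == "gini":
        return np.sum(p * (1.0 - p))
    if metric == "entropy":
        return -np.sum(p * np.log2(p))
    raise ValueError(f"unknown metric: {metric}")

print(impurity(np.array([0, 0, 1, 1]), "entropy"))  # 1.0 - maximal for two balanced classes
```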
Other subjects which were left out of this summary
- Perceptron
- Bagging
- Boosting
- K-Means
- Kernel methods