These workshops accompany the lectures and recitations by taking another dive into the subjects of this course, but from a more "hands-on" perspective. The purpose of the workshops is to achieve the following goals:
The hands-on ("wet") part of the homework assignments will take the same form as the implementation sections of the workshops.
In many cases, the code and algorithms in these workshops are written in a very inefficient way. We have preferred to keep the code readable and as close as possible to the algorithms described in class. In almost all cases, there already exist packages implementing these algorithms efficiently and robustly; where possible, we have tried to give references to these implementations.
In this workshop, we will mainly discuss the process which we will be using for solving the problem. The rest of the workshops will mainly discuss the different approaches and methods which we can use when applying this process.
Here are some motivational points for why we decided to teach using Python, and why you should learn it:
We will start by jumping directly into the water and applying a simple solution to a popular introductory problem.
At this point, we will do it in a very sloppy way, and the solution we will get will be far from ideal. The goal here is only to build the overall idea of the type of problems we will be trying to solve and the general process we are going to apply for solving them.
We will later come back to this exact problem again and solve it using some more advanced tools while defining everything in a much more formal way.
In this problem, we would like to predict whether or not a given passenger survived the Titanic tragedy based only on the passenger's details in the manifest, which contains some cold facts such as age, sex, ticket class, etc.
To be able to come up with a good prediction method, we are given a portion of the passenger manifest along with the data of whether or not each of these passengers survived. The goal is to provide a good prediction of the survival chances of passengers outside this list.
Here are some reasons to believe that we might be able to improve our guess based on these cold facts:
Throughout the workshops, we will try to build and follow a certain workflow which we will apply to the problems we come to solve. For now, we will only give a general description of the process, and as we advance we will gain a better understanding and obtain more tools for applying in each of the steps. We will start with the following diagram and make some small adaptations to it as we advance in the course:

In this course, we will mostly focus on the two steps of suggesting a model and a learning method and applying them.
Let us try to understand this workflow by applying it to the problem at hand. For now, we will only give a very general description of each step; in a future workshop, we will provide a more in-depth description of each of the steps.
We will start by importing some useful packages
## Importing packages
import numpy as np # Numerical package (mainly multi-dimensional arrays and linear algebra)
import pandas as pd # A package for working with data frames
import matplotlib.pyplot as plt # A plotting package
%matplotlib inline
## A function to add Latex (equations) to output which also works in Google Colaboratory
## In a regular notebook this could simply be replaced with "display(Markdown(x))"
from IPython.display import HTML
def print_math(x): # Define a function to preview markdown outputs as HTML using mathjax
display(HTML(''.join(['<p><script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [[\'$\',\'$\'], [\'\\\\(\',\'\\\\)\']]}});</script><script src=\'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS_CHTML\'></script>',x,'</p>'])))
In our problem, we will only work with pre-collected datasets, and won't discuss this step at all. It is, however, vital to understand that this step is an integral part of the process and in many cases will be evolving over time to better fit the needs of the system.
Let us take a look at the dataset. In many cases, such as this one, it is convenient to store the data as a table (or a matrix) where:
We will start by loading the data and taking a look at it by printing out the first 10 rows.
data_file = 'https://technion046195.github.io/semester_2019_spring/datasets/titanic_manifest.csv'
## Loading the data
dataset = pd.read_csv(data_file)
## Print the number of rows in the data set
number_of_rows = len(dataset)
print_math('Number of rows in the dataset: $N={}$'.format(number_of_rows))
## Show the first 10 rows
dataset.head(10)
For simplicity, in this exercise we will limit ourselves to using only these three fields:
(A full description for each of the other columns can be found here)
In this step, we will define the process which generates the data, usually in a probabilistic manner. For now, we will choose to describe the process as some black box which randomly spits out pairs of passenger parameters and an indicator of whether or not this passenger survived. An important assumption we make about the process is that different outcomes are statistically independent.
This process of randomly generating passengers and survival outcomes can be thought of as the random chain of events which led certain passengers to board the Titanic and the events which eventually determined whether or not they survived.

We assume that the dataset was generated by applying this process $N$ times, with $N$ being the size of the dataset.
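As a toy illustration of this assumption, we can sketch the black box as a function which draws independent $\left(\boldsymbol{x}, y\right)$ pairs. The probabilities below are made up purely for demonstration; they are not estimated from the actual manifest.

```python
import numpy as np

def draw_passenger(rand_gen):
    ## Draw the passenger's parameters: gender (0/1) and class (1/2/3)
    ## (the probabilities here are arbitrary, for illustration only)
    gender = rand_gen.choice([0, 1], p=[0.65, 0.35])
    pclass = rand_gen.choice([1, 2, 3], p=[0.25, 0.2, 0.55])
    ## Draw the survival indicator; its distribution may depend on x
    survived = int(rand_gen.rand() < (0.2 + 0.4 * gender))
    return (gender, pclass), survived

rand_gen = np.random.RandomState(0)
## Applying the process N times generates a dataset of N independent samples
dataset_sketch = [draw_passenger(rand_gen) for _ in range(5)]
```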
As stated before, in this problem we would like to come up with a method for guessing, or predicting, whether or not a passenger on board the Titanic survived the sinking based on their properties.
i.e., we are looking for a function which maps from the input space of gender and class, $\boldsymbol{x}$, into the binary space of the survival indicator, $y$:
$$ \hat{y}=h\left(\boldsymbol{x}\right) $$where we have defined $\hat{y}$ as our prediction for the input $\boldsymbol{x}$.
We will later in this course define this type of problem as a binary classification problem.
We still have not formally defined what we mean by making a good prediction. In order to be able to pick the best prediction function, we must first define a way to evaluate different functions.
We would usually want to assign a numeric score to how badly a prediction function performs and then strive to pick the function with the lowest score. We will call this function the risk function.
In this case, we will use a risk function called the misclassification rate. The misclassification rate is defined as the fraction of prediction errors a function makes on the data. Denoting:
The risk would be:
$$ R\left\{h, \left\{\boldsymbol{x},y\right\}\right\}=\frac{1}{N}\sum_i I\left\{h\left(\boldsymbol{x}_i\right)\neq y_i\right\} $$A comment about naming: In many places, this function appears under different names. Other common names for this function are the cost function, the error function or the loss function.
Although the name loss function is very commonly used, especially in deep learning, in our course, we will stick to the name risk and have a different definition for the term of a loss function.
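As a concrete numerical example (with made-up labels and predictions), the misclassification rate simply counts the fraction of indices on which the prediction differs from the true label:

```python
import numpy as np

y = np.array([0, 1, 1, 0, 1])      # true labels (made up for illustration)
y_hat = np.array([0, 1, 0, 0, 0])  # predictions of some function h

## The empirical risk: the fraction of samples on which the prediction is wrong
risk = np.mean(y_hat != y)
print(risk)  # 2 mistakes out of 5 -> 0.4
```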
We will be using our dataset for two different tasks:
This is a bit problematic since, in general, this will result in an optimistic estimate. The fact that our solution performs well on the given data does not necessarily mean that it will perform well on any new data.
Let us suppose that the names of the passengers are part of the input data. In this case, we could propose a prediction method which memorizes the list of surviving passengers and makes a prediction based on that list. This method will perform great on the given data, but it would fail for any new data.
A simple solution is to leave a portion of the dataset for the use of evaluation only. These two portions of the dataset are usually referred to as the train set and the test set.
A common practice is to use an 80% train / 20% test split. In most cases, it will be important to split the data randomly, independently of the order of the samples in the original dataset.
Let us prepare our dataset and split it into a train set and a test set.
## Preparing the data set
## Constructing x_{i,j} and y_i. Here i runs over the passengers and j runs over [gender, class]
x = dataset[['numeric_sex', 'pclass']]
y = dataset['survived']
n_samples = len(x)
## Generate a random generator with a fixed seed (this is important to make our result reproducible)
rand_gen = np.random.RandomState(0)
## Generating a vector of indices
indices = np.arange(n_samples)
## Shuffle the indices
rand_gen.shuffle(indices)
## Split the indices into 80% train / 20% test
n_samples_train = int(n_samples * 0.8)
n_samples_test = n_samples - n_samples_train
train_indices = indices[:n_samples_train]
test_indices = indices[n_samples_train:]
## Split the data
x_train = x.iloc[train_indices]
x_test = x.iloc[test_indices]
y_train = y.iloc[train_indices]
y_test = y.iloc[test_indices]
## We could have directly shuffled and split the data, but in this way, we are still left with the original data and the indices which were used for the split which could be useful, especially for debugging.
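As noted in the introduction, robust implementations of such routine steps already exist in standard packages. For example, a random 80%/20% split with a fixed seed can also be produced with scikit-learn's `train_test_split`; the sketch below demonstrates it on toy arrays rather than on the actual dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

## Toy data standing in for the dataset's x and y
x_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

## An 80%/20% random split with a fixed seed for reproducibility
x_train_toy, x_test_toy, y_train_toy, y_test_toy = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=0)
```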
In this step, we would like to suggest a family of candidate solutions for our problem. We would then search the space of suggested solutions to select the best solution according to our evaluation method, i.e., the solution which produces the minimal risk. We should note that in many cases finding the solution with the lowest risk is not feasible, and we will settle for the solution with the lowest risk we can find.
These solution families will usually, but not always, be defined by a set of parameters. In such cases, we will use $\theta$ to denote these parameters.
We refer to this family of solutions as the model.
There are many considerations which come into play when selecting a model. We will point out three of them:
In our example, we will suggest a model for the prediction function $h\left(\boldsymbol{x}\right)$. In fact, in this simple example, since $h\left(\boldsymbol{x}\right)$ has only 6 possible inputs (2 genders × 3 classes), we can use 6 binary parameters to define the whole set of possible functions $h\left(\boldsymbol{x}\right)$.
$$ h_\boldsymbol{\theta}\left(\boldsymbol{x}\right)=\left\{ \begin{array}{ll} \theta_{0, 1} & \boldsymbol{x}=\left(0, 1\right) \\ \theta_{0, 2} & \boldsymbol{x}=\left(0, 2\right) \\ \theta_{0, 3} & \boldsymbol{x}=\left(0, 3\right) \\ \theta_{1, 1} & \boldsymbol{x}=\left(1, 1\right) \\ \theta_{1, 2} & \boldsymbol{x}=\left(1, 2\right) \\ \theta_{1, 3} & \boldsymbol{x}=\left(1, 3\right) \\ \end{array} \right. $$reminder: $\boldsymbol{x}=\left(\text{gender}, \text{class}\right)$
As a more visually appealing way, we can also write it in the form of a table:
| Sex \ Class | 1st class | 2nd Class | 3rd class |
|---|---|---|---|
| Male (0) | $\theta_{0,1}$ | $\theta_{0,2}$ | $\theta_{0,3}$ |
| Female (1) | $\theta_{1,1}$ | $\theta_{1,2}$ | $\theta_{1,3}$ |
Only for demonstrating the process, we will start with an even more simplified model for $h_\boldsymbol{\theta}\left(\boldsymbol{x}\right)$. We will start with the family of constant functions, i.e., $h_\theta\left(\boldsymbol{x}\right)=\theta$
| Sex \ Class | 1st class | 2nd Class | 3rd class |
|---|---|---|---|
| Male (0) | $\theta$ | $\theta$ | $\theta$ |
| Female (1) | $\theta$ | $\theta$ | $\theta$ |
Our method for finding the constant function, or equivalently the $\theta$, which produces the lowest risk will be to test all possible options. Since we are talking about a binary prediction function, there are only 2 options: $h_{\theta=0}\left(\boldsymbol{x}\right)=0$ and $h_{\theta=1}\left(\boldsymbol{x}\right)=1$.
Here we would usually apply some fancy method for selecting the best candidate; in this case, we simply need to evaluate the risk for the two options $\theta=0$ and $\theta=1$ and pick the one which produces the lower result.
Formally we would like to find the optimal value of $\theta$, which we will denote as $\theta^*$, for which:
$$ \theta^* =\underset{\theta\in\left\{0,1\right\}}{\arg\min}\ R\left\{h_{\theta}, \left\{\boldsymbol{x} ,y\right\}\right\} =\underset{\theta\in\left\{0,1\right\}}{\arg\min}\ \frac{1}{N}\sum_i I\left\{\theta\neq y_i\right\} $$Let us calculate the risk for each $\theta$ (note that we will only be using the train set for this task) :
## Loop over the two possible theta
print('The train risk for each predictor is:')
for theta in [0, 1]:
    ## Calculate the fraction of wrong predictions for this theta
    predictions = theta
    train_risk = (y_train.values != predictions).mean()
    print_math('- $R_\\text{{train}}\\{{ h_{{ \\theta={} }} \\}}={:.2}$'.format(theta, train_risk))
In this case, constantly predicting zero performs slightly better than constantly predicting one.
This is due to the fact that the majority of passengers did not survive the sinking; therefore, without knowing any details about a passenger, we have a better chance predicting that they did not survive.
Our proposed prediction function would be: $$ h\left(\boldsymbol{x}\right)=0\quad\forall\boldsymbol{x} $$
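This majority-label argument can be checked directly: for a constant predictor, the two risks always sum to one, so predicting the majority label gives the lower risk. A toy check with a made-up label vector:

```python
import numpy as np

y = np.array([0, 0, 0, 1, 0, 1, 0])  # made-up labels; the majority label is 0

risk_0 = np.mean(y != 0)  # risk of constantly predicting 0
risk_1 = np.mean(y != 1)  # risk of constantly predicting 1

## Since the majority of the labels are 0, risk_0 is the lower of the two,
## and the two risks sum to 1
```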
As stated before, we would like to evaluate the risk on the test set. Let us calculate the risk using the test set.
## The evaluation of the final risk
predictions = 0
test_risk = (y_test.values != predictions).mean()
print_math('The test risk is: $R_\\text{{test}}\\{{ h_{{ \\theta=0 }} \\}}={:.2}$'.format(test_risk))
We would now want to suggest a way to improve our learning method.
In this case, the obvious way to get improvement is by replacing the naive model which we have used. Let us go back to the model suggestion stage.
Let us now return to the full model for $h_\boldsymbol{\theta}\left(\boldsymbol{x}\right)$, which covers all $2^6$ possible combinations for selecting $\boldsymbol{\theta}=\left(\theta_{0,1},\ldots,\theta_{1,3}\right)^T$.
It can be shown that in this case we can find the optimal selection of the $\theta_{m,n}$'s by looking at each one of them individually and minimizing the risk on the group of passengers with $\boldsymbol{x}=\left(m,n\right)$. I.e.,
$$ \theta_{m,n}^* =\underset{\theta_{m,n}\in\left\{0,1\right\}}{\arg\min}\ R\left\{h_{\boldsymbol{\theta}}, \left\{\boldsymbol{x}_i ,y_i:\boldsymbol{x}_i=\left(m,n\right)\right\}\right\} =\underset{\theta_{m,n}\in\left\{0,1\right\}}{\arg\min}\ \frac{1}{N_{m,n}}\sum_{i,\boldsymbol{x}_i=\left(m,n\right)} I\left\{\theta_{m,n}\neq y_i\right\} $$with $m\in\left\{0,1\right\}$ (the gender) and $n\in\left\{1,2,3\right\}$ (the class)
print('The train risk for each group is:')
## Loop over the gender
for gender in [0, 1]:
    ## Loop over the class
    for class_ in [1, 2, 3]:  # we use "class_" since the word "class" is reserved in Python
        print('')  # An empty line
        print_math('## For $\\{{\\boldsymbol{{x}}_i,y_i:\\boldsymbol{{x}}_i=({},{}) \\}}$'.format(gender, class_))
        ## Loop over the two possible theta
        for theta in [0, 1]:
            ## Calculate the fraction of wrong predictions for this theta
            predictions = theta
            indices = (x_train['numeric_sex'].values == gender) & (x_train['pclass'].values == class_)
            train_risk = (y_train.values[indices] != predictions).mean()
            print_math('-- $\\theta_{{ {},{} }}={} \\Rightarrow R_{{\\text{{train}}}}\\{{h_{{ \\boldsymbol{{\\theta}} }}\\}}={:.2f}$'.format(gender, class_, theta, train_risk))
Therefore our optimal parameter vector $\boldsymbol{\theta}^*$ will be constructed by choosing for each $\theta_{m,n}$ the value which minimizes the risk:
| Sex \ Class | 1st class | 2nd Class | 3rd class |
|---|---|---|---|
| Male (0) | 0 | 0 | 0 |
| Female (1) | 1 | 1 | 0 |
Let us calculate the risk of this prediction function:
## The optimal predictor
## We will define a prediction function which receives a row of the
## dataset as an input and outputs a prediction
def row_predictor(row):
    gender = row['numeric_sex']
    class_ = row['pclass']

    prediction_map = {
        (0, 1): 0,
        (0, 2): 0,
        (0, 3): 0,
        (1, 1): 1,
        (1, 2): 1,
        (1, 3): 0,
    }

    prediction = prediction_map[(gender, class_)]
    return prediction

## Applying the prediction function to every row in the table
predictions = x_test.apply(row_predictor, axis='columns')
## The evaluation of the final risk
test_risk = (y_test.values != predictions).mean()
print_math('The test risk is: $R_\\text{{test}}\\{{ h_{{ \\boldsymbol{{\\theta}}^* }} \\}}={:.2f}$'.format(test_risk))
This means that by using this predictor we will be able to give a correct prediction about 77% of the time.
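As a side note, the same table-lookup predictor can also be written in a vectorized form instead of the row-wise `apply`, for example by merging the inputs with a small parameter table. The sketch below uses toy inputs standing in for `x_test`, with the same $\theta^*$ values as above.

```python
import pandas as pd

## The optimal theta values arranged as a small lookup table
theta_table = pd.DataFrame({
    'numeric_sex': [0, 0, 0, 1, 1, 1],
    'pclass':      [1, 2, 3, 1, 2, 3],
    'prediction':  [0, 0, 0, 1, 1, 0],
})

## Toy inputs standing in for x_test
x_toy = pd.DataFrame({'numeric_sex': [0, 1, 1], 'pclass': [3, 1, 3]})

## A left merge performs the table lookup for all rows at once
predictions = x_toy.merge(theta_table, on=['numeric_sex', 'pclass'],
                          how='left')['prediction']
```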
We have seen how we can use a dataset to build a model which can now be used to make predictions on data we have not yet seen. We did so by following the general workflow described above.
In almost all of the problems which we will solve in these workshops, we will follow this workflow. The parts which will change from one problem/solution to another are:
Icons in these notebooks were made by:
%%html
<link rel="stylesheet" href="../css/style.css"> <!--Setting styles - You can simply ignore this line-->