import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

n_ind = 250
data = dict(doses=range(1, 7), deaths=[28, 53, 93, 126, 172, 197])
df = pd.DataFrame.from_dict(data)
df['p_dying'] = df['deaths']/n_ind
Logistic regression
1 Binary classification
In this section, we consider the classification task. In particular, we focus here on binary classification, which means that a datapoint can either be in class \(0\) or in class \(1\).
Let us consider a concrete dataset, the toxicity dataset. In this dataset, scientists were studying the toxicity of a substance for a population of insects. They administered a series of toxic doses and measured the number of deaths in a population of \(250\) individuals. The results of the study are shown in Figure 1.
Figure code
px.bar(data_frame=df, x='doses', y='deaths')
Here, the goal of the classification is to predict, given the number of doses, whether an individual is dead (class \(1\)) or alive (class \(0\)).
2 Logistic regression
As you can already guess from the title of this section, the goal here is to transform the classification problem, which is discrete and hard to optimize over, into a regression problem. For the latter, we already know what to do (see previous lessons!). In the current example, we can estimate the probability of dying for each value of the dose by dividing the number of deaths by the total number of individuals (here \(250\)). From here, we can translate the classification task into a regression fit of this probability. A common parametrization used in machine learning is the sigmoid function:
\[ \sigma(x) =\frac{1}{1+e^{-(ax+b)}}\]
From here, one can show that \[\log\left(\frac{\sigma(x)}{1-\sigma(x)}\right)=ax+b.\]
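To make this step explicit, note that from the definition of \(\sigma(x)\)
\[\begin{aligned} 1-\sigma(x) &= \frac{e^{-(ax+b)}}{1+e^{-(ax+b)}}, \\ \frac{\sigma(x)}{1-\sigma(x)} &= \frac{1}{e^{-(ax+b)}} = e^{ax+b}, \end{aligned}\]
so that taking the logarithm of the ratio indeed yields \(ax+b\).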
As in regression, our goal is now to find the appropriate parameters \(a,b\) so as to approximate the true distribution.
def sigmoid(x, a, b):
    return 1/(1 + np.exp(-(a*x + b)))
Figure 2 shows a trial fit for the values \(a=0.5\) and \(b=-2\).
Figure code
a, b = 0.5, -2
vecd = np.arange(0,7,0.1)
Fig = go.Figure()
Fig.add_bar(x=df['doses'], y=df['p_dying'], name='toxicity dataset')
Fig.add_trace(go.Scatter(name=f'sigmoid: a={a}, b={b}',x=vecd,y=sigmoid(vecd,a,b)))
Fig.update_layout(xaxis_title='doses', yaxis_title='p')
3 The dataset
Until now, we have considered the viewpoint of the whole population. But generally, the dataset is built from the data of each individual, to which is associated a tuple \((x,y)\), where \(x\) is the number of doses and \(y\) is the class (\(1\): dead, \(0\): alive). To this end, we construct a dataset with \(n_\text{ind}\times n_\text{doses}\) individuals.
# Build one (dose, outcome) pair per individual: outcome 1 = dead, 0 = alive
for i, d in zip(df['doses'], df['deaths']):
    vec1 = np.zeros((n_ind, 2))
    vec1[:, 0], vec1[:d, 1] = i, 1
    vec = vec1 if i == 1 else np.concatenate((vec, vec1))
np.random.shuffle(vec)
x, y = vec[:, 0], vec[:, 1]
print(x[:20])
print(y[:20])
[4. 2. 2. 1. 4. 4. 6. 5. 6. 1. 4. 3. 1. 6. 5. 6. 5. 4. 1. 4.]
[1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0.]
4 Loss function: Binary Cross Entropy
We now want to find the values of \(a\) and \(b\) automatically. To this end, we have to define a loss function. Here we will use the binary cross entropy (BCE):
\[ BCE = -\frac{1}{N}\sum_{i=1}^N \Big( y_i \log [\sigma(x_i,a,b)] +(1-y_i)\log[1-\sigma(x_i,a,b)]\Big).\]
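The loss-landscape code below calls a helper BCE(x, y, fun, params=...) that is defined elsewhere in the course material. For reference, here is a minimal sketch consistent with that call signature; the eps clipping is our own addition to guard against log(0).

import numpy as np

def BCE(x: np.ndarray,       # x data of N elements
        y: np.ndarray,       # labels (0 or 1) of N elements
        fun,                 # model function, e.g. sigmoid(x, a, b)
        params: dict,        # parameters of the function, e.g. dict(a=..., b=...)
        eps: float = 1e-12,  # safeguard against log(0) (our assumption)
        ) -> float:
    '''Binary cross entropy between the labels y and the predictions fun(x, a, b).'''
    yp = np.clip(fun(x, params['a'], params['b']), eps, 1 - eps)
    return -np.mean(y*np.log(yp) + (1 - y)*np.log(1 - yp))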
Let us have a look at the loss landscape of the binary cross entropy with respect to the parameters \(a\) and \(b\), as depicted in Figure 3. We observe that the landscape is convex for this choice of loss function. In this particular case, this can be proven.
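One way to convince ourselves (a sketch of the standard argument): write \(z_i = a x_i + b\) and consider a single term of the BCE as a function of \(z\), \(f(z) = -\big(y\log\sigma(z)+(1-y)\log(1-\sigma(z))\big)\) with \(\sigma(z)=1/(1+e^{-z})\). Then
\[ f'(z)=\sigma(z)-y, \qquad f''(z)=\sigma(z)\big(1-\sigma(z)\big)\ge 0, \]
so each term is convex in \(z\); since \(z_i\) is an affine function of \((a,b)\), and sums and affine compositions preserve convexity, the BCE is convex in \((a,b)\).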
Code to generate the data for the plot
vec_a = np.arange(-5,5,0.1)
vec_b = np.arange(-5,5,0.1)
matz = np.zeros((vec_a.size,vec_b.size))
for i, a1 in enumerate(vec_a):
    for j, b1 in enumerate(vec_b):
        p = dict(a=a1, b=b1)
        matz[i, j] = BCE(x, y, sigmoid, params=p)
Figure code
fig = go.Figure()
fig.add_contour(z=matz,x=vec_b, y=vec_a,hovertemplate=
'a:%{y:.2f}'
+'<br>b:%{x:.2f}</br>'
+'f:%{z:.2f}<extra></extra>')
d = dict(width=600,
height=600,
xaxis={'title':'b'},
yaxis={'title':'a'}
)
fig.update_layout(d)
fig.show()
What is the value of the loss for your guess? Is it close to the minimum?
5 Gradient of the Loss function
Let us compute the gradient with respect to \(a\) and \(b\).
What is the gradient of the sigmoid function \(\sigma(x)\)?
\(\sigma'(x) = \sigma(x) (1-\sigma(x))\)
We find
\[\begin{aligned} \partial_a BCE &= \frac{1}{N}\sum_{i=1}^N (\sigma_i -y_i)x_i\\ \partial_b BCE &= \frac{1}{N}\sum_{i=1}^N (\sigma_i -y_i),\\ \end{aligned}\]where we have defined \(\sigma_i\equiv \sigma(ax_i+b)\).
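To make the intermediate step explicit, apply the chain rule to the BCE and use the derivative of the sigmoid recalled above:
\[\begin{aligned} \partial_a BCE &= -\frac{1}{N}\sum_{i=1}^N \left[\frac{y_i}{\sigma_i}-\frac{1-y_i}{1-\sigma_i}\right]\sigma_i(1-\sigma_i)\,x_i \\ &= -\frac{1}{N}\sum_{i=1}^N \big[y_i(1-\sigma_i)-(1-y_i)\sigma_i\big]x_i = \frac{1}{N}\sum_{i=1}^N (\sigma_i-y_i)\,x_i, \end{aligned}\]
and the same computation with \(x_i\) replaced by \(1\) gives \(\partial_b BCE\).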
Although the problem is convex, there is no simple closed-form solution for the optimal parameters. This comes from the non-linearity introduced by the sigmoid. Libraries like scikit-learn therefore rely on iterative solvers, such as (quasi-)Newton or conjugate-gradient methods. In the next section, we will consider the stochastic gradient descent approach.
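Before moving on, and as a quick sanity check (not part of the original lesson flow), one could fit the same one-dimensional model with scikit-learn's LogisticRegression, which is used again later for the breast cancer data, and read off the fitted \(a\) and \(b\) from coef_ and intercept_. A minimal sketch; the large C is our way of making the default regularization negligible, so the fit approximates the plain BCE minimization:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit p(dead | dose) with scikit-learn; large C makes the default L2 penalty negligible.
clf = LogisticRegression(C=1e6)
clf.fit(x.reshape(-1, 1), y)   # scikit-learn expects a 2D feature array
a_sk, b_sk = clf.coef_[0, 0], clf.intercept_[0]
print(f'a = {a_sk:.3f}, b = {b_sk:.3f}')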
6 Stochastic gradient descent
Implement the function grad_BCE(x, y, p), where p=dict(a=a, b=b) is a dictionary containing the parameters \(a\) and \(b\). The function should return an array of the form np.array([grad_a, grad_b]).
### Your Code Here!
def grad_BCE(x, y, p):
    return np.array([grad_a, grad_b])
Solution
def grad_BCE(x: np.ndarray,    # x data of N elements
             y: np.ndarray,    # y data of N elements
             params: dict,     # parameters of the function
             ) -> np.ndarray:  # gradients
    '''Computes the gradient of the binary cross entropy loss function with respect to $a$ and $b$ and returns np.array([partial_a BCE, partial_b BCE])'''
    def sigmoid(x, a, b):
        return 1/(1 + np.exp(-(a*x + b)))
    a, b = params['a'], params['b']
    yp = sigmoid(x, a, b)
    ga, gb = np.mean((yp - y)*x), np.mean(yp - y)
    return np.array([ga, gb])
Having computed the gradients, we can now apply the stochastic gradient descent algorithm.
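The sgd helper used below comes from a previous lesson and is not reproduced here. The following is only a rough sketch consistent with how it is called (trackers for 'a', 'b' and the loss recorded at every update); the learning rate, the per-sample update scheme, and evaluating the loss on the full dataset are all assumptions on our side:

import numpy as np

def sgd(x, y, params, ll, niter=100, lr=0.1):
    '''Rough sketch of the SGD helper assumed here.
    ll is a dict with keys 'loss', 'grads' and 'fun'; params is dict(a=..., b=...).
    Returns a dict of per-update trackers for 'a', 'b' and 'loss'.'''
    p = dict(params)
    trackers = {'a': [], 'b': [], 'loss': []}
    rng = np.random.default_rng(0)
    for _ in range(niter):                 # epochs
        for i in rng.permutation(x.size):  # one update per (shuffled) data point
            g = ll['grads'](x[i:i+1], y[i:i+1], p)
            p['a'] -= lr*g[0]
            p['b'] -= lr*g[1]
            trackers['a'].append(p['a'])
            trackers['b'].append(p['b'])
            # track the full-dataset loss after every update (an assumption)
            trackers['loss'].append(ll['loss'](x, y, ll['fun'], params=p))
    return trackers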
a0, b0 = 2, 1
pini = dict(a=a0, b=b0)
ll = dict(loss=BCE, grads=grad_BCE, fun=sigmoid)
trackers = sgd(x, y, pini, ll, niter=int(1.5E2))
Figure 4 shows the trajectory of the stochastic gradient descent algorithm.
Figure code
amin, amax = np.min(trackers['a']), np.max(trackers['a'])
bmin, bmax = np.min(trackers['b']), np.max(trackers['b'])
n = 100
stepa, stepb =(amax-amin)/n, (bmax-bmin)/n
vec_a = np.arange(amin-19*stepa, amax+20*stepa, stepa)
vec_b = np.arange(bmin-19*stepb, bmax+20*stepb,stepb)
matz = np.zeros((vec_a.size,vec_b.size))
for i, a1 in enumerate(vec_a):
    for j, b1 in enumerate(vec_b):
        p = dict(a=a1, b=b1)
        matz[i, j] = BCE(x, y, sigmoid, params=p)
fig = go.Figure()
fig.add_contour(z=matz,x=vec_b, y=vec_a,
hovertemplate=
'a:%{y:.2f}'
+'<br>b:%{x:.2f}</br>'
+'f:%{z:.2f}<extra></extra>')
mask = np.arange(0,len(trackers['a']),100)
veca, vecb, vecl = np.array(trackers['a'])[mask],np.array(trackers['b'])[mask], np.array(trackers['loss'])[mask]
fig.add_scatter(x=vecb, y=veca, name=f'SGD',text=vecl, mode='lines+markers',
hovertemplate=
'a:%{y:.2f}'
+'<br>b:%{x:.2f}</br>'
+'f:%{text:.2f}<extra></extra>')
fig.show()
7 Metrics
So far we have talked a lot about loss functions. We now introduce a related but distinct object: metrics. While training minimizes a loss function (e.g. through SGD), what we ultimately care about is usually a metric of our choice. There are some important differences between the two:
- Crucially, metrics are typically not differentiable, making them unsuitable for gradient-based training of an ML model.
- However, metrics are usually more natural and simpler to understand.
That is, we train with a loss function, but we also keep an eye on one or two relevant metrics that help us understand what is really going on with our model. Metrics are typically used to tune hyperparameters and analyze the trained model, but it is also possible to keep track of metrics during training.
7.1 Accuracy
The most typical, and perhaps most natural, metric when dealing with classification problems is the accuracy, which we define as:
\[ \text{acc}= \frac{\# \text{correct predictions}}{\#\text{dataset}},\]
i.e. the ratio between the number of correct predictions and the number of elements in our training set. Compared to the BCE, a value of this metric is quite easy to understand: close to 0 is really bad, close to 1 is really good.
veca, vecb, vecl = np.array(trackers['a']), np.array(trackers['b']), np.array(trackers['loss'])
mask = np.arange(0,len(veca),100)
veca, vecb, vecl = veca[mask], vecb[mask], vecl[mask]
Write the function accuracy(x, y, a, b), which returns the accuracy for a given dataset and for the parameters \(a\) and \(b\). Choose the label with the following rule: \(0\) if \(\sigma_i<0.5\) else \(1\).
### Your Code Here!
def accuracy(x, y, a, b):
    return acc
Solution
def accuracy(x, y, a, b):
    yp = sigmoid(x, a, b)
    yp[yp > 0.5] = 1
    yp[yp <= 0.5] = 0
    return np.sum(yp == y)/y.size
Now that we can calculate the accuracy, let's track it through training. Let us use the saved veca and vecb from above to avoid retraining:
vec_acc = np.zeros_like(veca)
for i, (a, b) in enumerate(zip(veca, vecb)):
    vec_acc[i] = accuracy(x, y, a, b)
Figure 5 shows the value of the loss function and the accuracy during training. It is interesting to note that while the loss is a smooth function, the accuracy is not, and the variations of the loss do not directly correspond to the variations of the accuracy. It is therefore important to check not only the loss function but also the figure of merit of the problem, which can be, for example, the accuracy.
Figure code
# Create figure with secondary y-axis
from plotly.subplots import make_subplots
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter(y=vecl, name="Loss"),
secondary_y=False,
)
fig.add_trace(
go.Scatter(y=vec_acc, name="Accuracy", visible = 'legendonly'),
secondary_y=True,
)
# Set x-axis title
fig.update_xaxes(title_text="iterations")
# Set y-axes titles
fig.update_yaxes(title_text="Loss", secondary_y=False)
fig.update_yaxes(title_text="Accuracy", secondary_y=True)
fig.show()
7.2 Confusion matrix
While the accuracy is quite understandable, it may not give us enough information to understand why our model is actually failing. For instance, why is the previous accuracy not 1? Where is our model making mistakes? The confusion matrix is a very handy tool for this. Before introducing it, we need to introduce a few terms; the following table helps us with that:
| | Positive (Prediction) | Negative (Prediction) |
|---|---|---|
| Positive (Ground Truth) | True Positive (TP) | False Negative (FN) |
| Negative (Ground Truth) | False Positive (FP) | True Negative (TN) |
Now the confusion matrix is just the previous table, but in matrix form:
\[ C = \begin{bmatrix} \text{TP} & \text{FN} \\ \text{FP} & \text{TN} \end{bmatrix} \]
Let us now construct the confusion matrix for our model. We use scikit-learn's confusion_matrix, which orders rows and columns by label (\(0\), then \(1\)); with the layout above, the class with label \(0\) (alive) therefore plays the role of the positive class.
from sklearn.metrics import confusion_matrix

yp = sigmoid(x, a, b)  # a, b hold the last parameters from the loop above
yp[yp > 0.5] = 1
yp[yp <= 0.5] = 0
cm = confusion_matrix(y, yp)
Figure 6 shows the confusion matrix for the logistic regression.
Figure code
X, Y = ["Alive(P)", "Dead(P)"], ["Alive(GT)", "Dead(GT)"]
fig = px.imshow(cm, x=X, y=Y, text_auto=True,color_continuous_scale='Blues')
fig.show()
How do you interpret the previous confusion matrix? What is the model doing wrong?
We readily observe that the accuracy corresponds to the trace of this matrix normalized by the number of individuals.
print(f'{np.trace(cm)/np.sum(cm):.3f}')
0.714
7.3 Other metrics
There exist many other metrics that can be computed from the confusion matrix. In particular, the following are commonly used:
| Metric | Formula | Intuition |
|---|---|---|
| Precision | \(\frac{TP}{TP+FP}\) | Measures how many of the predicted positives are truly positive (focuses on correctness) |
| Recall | \(\frac{TP}{TP+FN}\) | Measures how many of the actual positives are correctly detected (focuses on completeness) |
| F1 score | \(\frac{2\text{Recall}\,\text{Precision}}{\text{Recall}+\text{Precision}}\) | Balances both by taking their harmonic mean, rewarding models that are both precise and sensitive |
Compute the different metrics for the trained logistic regression model. Bonus: compute them as a function of the training iterations.
### Your Code Here!
Solution
precision = cm[0,0]/(cm[0,0]+cm[1,0])
recall = cm[0,0]/(cm[0,0]+cm[0,1])
F1_score = 2*recall*precision/(precision+recall)
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1 score: {F1_score:.3f}')
Solution Bonus
vec_prec = np.zeros_like(veca)
vec_rec = np.zeros_like(veca)
vec_f1 = np.zeros_like(veca)
for i, (a, b) in enumerate(zip(veca, vecb)):
    yp = sigmoid(x, a, b)
    yp[yp > 0.5] = 1
    yp[yp <= 0.5] = 0
    cm = confusion_matrix(y, yp)
    if (cm.diagonal() == 0).any():
        # skip degenerate cases (e.g. no true positives) that would lead to division by zero
        continue
    vec_prec[i] = cm[0, 0]/(cm[0, 0] + cm[1, 0])
    vec_rec[i] = cm[0, 0]/(cm[0, 0] + cm[0, 1])
    vec_f1[i] = 2*vec_prec[i]*vec_rec[i]/(vec_prec[i] + vec_rec[i])
Figure code
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter(y=vecl, name="Loss"),
secondary_y=False,
)
fig.add_trace(
go.Scatter(y=vec_prec, name="Precision", visible = 'legendonly'),
secondary_y=True,
)
fig.add_trace(
go.Scatter(y=vec_rec, name="Recall", visible = 'legendonly'),
secondary_y=True,
)
fig.add_trace(
go.Scatter(y=vec_f1, name="F1-score", visible = 'legendonly'),
secondary_y=True,
)
# Set x-axis title
fig.update_xaxes(title_text="iterations")
# Set y-axes titles
fig.update_yaxes(title_text="Loss", secondary_y=False)
fig.update_yaxes(title_text="Accuracy", secondary_y=True)
fig.show()Let us perform the same analysis on another dataset, the brest cancer dataset of scikit learn.
from sklearn.datasets import load_breast_cancer

x1, y1 = load_breast_cancer(return_X_y=True)
x1 = x1[:, 3]  # keep a single feature (mean area)
print(x1[-10:])
print(y1[-10:])
[ 403.5 600.4 386. 716.9 1347. 1479. 1261. 858.1 1265. 181. ]
[1 1 1 0 0 0 0 0 0 1]
from sklearn.linear_model import LogisticRegression

x1 = x1.reshape([np.size(x1), 1])  # scikit-learn expects a 2D feature array
clf = LogisticRegression().fit(x1, y1)
yp1 = clf.predict(x1)
cm = confusion_matrix(y1, yp1)
Figure code
X1, Y1 = ["Malignous(P)", "Benign(P)"], ["Malignous(GT)", "Benign(GT)"]
fig = px.imshow(cm, x=X1, y=Y1, text_auto=True,color_continuous_scale='Blues')
fig.show()
Compute the different metrics for the trained logistic regression model.
Solution
precision = cm[0,0]/(cm[0,0]+cm[1,0])
recall = cm[0,0]/(cm[0,0]+cm[0,1])
F1_score = 2*recall*precision/(precision+recall)
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1 score: {F1_score:.3f}')
Precision: 0.892
Recall: 0.783
F1 score: 0.834
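As a cross-check (our addition, not part of the original solution), the same quantities can be obtained directly from scikit-learn's metric functions, which should reproduce the values above; note that pos_label=0 is needed because, with the labels used here, class 0 (malignant) is treated as the positive class:

from sklearn.metrics import precision_score, recall_score, f1_score

# pos_label=0: the malignant class has label 0 in the breast cancer dataset
print(f'Precision: {precision_score(y1, yp1, pos_label=0):.3f}')
print(f'Recall:    {recall_score(y1, yp1, pos_label=0):.3f}')
print(f'F1 score:  {f1_score(y1, yp1, pos_label=0):.3f}')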