import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

n_ind = 250
data = dict(doses=range(1, 7), deaths=[28, 53, 93, 126, 172, 197])
df = pd.DataFrame.from_dict(data)
df['p_dying'] = df['deaths']/n_ind
Logistic regression
1 Binary classification
In this section, we consider the classification task. In particular, we focus here on binary classification, which means that a datapoint can either be in class \(0\) or in class \(1\).
Let us consider a concrete dataset, the toxicity dataset. In this dataset, scientists were studying the toxicity of a substance for a population of insects. They administered a series of toxic doses and measured the number of deaths in a population of \(250\) individuals. The results of the study are shown in Figure 1.
Figure code
px.bar(data_frame=df, x='doses', y='deaths')
Here, the goal of the classification is to predict, given the number of doses, whether an individual is dead (class \(1\)) or alive (class \(0\)).
2 Logistic regression
As you can already guess from the title of this section, the goal here is to transform the classification problem, which is discrete and hard to optimize over, into a regression problem. For the latter, we already know what to do (see previous lessons!). In the current example, we can estimate the probability of dying for each value of the dose by dividing the number of deaths by the total number of individuals (here \(250\)). From here, we can translate the classification task into a regression fit of this probability. A common parametrization used in machine learning is the sigmoid function:
\[ \sigma(x) =\frac{1}{1+e^{-(ax+b)}}\]
From here, one can show that \[\log\left(\frac{\sigma(x)}{1-\sigma(x)}\right)=ax+b.\]
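To make this step explicit, note that from the definition of \(\sigma(x)\)
\[\begin{aligned} 1-\sigma(x) &= \frac{e^{-(ax+b)}}{1+e^{-(ax+b)}}, \\ \frac{\sigma(x)}{1-\sigma(x)} &= \frac{1}{e^{-(ax+b)}} = e^{ax+b}, \end{aligned}\]
so that taking the logarithm of the ratio indeed yields \(ax+b\).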
As in regression, our goal is now to find the appropriate parameters \(a,b\) so as to approximate the true distribution.
def sigmoid(x, a, b):
    return 1/(1 + np.exp(-(a*x + b)))
Figure 2 shows a trial fit for the values \(a=0.5\) and \(b=-2\).
Figure code
a, b = 0.5, -2
vecd = np.arange(0,7,0.1)
Fig = go.Figure()
Fig.add_bar(x=df['doses'], y=df['p_dying'], name='toxicity dataset')
Fig.add_trace(go.Scatter(name=f'sigmoid: a={a}, b={b}',x=vecd,y=sigmoid(vecd,a,b)))
Fig.update_layout(xaxis_title='doses', yaxis_title='p')
3 The dataset
Until now, we have considered the viewpoint of the whole population. But generally, the dataset is built from the data of each individual, to which is associated a tuple \((x,y)\), where \(x\) is the number of doses and \(y\) is the class (\(1\): dead, \(0\): alive). To this end, we construct a dataset with \(n_\text{ind}\times n_\text{doses}\) individuals.
# Build one (dose, outcome) pair per individual: outcome 1 = dead, 0 = alive
for i, d in zip(df['doses'], df['deaths']):
    vec1 = np.zeros((n_ind, 2))
    vec1[:, 0], vec1[:d, 1] = i, 1
    vec = vec1 if i == 1 else np.concatenate((vec, vec1))
np.random.shuffle(vec)
x, y = vec[:, 0], vec[:, 1]
print(x[:20])
print(y[:20])
[4. 2. 2. 1. 4. 4. 6. 5. 6. 1. 4. 3. 1. 6. 5. 6. 5. 4. 1. 4.]
[1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0.]
4 Loss function: Binary Cross Entropy
We now want to find the values of \(a\) and \(b\) automatically. To this end, we have to define a loss function. Here we will use the binary cross entropy (BCE):
\[ BCE = -\frac{1}{N}\sum_{i=1}^N \Big( y_i \log [\sigma(x_i,a,b)] +(1-y_i)\log[1-\sigma(x_i,a,b)]\Big).\]
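The loss-landscape code below calls a helper BCE(x, y, fun, params=...) that is defined elsewhere in the course material. For reference, here is a minimal sketch consistent with that call signature; the eps clipping is our own addition to guard against log(0).

import numpy as np

def BCE(x: np.ndarray,       # x data of N elements
        y: np.ndarray,       # labels (0 or 1) of N elements
        fun,                 # model function, e.g. sigmoid(x, a, b)
        params: dict,        # parameters of the function, e.g. dict(a=..., b=...)
        eps: float = 1e-12,  # safeguard against log(0) (our assumption)
        ) -> float:
    '''Binary cross entropy between the labels y and the predictions fun(x, a, b).'''
    yp = np.clip(fun(x, params['a'], params['b']), eps, 1 - eps)
    return -np.mean(y*np.log(yp) + (1 - y)*np.log(1 - yp))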
Let us have a look at the loss landscape of the binary cross entropy with respect to the parameters \(a\) and \(b\), as depicted in Figure 3. We observe that the landscape is convex for this choice of loss function. In this particular case, this can be proven.
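One way to convince ourselves (a sketch of the standard argument): write \(z_i = a x_i + b\) and consider a single term of the BCE as a function of \(z\), \(f(z) = -\big(y\log\sigma(z)+(1-y)\log(1-\sigma(z))\big)\) with \(\sigma(z)=1/(1+e^{-z})\). Then
\[ f'(z)=\sigma(z)-y, \qquad f''(z)=\sigma(z)\big(1-\sigma(z)\big)\ge 0, \]
so each term is convex in \(z\); since \(z_i\) is an affine function of \((a,b)\), and sums and affine compositions preserve convexity, the BCE is convex in \((a,b)\).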
Code to generate the data for the plot
vec_a = np.arange(-5,5,0.1)
vec_b = np.arange(-5,5,0.1)
matz = np.zeros((vec_a.size,vec_b.size))
for i, a1 in enumerate(vec_a):
    for j, b1 in enumerate(vec_b):
        p = dict(a=a1, b=b1)
        matz[i, j] = BCE(x, y, sigmoid, params=p)
Figure code
fig = go.Figure()
fig.add_contour(z=matz,x=vec_b, y=vec_a,hovertemplate=
'a:%{y:.2f}'
+'<br>b:%{x:.2f}</br>'
+'f:%{z:.2f}<extra></extra>')
d = dict(width=600,
height=600,
xaxis={'title':'b'},
yaxis={'title':'a'}
)
fig.update_layout(d)
fig.show()
What is the value of the loss for your guess? Is it close to the minimum?
5 Gradient of the Loss function
Let us compute the gradient with respect to \(a\) and \(b\).
What is the gradient of the sigmoid function \(\sigma(x)\)?
\(\sigma'(x) = \sigma(x) (1-\sigma(x))\)
We find
\[\begin{aligned} \partial_a BCE &= \frac{1}{N}\sum_{i=1}^N (\sigma_i -y_i)x_i\\ \partial_b BCE &= \frac{1}{N}\sum_{i=1}^N (\sigma_i -y_i),\\ \end{aligned}\]where we have defined \(\sigma_i\equiv \sigma(ax_i+b)\).
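To make the intermediate step explicit, apply the chain rule to the BCE and use the derivative of the sigmoid recalled above:
\[\begin{aligned} \partial_a BCE &= -\frac{1}{N}\sum_{i=1}^N \left[\frac{y_i}{\sigma_i}-\frac{1-y_i}{1-\sigma_i}\right]\sigma_i(1-\sigma_i)\,x_i \\ &= -\frac{1}{N}\sum_{i=1}^N \big[y_i(1-\sigma_i)-(1-y_i)\sigma_i\big]x_i = \frac{1}{N}\sum_{i=1}^N (\sigma_i-y_i)\,x_i, \end{aligned}\]
and the same computation with \(x_i\) replaced by \(1\) gives \(\partial_b BCE\).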
Although the problem is convex, there is no simple closed-form solution for the optimal parameters. This comes from the non-linearity introduced by the sigmoid. Libraries like scikit-learn therefore rely on iterative solvers, such as (quasi-)Newton or conjugate-gradient methods. In the next section, we will consider the stochastic gradient descent approach.
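Before moving on, and as a quick sanity check (not part of the original lesson flow), one could fit the same one-dimensional model with scikit-learn's LogisticRegression, which is used again later for the breast cancer data, and read off the fitted \(a\) and \(b\) from coef_ and intercept_. A minimal sketch; the large C is our way of making the default regularization negligible, so the fit approximates the plain BCE minimization:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit p(dead | dose) with scikit-learn; large C makes the default L2 penalty negligible.
clf = LogisticRegression(C=1e6)
clf.fit(x.reshape(-1, 1), y)   # scikit-learn expects a 2D feature array
a_sk, b_sk = clf.coef_[0, 0], clf.intercept_[0]
print(f'a = {a_sk:.3f}, b = {b_sk:.3f}')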
6 Stochastic gradient descent
Implement the function grad_BCE(x, y, p), where p=dict(a=a, b=b) is a dictionary containing the parameters \(a\) and \(b\). The function should return an array of the form np.array([grad_a, grad_b]).
### Your Code Here!
def grad_BCE(x, y, p):
    return np.array([grad_a, grad_b])
Solution
def grad_BCE(x: np.ndarray,    # x data of N elements
             y: np.ndarray,    # y data of N elements
             params: dict,     # parameters of the function
             ) -> np.ndarray:  # gradients
    '''Computes the gradient of the binary cross entropy loss function with respect to $a$ and $b$ and returns np.array([partial_a BCE, partial_b BCE])'''
    def sigmoid(x, a, b):
        return 1/(1 + np.exp(-(a*x + b)))
    a, b = params['a'], params['b']
    yp = sigmoid(x, a, b)
    ga, gb = np.mean((yp - y)*x), np.mean(yp - y)
    return np.array([ga, gb])
Having computed the gradients, we can now apply the stochastic gradient descent algorithm.
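The sgd helper used below comes from a previous lesson and is not reproduced here. The following is only a rough sketch consistent with how it is called (trackers for 'a', 'b' and the loss recorded at every update); the learning rate, the per-sample update scheme, and evaluating the loss on the full dataset are all assumptions on our side:

import numpy as np

def sgd(x, y, params, ll, niter=100, lr=0.1):
    '''Rough sketch of the SGD helper assumed here.
    ll is a dict with keys 'loss', 'grads' and 'fun'; params is dict(a=..., b=...).
    Returns a dict of per-update trackers for 'a', 'b' and 'loss'.'''
    p = dict(params)
    trackers = {'a': [], 'b': [], 'loss': []}
    rng = np.random.default_rng(0)
    for _ in range(niter):                 # epochs
        for i in rng.permutation(x.size):  # one update per (shuffled) data point
            g = ll['grads'](x[i:i+1], y[i:i+1], p)
            p['a'] -= lr*g[0]
            p['b'] -= lr*g[1]
            trackers['a'].append(p['a'])
            trackers['b'].append(p['b'])
            # track the full-dataset loss after every update (an assumption)
            trackers['loss'].append(ll['loss'](x, y, ll['fun'], params=p))
    return trackers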
a0, b0 = 2, 1
pini = dict(a=a0, b=b0)
ll = dict(loss=BCE, grads=grad_BCE, fun=sigmoid)
trackers = sgd(x, y, pini, ll, niter=int(1.5E2))
Figure 4 shows the trajectory of the stochastic gradient descent algorithm.
Figure code
amin, amax = np.min(trackers['a']), np.max(trackers['a'])
bmin, bmax = np.min(trackers['b']), np.max(trackers['b'])
n = 100
stepa, stepb =(amax-amin)/n, (bmax-bmin)/n
vec_a = np.arange(amin-19*stepa, amax+20*stepa, stepa)
vec_b = np.arange(bmin-19*stepb, bmax+20*stepb,stepb)
matz = np.zeros((vec_a.size,vec_b.size))
for i, a1 in enumerate(vec_a):
    for j, b1 in enumerate(vec_b):
        p = dict(a=a1, b=b1)
        matz[i, j] = BCE(x, y, sigmoid, params=p)
fig = go.Figure()
fig.add_contour(z=matz,x=vec_b, y=vec_a,
hovertemplate=
'a:%{y:.2f}'
+'<br>b:%{x:.2f}</br>'
+'f:%{z:.2f}<extra></extra>')
mask = np.arange(0,len(trackers['a']),100)
veca, vecb, vecl = np.array(trackers['a'])[mask],np.array(trackers['b'])[mask], np.array(trackers['loss'])[mask]
fig.add_scatter(x=vecb, y=veca, name=f'SGD',text=vecl, mode='lines+markers',
hovertemplate=
'a:%{y:.2f}'
+'<br>b:%{x:.2f}</br>'
+'f:%{text:.2f}<extra></extra>')
fig.show()
7 Metrics
So far we have talked a lot about loss functions. We now introduce a related but distinct object: metrics. While training minimizes a loss function (e.g. through SGD), what we ultimately care about is usually a metric of our choice. There are some important differences between the two:
- Crucially, metrics are typically not differentiable, making them unsuitable for gradient-based training of an ML model.
- However, metrics are usually more natural and simpler to understand.
That is, we train with a loss function, but we also keep an eye on one or two relevant metrics that help us understand what is really going on with our model. Metrics are typically used to tune hyperparameters and analyze the trained model, but it is also possible to keep track of metrics during training.
7.1 Accuracy
The most typical, and perhaps most natural, metric when dealing with classification problems is the accuracy, which we define as:
\[ \text{acc}= \frac{\# \text{correct predictions}}{\#\text{dataset}},\]
i.e. the ratio between the number of correct predictions and the number of elements in our training set. Compared to the BCE, a value of this metric is quite easy to understand: close to 0 is really bad, close to 1 is really good.
veca, vecb, vecl = np.array(trackers['a']), np.array(trackers['b']), np.array(trackers['loss'])
mask = np.arange(0,len(veca),100)
veca, vecb, vecl = veca[mask], vecb[mask], vecl[mask]
Write the function accuracy(x, y, a, b), which returns the accuracy for a given dataset and for the parameters \(a\) and \(b\). Choose the label with the following rule: \(0\) if \(\sigma_i<0.5\) else \(1\).
### Your Code Here!
def accuracy(x, y, a, b):
    return acc
Solution
def accuracy(x, y, a, b):
    yp = sigmoid(x, a, b)
    yp[yp > 0.5] = 1
    yp[yp <= 0.5] = 0
    return np.sum(yp == y)/y.size
Now that we can calculate the accuracy, let's track it through training. Let us use the saved veca and vecb from above to avoid retraining:
vec_acc = np.zeros_like(veca)
for i, (a, b) in enumerate(zip(veca, vecb)):
    vec_acc[i] = accuracy(x, y, a, b)
Figure 5 shows the value of the loss function and the accuracy during training. It is interesting to note that while the loss is a smooth function, the accuracy is not, and the variations of the loss do not directly correspond to the variations of the accuracy. It is therefore important to check not only the loss function but also the figure of merit of the problem, which can be, for example, the accuracy.
Figure code
# Create figure with secondary y-axis
from plotly.subplots import make_subplots
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter(y=vecl, name="Loss"),
secondary_y=False,
)
fig.add_trace(
go.Scatter(y=vec_acc, name="Accuracy", visible = 'legendonly'),
secondary_y=True,
)
# Set x-axis title
fig.update_xaxes(title_text="iterations")
# Set y-axes titles
fig.update_yaxes(title_text="Loss", secondary_y=False)
fig.update_yaxes(title_text="Accuracy", secondary_y=True)
fig.show()
7.2 Confusion matrix
While the accuracy is quite understandable, it may not give us enough information to understand why our model is actually failing. For instance, why is the previous accuracy not 1? Where is our model making mistakes? The confusion matrix is a very handy tool for this. Before introducing it, we need to introduce a few terms; the following table helps us with that:
| | Positive (Prediction) | Negative (Prediction) |
|---|---|---|
| Positive (Ground Truth) | True Positive (TP) | False Negative (FN) |
| Negative (Ground Truth) | False Positive (FP) | True Negative (TN) |
Now the confusion matrix is just the previous table, but in matrix form:
\[ C = \begin{bmatrix} \text{TP} & \text{FN} \\ \text{FP} & \text{TN} \end{bmatrix} \]
Let us now construct the confusion matrix for our model. We use scikit-learn's confusion_matrix, which orders rows and columns by label (\(0\), then \(1\)); with the layout above, the class with label \(0\) (alive) therefore plays the role of the positive class.
from sklearn.metrics import confusion_matrix

yp = sigmoid(x, a, b)  # a, b hold the last parameters from the loop above
yp[yp > 0.5] = 1
yp[yp <= 0.5] = 0
cm = confusion_matrix(y, yp)
Figure 6 shows the confusion matrix for the logistic regression.
Figure code
X, Y = ["Alive(P)", "Dead(P)"], ["Alive(GT)", "Dead(GT)"]
fig = px.imshow(cm, x=X, y=Y, text_auto=True,color_continuous_scale='Blues')
fig.show()
How do you interpret the previous confusion matrix? What is the model doing wrong?
We readily observe that the accuracy corresponds to the trace of this matrix normalized by the number of individuals.
print(f'{np.trace(cm)/np.sum(cm):.3f}')
0.714
7.3 Other metrics
There exist many other metrics that can be computed from the confusion matrix. In particular, the following are commonly used:
| Metric | Formula | Intuition |
|---|---|---|
| Precision | \(\frac{TP}{TP+FP}\) | Measures how many of the predicted positives are truly positive (focuses on correctness) |
| Recall | \(\frac{TP}{TP+FN}\) | Measures how many of the actual positives are correctly detected (focuses on completeness) |
| F1 score | \(\frac{2\text{Recall}\,\text{Precision}}{\text{Recall}+\text{Precision}}\) | Balances both by taking their harmonic mean, rewarding models that are both precise and sensitive |
Compute the different metrics for the trained logistic regression model. Bonus: compute them as a function of the training iterations.
### Your Code Here!
Solution
precision = cm[0,0]/(cm[0,0]+cm[1,0])
recall = cm[0,0]/(cm[0,0]+cm[0,1])
F1_score = 2*recall*precision/(precision+recall)
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1 score: {F1_score:.3f}')
Solution Bonus
vec_prec = np.zeros_like(veca)
vec_rec = np.zeros_like(veca)
vec_f1 = np.zeros_like(veca)
for i, (a, b) in enumerate(zip(veca, vecb)):
    yp = sigmoid(x, a, b)
    yp[yp > 0.5] = 1
    yp[yp <= 0.5] = 0
    cm = confusion_matrix(y, yp)
    if (cm.diagonal() == 0).any():
        # skip degenerate cases (e.g. no true positives) that would lead to division by zero
        continue
    vec_prec[i] = cm[0, 0]/(cm[0, 0] + cm[1, 0])
    vec_rec[i] = cm[0, 0]/(cm[0, 0] + cm[0, 1])
    vec_f1[i] = 2*vec_prec[i]*vec_rec[i]/(vec_prec[i] + vec_rec[i])
Figure code
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter(y=vecl, name="Loss"),
secondary_y=False,
)
fig.add_trace(
go.Scatter(y=vec_prec, name="Precision", visible = 'legendonly'),
secondary_y=True,
)
fig.add_trace(
go.Scatter(y=vec_rec, name="Recall", visible = 'legendonly'),
secondary_y=True,
)
fig.add_trace(
go.Scatter(y=vec_f1, name="F1-score", visible = 'legendonly'),
secondary_y=True,
)
# Set x-axis title
fig.update_xaxes(title_text="iterations")
# Set y-axes titles
fig.update_yaxes(title_text="Loss", secondary_y=False)
fig.update_yaxes(title_text="Accuracy", secondary_y=True)
fig.show()Let us perform the same analysis on another dataset, the brest cancer dataset of scikit learn.
from sklearn.datasets import load_breast_cancer

x1, y1 = load_breast_cancer(return_X_y=True)
x1 = x1[:, 3]  # keep a single feature (mean area)
print(x1[-10:])
print(y1[-10:])
[ 403.5 600.4 386. 716.9 1347. 1479. 1261. 858.1 1265. 181. ]
[1 1 1 0 0 0 0 0 0 1]
from sklearn.linear_model import LogisticRegression

x1 = x1.reshape([np.size(x1), 1])  # scikit-learn expects a 2D feature array
clf = LogisticRegression().fit(x1, y1)
yp1 = clf.predict(x1)
cm = confusion_matrix(y1, yp1)
Figure code
X1, Y1 = ["Malignous(P)", "Benign(P)"], ["Malignous(GT)", "Benign(GT)"]
fig = px.imshow(cm, x=X1, y=Y1, text_auto=True,color_continuous_scale='Blues')
fig.show()
Compute the different metrics for the trained logistic regression model.
Solution
precision = cm[0,0]/(cm[0,0]+cm[1,0])
recall = cm[0,0]/(cm[0,0]+cm[0,1])
F1_score = 2*recall*precision/(precision+recall)
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1 score: {F1_score:.3f}')
Precision: 0.892
Recall: 0.783
F1 score: 0.834
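As a cross-check (our addition, not part of the original solution), the same quantities can be obtained directly from scikit-learn's metric functions, which should reproduce the values above; note that pos_label=0 is needed because, with the labels used here, class 0 (malignant) is treated as the positive class:

from sklearn.metrics import precision_score, recall_score, f1_score

# pos_label=0: the malignant class has label 0 in the breast cancer dataset
print(f'Precision: {precision_score(y1, yp1, pos_label=0):.3f}')
print(f'Recall:    {recall_score(y1, yp1, pos_label=0):.3f}')
print(f'F1 score:  {f1_score(y1, yp1, pos_label=0):.3f}')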