Linear regression
1 The task
In this introductory notebook, we discuss our first learning algorithm to perform a regression task. Given a dataset \(\{\mathbf{x},\mathbf{y}\}\) of \(n\) points, we would like to find the line \(y'=\mathbf{w}^{T} \mathbf{x} + \mathbf{b}\) that best fits the data. Let us start by generating such a dataset for the one-dimensional case. We do so by taking the line \(y=a^*x+b^*\) and adding Gaussian noise to \(y\). We have prepared a small package lectures_ml with functionalities to do these tasks easily.
Code
a_true, b_true = 1.5, 1
x, y = noisy_line(a_true, b_true, noise=[0, 2])
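If you are curious what such a helper does, here is a minimal sketch; the name noisy_line_sketch, the (mean, std) convention for noise, and all defaults are assumptions of ours, and the packaged noisy_line (which you can inspect with show_code) may differ.
Code
import numpy as np

def noisy_line_sketch(a, b, noise=(0, 2), n=100, x_range=(-5, 5), seed=None):
    """Sample n points from y = a*x + b plus Gaussian noise with the
    assumed (mean, std) convention for `noise`."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(*x_range, size=n)          # random abscissas
    y = a*x + b + rng.normal(noise[0], noise[1], size=n)  # noisy ordinates
    return x, y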
You can access the documentation of any function by pressing the tab key or by adding a ? after the function. You can also see the source code by adding ?? after it. If you want these to appear in a cell of the notebook, you can use nbdev.show_doc() for the documentation and lectures_ml.utils.show_code() for the source.
Figure 1 shows the dataset \(\{x,y\}\). As expected, the data follow the linear relation (in red), but with some dispersion due to the noise.
Code
fig = go.Figure()
fig.add_scatter(x=x, y=y, mode="markers", name='data',
                hovertemplate='x:%{x:.2f}'
                + '<br>y:%{y:.2f}</br><extra></extra>')
x1 = np.array([x.min(), x.max()])
y1 = a_true*x1 + b_true
fig.add_scatter(x=x1, y=y1, mode="lines", name='Ground Truth')
fig.update_layout(width=800, height=400, xaxis={'title': 'x'}, yaxis={'title': 'y'})
fig.show()
2 Learning as an optimization problem
The goal of the learning task is to find the slope and the intercept of the line directly from the data. Therefore, we have to define a suitable model to solve the task with the given data. In general, the model is a function of the input data, \(f(\mathbf{x})\), whose output is interpreted as a prediction for the input data. We start by declaring a certain parametrization of a model (function), e.g., \(f(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + \mathbf{b}\), with \(\theta = \{\mathbf{w}, \mathbf{b}\}\) denoting the model parameters. All possible parametrizations of this function then form a set of functions known as the hypothesis class. Given that both \(x\) and \(y\) are one-dimensional in our example, let us consider \(f_\theta(x) = a x + b\), where \(a\) and \(b\) are real numbers.
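For concreteness, here is a minimal sketch of this parametrized model in the dict-of-parameters style used by the lectures_ml helpers later on; the packaged line function may differ in its exact signature.
Code
def line_sketch(x, params):
    """Hypothetical stand-in for the lectures_ml `line` helper:
    evaluates f(x) = a*x + b with parameters stored in a dict."""
    return params['a']*x + params['b']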
Machines "learn" by minimizing a loss function of the training data, i.e., all the data accessible to the ML model during the learning process. The minimization is done by tuning the parameters of the model. We need to choose the loss function according to the task at hand, although there is some freedom in how to do so. In general, the loss function compares the model's predictions against the ground truth or our expectations. Learning thereby becomes an optimization problem.
Here, we use the terms loss, error, and cost function1 interchangeably, following Ref. (Goodfellow, Bengio, and Courville 2016). Popular examples of loss functions include the mean square error and the cross entropy, used for supervised regression and classification2 problems, respectively.
3 The loss function: Mean square error
Having a model, we now have to define a loss function for our regression task. For this case, we choose the mean square error, defined as \[MSE=\frac{1}{N}\sum_{i=1}^{N}(y_i'-y_i)^2.\] This loss measures the mean squared vertical distance between the dataset and the line \(y'=a x + b\) (see Figure 2).

There is no unique loss function suitable for our task. We could have chosen other losses such as, e.g., the mean absolute error (MAE) or the root mean squared error (RMSE). The choice of loss really depends on the problem and the dataset.
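For reference, here are minimal NumPy sketches of these three losses. We assume the model is passed as a callable; the packaged MSE used below may accept slightly different arguments.
Code
def mse_sketch(x, y, f):
    """Mean square error of the model f on the dataset (x, y)."""
    return np.mean((f(x) - y)**2)

def mae_sketch(x, y, f):
    """Mean absolute error: less sensitive to outliers than the MSE."""
    return np.mean(np.abs(f(x) - y))

def rmse_sketch(x, y, f):
    """Root mean squared error: same units as y."""
    return np.sqrt(mse_sketch(x, y, f))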
Let us now study the loss function in terms of its two parameters \(\{a,b\}\) for our dataset \(\{x,y\}\). Figure 3 shows a contour plot of the logarithm of the loss function in terms of \(a\) and \(b\). The minimum clearly appears at the parameter values of the line we generated in the previous section.
Code generating the data of the figure
vec_a = np.arange(-5, 5, 0.1)
vec_b = np.arange(-5, 5, 0.1)
matz, matzg = np.zeros((vec_a.size, vec_b.size)), np.zeros((vec_a.size, vec_b.size, 2))
vec = np.zeros((vec_a.size*vec_b.size, 3))
for i, a1 in enumerate(vec_a):
    for j, b1 in enumerate(vec_b):
        matz[i, j] = MSE(x, y, lambda x: a1*x + b1)
        matzg[i, j, :] = grad_MSE_lr(x, y, dict(a=a1, b=b1))
Code
fig = go.Figure()
fig.add_contour(z=np.log(matz), x=vec_b, y=vec_a,
                hovertemplate='a:%{y:.2f}'
                + '<br>b:%{x:.2f}</br>'
                + 'f:%{z:.2f}<extra></extra>')
fig.add_scatter(x=[b_true], y=[a_true], marker_color='White')
d = dict(width=600,
         height=600,
         xaxis={'title': 'b'},
         yaxis={'title': 'a'})
fig.update_layout(d)
fig.show()
4 Finding the minimum of the loss function
In the case of the mean square error, we can derive analytically the optimal values of \(a\) and \(b\). To this end, we start by writing the gradients \[ \begin{align} &\partial_a MSE=\frac{2}{N}\sum_{i=1}^{N}(y_i'-y_i)x_i\\ &\partial_b MSE=\frac{2}{N}\sum_{i=1}^{N}(y_i'-y_i). \end{align} \]
This leads to the linear system of equations for \(a\) and \(b\) when the gradients vanish \[ \begin{align} &a \sum_{i=1}^N x_i^2+b \sum_{i=1}^N x_i - \sum_{i=1}^N y_i x_i =0\\ &a \sum_{i=1}^N x_i+b N -\sum_{i=1}^N y_i =0 \end{align} \]
We can easily solve this system of equations to find
\[ \begin{align} & b = \bar{y} - a \bar{x}\\ & a = \frac{\sum_{i=1}^N (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^N(x_i-\bar{x})^2}, \end{align} \] where \(\bar{x}\) (\(\bar{y}\)) stands for the mean of \(x\) (\(y\)). As this problem is convex, we have found the unique global minimum.
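The gradients derived above are also what the grad_MSE_lr helper used in the contour-plot code computes. Here is a minimal sketch, assuming it returns the components in the order \((\partial_a MSE, \partial_b MSE)\); the packaged version may order them differently.
Code
def grad_mse_lr_sketch(x, y, params):
    """Sketch of grad_MSE_lr: gradient of the MSE for y' = a*x + b."""
    r = params['a']*x + params['b'] - y   # residuals y'_i - y_i
    return np.array([2*np.mean(r*x),      # dMSE/da
                     2*np.mean(r)])       # dMSE/db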
Implement a function linear_regression_analytic(x,y) to compute the analytical optimal values for the slope and intercept given a dataset with samples x and y, such as the one we have created above.
### Your Code Here!
def linear_regression_analytic(x, y):
    pass

estimate_a, estimate_b = linear_regression_analytic(x, y)
print(f'a={estimate_a:.3f}\nb={estimate_b:.3f}')
Solution
def linear_regression_analytic(x, y):
    xb, yb = np.mean(x), np.mean(y)
    a = np.sum((x - xb)*(y - yb)) / np.sum((x - xb)**2)
    b = yb - a*xb
    return a, b
We have just performed our first learning task!
5 Gradient Descent
In general, we do not have a tractable closed-form expression for the optimal parameters, and we need to solve the optimization task through other means. Here, we introduce gradient-based approaches which, despite not being needed for this task, will allow us to introduce important concepts that will reappear in a more abstract form in neural networks.
Let us first study the gradients. Figure 4 shows the gradients of the MSE with respect to \(a\) and \(b\). The values of \(a\) and \(b\) of the line lie in the zero contour lines of the gradients.
Code
for i in range(2):
    mat = matzg[:, :, i]
    vmax = np.abs(mat).max()  # symmetric range around 0
    fig = go.Figure()
    fig.add_contour(
        z=mat,
        x=vec_b,
        y=vec_a,
        colorscale='RdBu',  # diverging colormap centered on zero
        zmin=-vmax,
        zmax=vmax,
        colorbar_title="Value"
    )
    fig.add_scatter(x=[b_true], y=[a_true], marker_color='white')
    fig.update_layout(
        xaxis_title='b',
        yaxis_title='a'
    )
    fig.show()
We can now perform a gradient optimization. The simplest one is the gradient descent algorithm (often called the steepest descent algorithm). This iterative algorithm works as follows: starting from an initial guess \(x_0\), repeatedly update \(x_{t+1} = x_t - \eta\, \nabla f(x_t)\), where the step size \(\eta\) is called the learning rate.
Implement the previous pseudocode to find the minimum of \(f(x)=x^2\). This convex function has a unique global minimum at \(x=0\) and we can compute its gradient analytically.
# Here are the functions we will use
def f(x): return x**2
def grad_f(x): return 2*x
Given the initial \(x_0\), perform n_iter iterations of the gradient descent algorithm.
### Your Code Here!
def gd_step(x0, grad_func):
    pass
Code
# Solution
def gd_step(x0, grad_func):
    x1 = x0 - eta * grad_func(x0)  # eta is the (global) learning rate
    return x1
Once you have your gradient step ready, put it to the test by creating a loop that performs the pseudocode above. Keep track of the values of \(x\) and \(f(x)\) to see how they evolve. Do 20 iterations of GD.
#### Your Code Here!
Code
# Solution
n_iter = 20
x0 = 2
eta = 1E-1
# keep track of the value of x
vecx = np.zeros(n_iter+1)
# and also the value of the function
vecf = np.zeros(n_iter+1)
vecx[0] = x0
vecf[0] = f(x0)
for i in np.arange(n_iter):
    vecx[i+1] = gd_step(vecx[i], grad_f)
    vecf[i+1] = f(vecx[i+1])
Code
# Solution
fig = go.Figure()
x1 = np.arange(-2.5, 2.51, 0.01)
y1 = f(x1)
fig.add_scatter(x=x1, y=y1, mode="lines", name='Parabola', marker_color='#EF553B', visible='legendonly')
fig.add_scatter(x=vecx, y=vecf, mode="markers", name='GD',
                hovertemplate='x:%{x:.2f}'
                + '<br>y:%{y:.2f}</br><extra></extra>', marker_color='#636EFA', marker_size=8, visible='legendonly')
fig.update_layout(width=800, height=400, xaxis={'title': 'x'}, yaxis={'title': 'f(x)'})
fig.show()
Figure 5 shows a nice convergence of the algorithm to the global minimum \(x=0\).
Let us now come back to our linear regression problem. We consider n_ini random initial values for our parameters and run the gradient descent algorithm. Rather than writing the whole algorithm again, we use the gradient_descent function from the lectures_ml library.
n_ini = 5
veca0 = np.random.uniform(low=vec_a[1], high=vec_a[-2], size=n_ini)
vecb0 = np.random.uniform(low=vec_b[1], high=vec_b[-2], size=n_ini)
ll = dict(loss=MSE, grads=grad_MSE_lr, fun=line)
df = pd.DataFrame(columns=['a', 'b', 'label', 'value'])
for i in range(n_ini):
    pini = dict(a=veca0[i], b=vecb0[i])
    trackers = gradient_descent(x, y, pini, ll, niter=int(1E4), eta=1E-3)
    df1 = pd.DataFrame(data={'a': trackers['a'], 'b': trackers['b'], 'label': f'traj {i+1}', 'value': trackers['loss']})
    df = pd.concat([d.dropna(axis=1, how="all") for d in (df, df1)])
Figure 6 depicts the loss functions in terms of the epochs for the different trajectories. The initial value of the loss function varies strongly depending on the initial conditions. However, we observe that the steepest descent algorithm rapidly drives the parameters towards the minimum.
Code
fig = px.scatter(df, y='value', animation_frame='label')
fig["layout"].pop("updatemenus")  # optional, drop animation buttons
fig.update_layout(xaxis_title='epochs', yaxis_title='Loss')
fig.show()
In ML, it is usually much more illustrative to see the evolution of the loss function on a log scale:
Code
fig = px.scatter(df, y='value', animation_frame='label')
fig["layout"].pop("updatemenus")  # optional, drop animation buttons
fig.update_layout(xaxis_title='epochs', yaxis_title='Loss',
                  yaxis_type='log', xaxis_type='log')
fig.show()
Figure 8 shows the trajectories in the parameter space.
Code
fig = go.Figure()
fig.add_contour(z=np.log(matz), x=vec_b, y=vec_a,
                hovertemplate='a:%{y:.2f}'
                + '<br>b:%{x:.2f}</br>'
                + 'f:%{z:.2f}<extra></extra>')
for i in range(n_ini):
    visible = True if i == 0 else 'legendonly'
    newdf = df[df.label == f'traj {i+1}']
    fig.add_scatter(x=newdf.b, y=newdf.a, name=f'traj {i+1}', text=newdf.value,
                    hovertemplate='a:%{y:.2f}'
                    + '<br>b:%{x:.2f}</br>'
                    + 'f:%{text:.2f}<extra></extra>', visible=visible)
legend = dict(
    yanchor="top",
    y=1.3,
    xanchor="left",
    x=0.1
)
d = dict(width=800,
         height=600,
         xaxis={'title': 'b'},
         yaxis={'title': 'a'},
         legend=legend)
fig.update_layout(d)
fig.show()
6 Choosing a Learning rate
Choosing a learning rate has an impact on convergence to the minimum, as depicted in Figure 9.
- If the learning rate is too small, the training needs many epochs.
- The right learning rate allows for fast convergence to a minimum; it typically has to be found by trial and error.
- If the learning rate is too large, optimization can take you away from the minimum (you "overshoot").

Let us first illustrate the latter on the parabola example.
threshold = 1E-6  # Minimum difference between f_t and f_t+1 at which we stop the iterations
imax = int(1E4)   # Maximum number of iterations
# Initial guess
x0 = 2
# Learning rate
eta = 1E-3
# Saving the info
vecx, vecf = [x0], [f(x0)]
x1 = x0
i = 0
dl = 10
while dl > threshold and i < imax:
    i = i + 1
    x1 = x1 - eta * grad_f(x1)
    vecx.append(x1)
    vecf.append(f(x1))
    dl = np.abs(vecf[-1] - vecf[-2])
    if vecf[-1] > 1000.: break
Code
fig = go.Figure()
x1 = np.arange(-2.5, 2.51, 0.01)
y1 = x1**2
fig.add_scatter(x=x1, y=y1, mode="lines", name='Parabola', marker_color='#EF553B')
fig.add_scatter(x=vecx, y=vecf, mode="lines+markers", name='GD',
                hovertemplate='x:%{x:.2f}'
                + '<br>y:%{y:.2f}</br><extra></extra>', marker_color='#636EFA', marker_size=8)
fig.update_layout(width=800, height=400, xaxis={'title': 'x'}, yaxis={'title': 'f(x)'},
                  title=f'number of iterations to reach the threshold {threshold:.0e}: {i}')
fig.show()
Rerun the last experiment for \(\eta=10^{-3},10^{-1},1.1\). What do you see?
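A compact way to run this sweep is sketched below, reusing f and grad_f from above; the 50-step budget is an arbitrary choice, and the loop overwrites the globals eta and x1.
Code
# Sketch of the suggested learning-rate sweep on the parabola.
for eta in [1e-3, 1e-1, 1.1]:
    x1 = 2.0                           # same initial guess as before
    for _ in range(50):
        x1 = x1 - eta * grad_f(x1)     # plain gradient descent step
    print(f'eta={eta:g}: x after 50 steps = {x1:.4g}, f(x) = {f(x1):.4g}')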
We now perform a similar analysis for the linear regression problem. To this end, we choose a vector of learning rates vec_eta for the same initial condition and we apply the steepest descent algorithm.
Code
vec_eta = [1E-4, 1E-3, 1E-2, 2E-2, 3E-2, 5E-2, 1E-1]
n_ini = len(vec_eta)
pini = dict(a=-1.8, b=1)
df = pd.DataFrame(columns=['a', 'b', 'label', 'value'])
for i in range(n_ini):
    trackers = gradient_descent(x, y, pini, ll, niter=int(1E4), eta=vec_eta[i])
    df1 = pd.DataFrame(data={'a': trackers['a'], 'b': trackers['b'], 'label': f'traj {i+1}', 'eta': vec_eta[i], 'value': trackers['loss']})
    df = pd.concat([d.dropna(axis=1, how="all") for d in (df, df1)])
Code
fig = go.Figure()
fig.add_contour(z=np.log(matz), x=vec_b, y=vec_a,
                hovertemplate='a:%{y:.2f}'
                + '<br>b:%{x:.2f}</br>'
                + 'f:%{z:.2f}<extra></extra>')
for i in range(n_ini):
    visible = 'legendonly'
    newdf = df[df.label == f'traj {i+1}']
    fig.add_scatter(x=newdf.b, y=newdf.a, name=f'eta = {vec_eta[i]}', text=newdf.value,
                    hovertemplate='a:%{y:.2f}'
                    + '<br>b:%{x:.2f}</br>'
                    + 'f:%{text:.2f}<extra></extra>',
                    visible=visible)
legend = dict(
    yanchor="top",
    y=1.3,
    xanchor="left",
    x=0.01
)
d = dict(width=800,
         height=600,
         xaxis={'title': 'b'},
         yaxis={'title': 'a'},
         legend=legend,
         xaxis_range=[vec_b[1], vec_b[-1]],
         yaxis_range=[vec_a[1], vec_a[-1]])
fig.update_layout(d)
fig.show()
7 Non-convex problems
For convex cases such as the one above, the gradient descent algorithm is guaranteed to converge to the global minimum for sufficiently small \(\eta\). For non-convex problems, it can instead get stuck in local minima. Indeed, in practical ML trainings, we hardly ever reach the global optimum, but it is usually sufficient to reach a local one that is good enough. Let us see a visual example of this:
def f_nc(x):
    return (x + 1)**2 * (x - 2)**2 + 2*x

def grad_f_nc(x):
    return 2*(x + 1)*(x - 2)*(2*x - 1) + 2  # the 2*x term contributes 2
We now proceed to do the same descent from two different starting points in the parameter space:
n_iter = 20
eta = 1E-2
# Point one: converges to a local minimum
x0 = 2.5
vecx = np.zeros(n_iter+1)
vecf = np.zeros(n_iter+1)
vecx[0] = x0
vecf[0] = f_nc(x0)
for i in np.arange(n_iter):
    vecx[i+1] = gd_step(vecx[i], grad_f_nc)
    vecf[i+1] = f_nc(vecx[i+1])
# Point two: converges to the global minimum
x0 = -1.4
vecx_div = np.zeros(n_iter+1)
vecf_div = np.zeros(n_iter+1)
vecx_div[0] = x0
vecf_div[0] = f_nc(x0)
for i in np.arange(n_iter):
    vecx_div[i+1] = gd_step(vecx_div[i], grad_f_nc)
    vecf_div[i+1] = f_nc(vecx_div[i+1])
Code
fig = go.Figure()
x1 = np.arange(-2, 3, 0.01)
y1 = f_nc(x1)
fig.add_scatter(x=x1, y=y1, mode="lines", name='f_nc', marker_color='#EF553B')
fig.add_scatter(x=vecx, y=vecf, mode="markers", name='GD local minimum',
                hovertemplate='x:%{x:.2f}'
                + '<br>y:%{y:.2f}</br><extra></extra>', marker_color='#636EFA', marker_size=8)
fig.add_scatter(x=vecx_div, y=vecf_div, mode="markers", name='GD global minimum',
                hovertemplate='x:%{x:.2f}'
                + '<br>y:%{y:.2f}</br><extra></extra>', marker_color='#2ECC71', marker_size=8)
fig.update_layout(width=800, height=400, xaxis={'title': 'x'}, yaxis={'title': 'f(x)'})
fig.show()
This showcases the importance, in non-convex cases (which cover most ML problems), of performing multiple random initializations of our model/training, because:
- We may have not found the correct solution because of an “unlucky” start.
- We may have found the correct solution by luck, and restarting the training does not find it again. We refer here to the "robustness" of the model: a robust model functions under any conditions.
8 Stochastic Gradient Descent
The gradient descent algorithm requires a pass through the whole training set to compute the gradient. In some cases, this can be quite costly; imagine, for example, linear regression with many variables and many training examples. To overcome this limitation, computer scientists have designed a stochastic alternative to gradient descent: stochastic gradient descent (SGD).
While stochastic gradient descent is not really needed for a linear regression with two parameters, it becomes essential for neural networks. Here, we take advantage of the simple loss landscape of our model to illustrate the main properties of stochastic gradient descent.
The main idea behind stochastic gradient descent is to approximate the gradient of the loss over the whole training set by the gradient computed on a single or just a few training samples. While each individual step is then a relatively poor approximation, the random walk followed by the algorithm eventually converges along the direction of steepest descent. This can be seen intuitively by noting that the mean of the gradients over several training points points towards the steepest descent.
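We can check this numerically: gradients computed on random minibatches scatter around the full-batch gradient, but their mean is close to it. A sketch, assuming the dataset \(\{x,y\}\) from above has at least 10 samples:
Code
rng = np.random.default_rng(0)
a0, b0 = 0.0, 0.0                      # evaluate all gradients at the same parameters

def batch_grad(xb, yb):
    r = a0*xb + b0 - yb                # residuals y'_i - y_i on the (mini)batch
    return np.array([2*np.mean(r*xb), 2*np.mean(r)])

full = batch_grad(x, y)                # full-batch gradient
minis = []
for _ in range(200):
    idx = rng.choice(x.size, size=10, replace=False)  # one random minibatch
    minis.append(batch_grad(x[idx], y[idx]))
print('full-batch gradient:     ', full)
print('mean minibatch gradient: ', np.mean(minis, axis=0))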
We now have two extreme cases: the gradient descent algorithm with no stochasticity, and the stochastic gradient descent with full stochasticity. The fully stochastic version can be very unstable and take extremely long to converge. Thus, it is desirable to find a middle ground: minibatch gradient descent. In this case, rather than taking the gradient over a single training example, we consider a batch size \(BS\), i.e., the number of training samples used in each stochastic gradient step. This way, we obtain a better estimate of the gradient while preserving some of its stochasticity.
The pseudocode looks like: shuffle the training set at each epoch, split it into minibatches of size \(BS\), and perform one gradient step per minibatch using only the samples it contains.
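A minimal sketch of such a minibatch loop for our linear model follows; the sgd function from lectures_ml used below may differ in details such as tracking and stopping criteria.
Code
def sgd_sketch(x, y, params, eta=1e-2, bs=20, epochs=10, rng=None):
    """Minibatch SGD for y' = a*x + b; params is a dict with keys 'a', 'b'."""
    if rng is None:
        rng = np.random.default_rng()
    a, b = params['a'], params['b']
    for _ in range(epochs):
        perm = rng.permutation(x.size)        # reshuffle the data every epoch
        for start in range(0, x.size, bs):
            idx = perm[start:start + bs]      # indices of one minibatch
            r = a*x[idx] + b - y[idx]         # residuals on the minibatch
            a -= eta * 2*np.mean(r*x[idx])    # step along -dMSE/da
            b -= eta * 2*np.mean(r)           # step along -dMSE/db
    return dict(a=a, b=b)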
We illustrate the stochastic gradient descent with the following code snippet for the same initial condition and for a minibatch of size BS=20.
n_ini = 5
pini = dict(a=2, b=1)
df = pd.DataFrame(columns=['a', 'b', 'label', 'value', 'niter'])
# Let's first consider the gradient descent as before
trackers = gradient_descent(x, y, pini, ll, niter=int(1E3))
df1 = pd.DataFrame(data={'a': trackers['a'], 'b': trackers['b'], 'label': 'GD', 'value': trackers['loss'], 'niter': np.arange(len(trackers['a']))})
df = pd.concat([d.dropna(axis=1, how="all") for d in (df, df1)])
# And now consider instead SGD
for i in range(n_ini):
    trackers = sgd(x, y, pini, ll, niter=int(1E2), bs=20)
    df1 = pd.DataFrame(data={'a': trackers['a'], 'b': trackers['b'], 'niter': np.arange(len(trackers['a'])), 'label': f'traj {i+1}', 'value': trackers['loss']})
    df = pd.concat([d.dropna(axis=1, how="all") for d in (df, df1)])
Code
fig = px.line(df, y='value', markers=True, animation_frame='label')
fig["layout"].pop("updatemenus")  # optional, drop animation buttons
fig.update_layout(xaxis_title='iterations', yaxis_title='Loss')
fig.show()
Figure 13 depicts the loss function of the gradient descent and the stochastic gradient descent algorithms for different shufflings. While both algorithms converge to a similar value of the loss function, we can nicely observe the fluctuations coming from the stochasticity of the minibatches3. The latter can also be seen in Figure 14. It is interesting to notice in that last figure that the stochastic gradient descent fluctuates more in the \(a\)-direction. This behavior is well known for SGD and can be improved with more advanced algorithms such as momentum, Nesterov, or Adam.
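For intuition, here is a minimal sketch of a momentum (heavy-ball) update, which damps such oscillations by averaging successive gradients; this is an illustration of the idea, not the lectures_ml implementation, and Nesterov and Adam refine it further.
Code
def momentum_step(p, v, grad, eta=1e-2, beta=0.9):
    """One heavy-ball update; p: parameter, v: velocity (initialize to 0)."""
    v = beta * v - eta * grad   # exponential moving average of past steps
    return p + v, v
Because the velocity averages successive gradients, components that flip sign between steps (like the fluctuating \(a\)-direction) partially cancel, while consistent components accumulate.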
Code
amin, amax = df.a.min()*0.8, df.a.max()*1.1
bmin, bmax = df.b.min()*0.8, df.b.max()*1.1
n = 100
vec_a = np.arange(amin, amax, (amax - amin)/n)
vec_b = np.arange(bmin, bmax, (bmax - bmin)/n)
matz = np.zeros((vec_a.size, vec_b.size))
for i, a1 in enumerate(vec_a):
    for j, b1 in enumerate(vec_b):
        params = dict(a=a1, b=b1)
        matz[i, j] = MSE(x, y, line, params)
fig = go.Figure()
fig.add_contour(z=np.log(matz), x=vec_b, y=vec_a,
                hovertemplate='a:%{y:.2f}'
                + '<br>b:%{x:.2f}</br>'
                + 'f:%{z:.2f}<extra></extra>')
for i in range(n_ini):
    visible = True if i == 0 else 'legendonly'
    newdf = df[df.label == f'traj {i+1}']
    fig.add_scatter(x=newdf.b, y=newdf.a, name=f'traj {i+1}', text=newdf.value, mode='lines+markers',
                    hovertemplate='a:%{y:.2f}'
                    + '<br>b:%{x:.2f}</br>'
                    + 'f:%{text:.2f}<extra></extra>',
                    visible=visible)
newdf = df[df.label == 'GD']
fig.add_scatter(x=newdf.b, y=newdf.a, name='GD', text=newdf.value,
                mode='lines', line={'dash': 'dash', 'color': 'White'},
                hovertemplate='a:%{y:.2f}'
                + '<br>b:%{x:.2f}</br>'
                + 'f:%{text:}<extra></extra>')
legend = dict(
    yanchor="top",
    y=1.3,
    xanchor="left",
    x=0.01
)
d = dict(width=800,
         height=600,
         xaxis={'title': 'b'},
         yaxis={'title': 'a'},
         legend=legend)
fig.update_layout(d)
fig.show()
Rerun the last experiment with different minibatch sizes. What do you see?
We finish this section by observing how the line adjusts to our dataset over the iterations for GD and SGD. The results are presented in Figure 15 for the gradient descent.
Code generating the data of the figure
i = 1
label = 'GD'  # change to f'traj {i+1}' if you want to see an SGD trajectory
x1 = np.array([x.min(), x.max()])
newdf = df[df.label == label]
a, b, mse = newdf.a.to_numpy(), newdf.b.to_numpy(), newdf.value.to_numpy()
y1 = np.einsum('i,j->ij', a, x1) + np.tile(b, (2, 1)).T
Code
frames = [go.Frame(data=[go.Scatter(x=x1, y=y1[i, :], mode='lines')],
                   layout=go.Layout(title_text=f'step:{i}, MSE:{mse[i]:.2f}'))
          for i in range(a.size)]
buttons = [dict(label="Play", method="animate",
                args=[None, {"frame": {"duration": 100, "redraw": True},
                             "fromcurrent": True,
                             "transition": {"duration": 300, "easing": "quadratic-in-out"}}]),
           dict(label="Pause", method="animate",
                args=[[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate", "transition": {"duration": 0}}]),
           dict(label="Restart", method="animate",
                args=[None])]
Fig = go.Figure(
    data=[go.Scatter(x=x1, y=y1[0, :], mode='lines', name='line'),
          go.Scatter(x=x, y=y, mode="markers", name='data',
                     hovertemplate='x:%{x:.2f}'
                     + '<br>y:%{y:.2f}</br><extra></extra>')],
    layout=go.Layout(
        xaxis=dict(range=[x.min()-2, x.max()+2], autorange=False),
        yaxis=dict(range=[y.min()-2, y.max()+2], autorange=False),
        updatemenus=[dict(
            type="buttons",
            buttons=buttons)]
    ),
    frames=frames
)
Fig.show()
References
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Footnotes
The literature also uses the terms criterion, cost, error, or objective function. Their definitions are not very strict. Following (Goodfellow, Bengio, and Courville 2016): "The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms." For example, the loss function may be defined for a single data point, while the cost or error function may be a sum of loss functions, so check the definitions used in each paper.↩︎
For classification, a more intuitive measure of performance could be, e.g., accuracy, which is the ratio between the number of correctly classified examples and the dataset size. Note, however, that gradient-based optimization requires measures of performance that are smooth and differentiable. These conditions distinguish loss functions from evaluation metrics such as accuracy, recall, precision, etc.↩︎
Beware that the notion of iteration is different for gradient descent and for stochastic gradient descent. For the former, an iteration corresponds to an epoch (the whole training set), while for the latter it corresponds to a minibatch.↩︎