Gorka Muñoz-Gil, Marcin Płodzień and Borja Requena
In the next few lessons we will dive into the world of neural networks and deep learning. We will first start with the basics, understanding how a very simple neural network works and the different algorithms that exist for training them.
1 Sigmoid perceptron
In this section we will implement a training algorithm for a single perceptron with sigmoid activation function. A perceptron looks like this:
Figure code
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def plot_nn(V=5, H=1, r=0.12, spacing=4, figsize=(6, 4), alpha=0.8):
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_aspect('equal')
    ax.axis('off')

    # x positions for layers
    xs = {'visible': 0.0, 'hidden': 1.5, 'output': 3.0}

    # vertical spacing scales with radius
    spacing = spacing * r

    # y positions depend on radius and number of neurons
    y_vis = [(i - (V - 1) / 2) * spacing for i in range(V)]
    y_hid = [(i - (H - 1) / 2) * spacing for i in range(H)]
    y_out = [0.0]

    # Draw nodes
    def draw_nodes(x, ys, face, edge):
        for y in ys:
            ax.add_patch(Circle((x, y), r, facecolor=face, edgecolor=edge, lw=2))

    draw_nodes(xs['visible'], y_vis, face='#e6f0ff', edge='#1f4acc')
    draw_nodes(xs['hidden'], y_hid, face='#e8ffe6', edge='#2a8f2a')
    draw_nodes(xs['output'], y_out, face='#fff0e6', edge='#c75e1a')

    # Connections
    for y1 in y_vis:
        for y2 in y_hid:
            ax.plot([xs['visible'] + r, xs['hidden'] - r], [y1, y2],
                    lw=1.3, alpha=alpha, c='k', zorder=-1)
    for y2 in y_hid:
        ax.plot([xs['hidden'] + r, xs['output'] - r], [y2, y_out[0]],
                lw=1.5, alpha=alpha, c='k', zorder=-1)

    # Layer labels
    ax.text(xs['visible'], min(y_vis) - 0.5, r'input $x$', ha='center', fontsize=10)
    ax.text(xs['hidden'], min(y_hid) - 0.5, r'hidden $z$', ha='center', fontsize=10)
    ax.text(xs['output'], -0.5, r'output $z$', ha='center', fontsize=10)

    ax.set_xlim(-0.6, 3.6)
    ax.set_ylim(min(y_vis[0], y_hid[0]) - 1, max(y_vis[-1], y_hid[-1]) + 1)
    plt.tight_layout()
    plt.show()

plot_nn(V=5)
We consider a data set of \(n\) tuples \((x_i, y_i)\) that we denote as \(X\in\mathbb{R}^{n\times m}\) and \(Y\in\mathbb{R}^n\). Here, \(m\) denotes the number of features in our samples. The perceptron consists of the following (a minimal sketch in code follows the list):
A linear transformation \(f: \mathbb{R}^m \mapsto \mathbb{R}^h\) of the form \(z_i = x_i^T W + \mathbf{b}\,\). Here, \(W\in\mathbb{R}^{m\times h}\) are the weights, and \(\mathbf{b}\in\mathbb{R}^h\) are the biases.
A nonlinear transformation \(\hat{y}_i = \sigma(z_i)\), in this case: the sigmoid function \(\sigma(z_i) = \frac{1}{1 + e^{-z_i}}\).
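A minimal NumPy sketch of these two steps (the sizes and random values below are illustrative assumptions, not part of the lesson's dataset):
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n samples, m features, h outputs
n, m, h = 4, 3, 1
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))   # data matrix, X in R^{n x m}
W = rng.normal(size=(m, h))   # weights, W in R^{m x h}
b = np.zeros(h)               # biases, b in R^h

z = X @ W + b                 # linear transformation, shape (n, h)
y_hat = sigmoid(z)            # nonlinear transformation, shape (n, h)
print(y_hat.shape)            # (4, 1)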
1.1 Sigmoid perceptron update
The training of a perceptron consists of the iterative update of its parameters, \(W\) and \(\mathbf{b}\), in order to minimize the loss function \(L\). Here, we will consider the mean-squared error \(L = \frac{1}{2n}||Y - \hat{Y}||^2\).
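As a sketch of the resulting update (assuming this MSE loss and the sigmoid activation; \(\odot\) denotes element-wise multiplication), the chain rule gives
\[\begin{equation}
\begin{split}
\frac{\partial L}{\partial W} &= \frac{1}{n}\, X^T\left[(\hat{Y}-Y)\odot \hat{Y}\odot(1-\hat{Y})\right],\\
\frac{\partial L}{\partial \mathbf{b}} &= \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)\odot\hat{y}_i\odot(1-\hat{y}_i),
\end{split}
\end{equation}\] and the gradient-descent step \(W \leftarrow W - \eta \frac{\partial L}{\partial W}\), \(\mathbf{b} \leftarrow \mathbf{b} - \eta \frac{\partial L}{\partial \mathbf{b}}\), with learning rate \(\eta\).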
2 Neural network with a single hidden layer
Here, we will consider a neural network with a single hidden layer and sigmoid activation functions.
Figure code
plot_nn(V=5, H=4)
2.1 Network definition
In a neural network with a single hidden layer, we have two perceptrons: one between the input and the hidden layer, and one between the hidden layer and the output.
The input layer has the same size as the number of features in our data, i.e., \(m\) neurons. Then, the hidden layer has \(h\) neurons, and the output layer has as many neurons as classes \(c\). In a regression task, \(c=1\) as we predict a single scalar. Thus, the first weight matrix \(W_1\) has shape \(m\times h\), and the second weight matrix \(W_2\) has shape \(h\times c\). In this case, we only consider biases in the hidden layer, \(\mathbf{b}_1\), which is a vector with \(h\) entries.
2.1.1 Feed-forward pass
Now, let’s go through the feed-forward pass of the training data \(X\in\mathbb{R}^{n\times m}\) through our network (a shape-checking sketch follows the list).
The input goes through the first linear layer \(\mathbf{h} \leftarrow XW_1 + \mathbf{b}_1\) with shapes \([n, m] \times [m, h] = [n, h]\)
Then, we apply the activation function \(\hat{\mathbf{h}} \leftarrow \sigma(\mathbf{h})\) with shape \([n, h]\)
Then, we apply the second linear layer \(\mathbf{g} \leftarrow \hat{\mathbf{h}}W_{2}\) with shapes \([n, h] \times [h, c] = [n, c]\)
Finally, we apply the activation function \(\hat{\mathbf{y}} \leftarrow \sigma(\mathbf{g})\) with shape \([n, c]\)
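A short shape-checking sketch of this pass (illustrative sizes and random values are assumptions):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n samples, m features, h hidden neurons, c outputs
n, m, h, c = 8, 5, 3, 1
rng = np.random.default_rng(1)
X = rng.normal(size=(n, m))
W_1 = rng.normal(size=(m, h))
b_1 = np.zeros(h)
W_2 = rng.normal(size=(h, c))

hidden = X @ W_1 + b_1          # [n, m] x [m, h] -> [n, h]
hidden_act = sigmoid(hidden)    # [n, h]
g = hidden_act @ W_2            # [n, h] x [h, c] -> [n, c]
y_hat = sigmoid(g)              # [n, c]
print(hidden.shape, g.shape, y_hat.shape)  # (8, 3) (8, 1) (8, 1)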
2.1.2 Parameter update
We will use the MSE loss function, denoted in matrix representation as \(L = \frac{1}{2n}||Y - \hat{Y}||^2\), and update the parameters \(W_1\), \(\mathbf{b}_1\) and \(W_2\) with the gradients obtained through the chain rule (a sketch of the resulting expressions follows the hints below).
Hint 1: Operations in \((\hat{Y}-Y)\hat{Y}(1-\hat{Y})\) are element-wise multiplications.
Hint 2: Operations in \(\hat{\mathbf{h}}(1-\hat{\mathbf{h}})\) are element-wise multiplications.
Hint 3: The resulting weight updates must have the same dimension as the weight matrices.
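Putting the hints together, one consistent way to write the gradients (a sketch derived from the chain rule, using the notation of the feed-forward pass above) is
\[\begin{equation}
\begin{split}
\delta_{\text{out}} &= (\hat{Y}-Y)\odot \hat{Y}\odot(1-\hat{Y}),\\
\frac{\partial L}{\partial W_2} &= \frac{1}{n}\,\hat{\mathbf{h}}^T \delta_{\text{out}},\\
\delta_{\text{hid}} &= \left(\delta_{\text{out}} W_2^T\right)\odot \hat{\mathbf{h}}\odot(1-\hat{\mathbf{h}}),\\
\frac{\partial L}{\partial W_1} &= \frac{1}{n}\, X^T \delta_{\text{hid}},\qquad
\frac{\partial L}{\partial \mathbf{b}_1} = \frac{1}{n}\sum_{i=1}^{n} \delta_{\text{hid},\,i},
\end{split}
\end{equation}\] where \(\odot\) denotes element-wise multiplication and the resulting shapes match those of \(W_2\in\mathbb{R}^{h\times c}\), \(W_1\in\mathbb{R}^{m\times h}\) and \(\mathbf{b}_1\in\mathbb{R}^{h}\), as required by Hint 3.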
2.2 Universal approximation theorem
While the previous model may seem trivial, it has been proven to be a universal approximator. In particular, the following theorem states it (informal version):
A neural network with a single hidden layer and enough hidden neurons, using a suitable nonlinear activation function, can approximate any continuous function on a compact domain (for example, any function on \([0,1]^n\)) as closely as we want.
This means that, in principle, we could use the previous model for anything, even to replicate an LLM. But of course, the previous is a theoretical result, and such a model is not practical: the width would be enormous and training it could be a nightmare… We will see how to remedy this later in this course. For now, let’s see how to train such a model with Python.
3 Example task: handwritten digits with the MNIST dataset
We test the concepts introduced above using the MNIST dataset. MNIST stands for Modified National Institute of Standards and Technology and the dataset consists of \(28\times28\) images of handwritten digits. Here, we will perform a regression task trying to predict the value of the digit from the image.
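The code below assumes the raw data is available as x_train, y_train, x_test, y_test; one possible way to load it (the original lesson may have used a different loader) is through keras:
import numpy as np
from tensorflow.keras.datasets import mnist

# Load the raw MNIST arrays: images of shape (n, 28, 28) and integer labels
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)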
We will use the following architecture:
Input layer with \(m = 28\times28 = 784\) neurons.
Hidden layer with \(h = 25\) neurons.
Output layer with \(c = 1\) neuron.
Figure code
plot_nn(V=20, H=10, spacing=1, figsize=(6, 5), alpha=0.3)
Let’s have a look at some examples to get a better idea about the task.
Code
k = 10
for i in range(0, 10):
    idx = np.where(y_train == i)[0]  # find indices of i-digit
    fig, ax = plt.subplots(1, k, sharey=True)
    for j in range(0, k):
        ax[j].imshow(x_train[idx[j]])
    plt.show()
Before putting our images through the model, we first need to pre-process the data (a possible sketch follows the list below). We will:
Flatten the images
Normalize them \(\mathbf{x}_i \to \frac{\mathbf{x}_i - \text{mean}(\mathbf{x})}{\text{std}(\mathbf{x})}\)
Because the output of our network comes from a sigmoid activation function with range \((0, 1)\), we will bring the image labels \(y \in \{0,1,2,\dots,9\}\) into the \((0,1)\) range by dividing them by 10.
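A possible sketch of these three steps (it assumes the raw arrays x_train, y_train, x_test, y_test from above; normalizing with the training-set statistics is an assumption):
# Flatten the 28x28 images into vectors of 784 features
X_train = x_train.reshape(x_train.shape[0], -1).astype(float)
X_test = x_test.reshape(x_test.shape[0], -1).astype(float)

# Normalize with the training-set mean and standard deviation
mean, std = X_train.mean(), X_train.std()
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

# Bring the labels into the (0, 1) range of the sigmoid output
Y_train = y_train / 10.0
Y_test = y_test / 10.0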
# NN parameters
m = X_train.shape[1]  # number of input neurons
h = 21                # number of hidden neurons
c = 1                 # number of output neurons
Exercise
Using the theoretical expressions we found above, compute the forward and backward passes for the single-hidden-layer NN (one possible sketch follows the code template below).
### Your code here
def sigmoid(x):
    '''Computes the sigmoid function of an input np.array'''
    return 1 / (1 + np.exp(-x))

def forward(x, w_1, b_1, w_2):
    "Forward pass through our neural network."
    return sigmoid(g), g, h_hat, h

def backward(x, y, y_pred, w_2, h_hat):
    "Backward pass through our neural network."
    return grad_w_1, grad_b_1, grad_w_2
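In case you want to compare against a reference, here is one possible implementation sketch (not necessarily the original solution). It follows the gradient expressions above; the gradients are returned unnormalized, so the 1/n factor is applied later in the update step:
def sigmoid(x):
    '''Computes the sigmoid function of an input np.array'''
    return 1 / (1 + np.exp(-x))

def forward(x, w_1, b_1, w_2):
    "Forward pass through our neural network."
    h = x @ w_1 + b_1             # [n, m] x [m, h] -> [n, h]
    h_hat = sigmoid(h)            # [n, h]
    g = h_hat @ w_2               # [n, h] x [h, c] -> [n, c]
    return sigmoid(g), g, h_hat, h

def backward(x, y, y_pred, w_2, h_hat):
    "Backward pass through our neural network."
    y = y.reshape(-1, 1)                                    # match y_pred's shape [n, c]
    delta_out = (y_pred - y) * y_pred * (1 - y_pred)        # [n, c]
    grad_w_2 = h_hat.T @ delta_out                          # [h, c]
    delta_hid = (delta_out @ w_2.T) * h_hat * (1 - h_hat)   # [n, h]
    grad_w_1 = x.T @ delta_hid                              # [m, h]
    grad_b_1 = delta_hid.sum(axis=0)                        # [h]
    return grad_w_1, grad_b_1, grad_w_2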
Now, we can look at the performance over unseen data from the test set.
Code
Y_pred, _, _, _ = forward(X_test, w_1, b_1, w_2)
loss_test = 0.5 * np.mean((Y_pred.squeeze() - Y_test)**2)
print(f"The test loss is {loss_test:.5f}")
The test loss is 0.00699
The model seems to generalize fairly well, as the performance is comparable to the one obtained in the training set. Indeed, looking at the training losses, we see that the model is barely overfitting as there is almost no difference between the training and validation loss. This is mainly due to the simplicity of the model that we are considering.
We can also pretend for a moment that this is a classification task. This is definitely not how you would frame a classification problem, but we can assign prediction intervals to the MNIST labels and see how we would do.
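One possible sketch of this experiment (it reuses Y_pred from the test evaluation above and the raw labels y_test; rounding to the nearest digit is one choice of prediction interval):
# Turn the regression output into digit predictions by rounding to the
# nearest label and clipping to the valid range
pred_test = np.clip(np.round(Y_pred.squeeze() * 10), 0, 9).astype(int)
true_test = y_test.astype(int)

# Confusion matrix: rows are true digits, columns are predicted digits
conf_matrix = np.zeros((10, 10), dtype=int)
for t, p in zip(true_test, pred_test):
    conf_matrix[t, p] += 1

accuracy = np.mean(pred_test == true_test)
print(f"Accuracy: {accuracy:.3f}")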
As we can see, the confusion matrix has a fairly diagonal structure, with an accuracy far beyond what we would obtain from a random guess! We see, however, that most errors occur between consecutive classes, which is mainly due to rounding errors from the imperfect regression.
Note
We reiterate that this is not the proper way to handle a classification task. This is just an academic experiment to get familiar with the perceptron and see that neural networks are just a bunch of affine transformations.
Let’s see if we can get a better understanding of the model by looking at some predictions:
Code
k = 5
for i in range(10):
    idx = np.where(true_test == i)[0]  # find indices of i-digit
    fig, ax = plt.subplots(1, k, sharey=True)
    for j in range(k):
        title_string = "\n P: " + "{:01d}".format(pred_test[idx[j]])
        ax[j].set_title(title_string)
        ax[j].imshow(X_test[idx[j], :].reshape(28, 28))
    plt.show()
In these examples, we see more clearly that, indeed, most errors occur because the categories are close to each other. For instance, all the errors in the images with 6s are either 5s or 7s. This is one of the main reasons why classification problems are not framed this way; instead, every class is treated as independent from the rest.
Exercise
Set the initial biases to zero and freeze their training. Check the change in the confusion matrix and accuracy.
Exercise
Check and compare the prediction accuracy and confusion matrix for weights and bias initializations taken from:
1. Uniform distribution [-0.5, 0.5]
2. Uniform distribution [0, 1]
3. Normal distribution \({\cal N}(0,1)\)
3.1 Activation functions
So far, we have only used the sigmoid activation function \(\sigma(z) = \frac{1}{1+e^{-z}}\). Other common activation functions are shown below:
Figure: overview of common activation functions.
The loss function should be chosen in accordance with the given activation function!
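For reference, here is a sketch of a few commonly used activation functions (which ones appear in the figure above is an assumption):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), relu(z))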
4 Optimization algorithms
There are many different optimization algorithms that can be used to train neural networks, and choosing the proper algorithm is essential to obtain a performant and well-trained model (Goodfellow, Bengio, and Courville 2016).
In general, optimization algorithms can be divided into two categories: first-order methods, which only use the gradient of the loss function with respect to the model’s parameters, and second-order methods, which also use the second derivative (or Hessian matrix). Second-order methods can be more computationally expensive, but they may also be more effective in certain cases.
4.1 Stochastic gradient descent (SGD)
In standard gradient descent, we compute the gradient of the cost function with respect to the parameters over the entire training dataset. In most cases, this is extremely slow, and it is even intractable for datasets that don’t fit in memory. It also doesn’t allow us to update our model online, i.e., with new examples on-the-fly.
In SGD, we use mini-batches composed of a few training samples, and the model’s parameters are updated based on the average loss across the samples in each mini-batch. This way, SGD makes faster progress through the training dataset, and it can also make use of vectorized operations, which makes the training process more efficient.
4.2 Momentum
Momentum optimization is an algorithm that can be used to improve SGD. It works by adding a fraction \(\gamma\) of the previous parameter update to the current one, which helps the model make faster progress in the right direction and avoid getting stuck in local minima. This fraction is called the momentum coefficient, and it is a hyperparameter that can be adjusted according to the problem.
The momentum algorithm accumulates a history of the past gradients and continues to move in their direction:
\[\begin{equation}
\begin{split}
g_t &= \frac{\partial L(\theta_{t-1})}{\partial \theta}\\
v_t &= \gamma v_{t-1} - \eta g_t \\
\theta_t &= \theta_{t-1} + v_t,
\end{split}
\end{equation}\] where \(t\) enumerates the training epochs, \(\theta\) are the trainable parameters of the neural network, \(\gamma\) is the momentum coefficient and \(\eta\) is the learning rate.
The velocity \(v\) accumulates the gradients of the loss function \(L\); the larger \(\gamma\) is with respect to \(\eta\), the more previous gradients affect the current direction. In the standard SGD algorithm, the update size depends only on the gradient and the learning rate. With momentum, it also depends on how large and how aligned consecutive gradients are. In addition to speeding up training, momentum optimization can also help the model generalize better to new data.
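A minimal sketch of the momentum update on a toy quadratic loss (the loss, parameter values and hyperparameters are illustrative assumptions):
import numpy as np

eta, gamma = 0.1, 0.9           # learning rate and momentum coefficient
theta = np.zeros(3)             # trainable parameters
v = np.zeros_like(theta)        # velocity

def grad_L(theta):
    # Gradient of the toy loss L = 0.5 * ||theta - 1||^2
    return theta - 1.0

for t in range(100):
    g = grad_L(theta)
    v = gamma * v - eta * g     # accumulate past gradients
    theta = theta + v           # parameter update

print(theta)                    # approaches the minimum at [1, 1, 1]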
4.3 Adaptive Gradient (Adagrad)
The Adaptive Gradient algorithm (Duchi, Hazan, and Singer 2011) is based on the idea of adapting the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones.
The AdaGrad algorithm works by accumulating the squares of the gradients for each parameter, and then scaling the learning rate for each parameter by the inverse square root of this sum. This has the effect of reducing the learning rate for parameters that have been updated frequently, and increasing the learning rate for parameters that have been updated infrequently.
\[\begin{equation}
\begin{split}
\Delta \theta &= - \frac{\eta}{\sqrt{diag( \epsilon\mathbb{1} + G_t )}} \odot g_t,\\
g_t &= \frac{\partial L(\theta_{t-1})}{\partial \theta}\\
G_t &= \sum_{\tau = 1}^{t} g_\tau g_\tau^T.
\end{split}
\end{equation}\] where \(\odot\) denotes element-wise multiplication and \(\epsilon \ll 1\) is a regularizing parameter that prevents division by zero.
Adagrad reduces the need to manually tune the learning rate: we typically set \(\eta \ll 1\) initially, and it is effectively adapted during the training process. However, the algorithm is quite sensitive to the choice of the initial learning rate, and it may require careful tuning to achieve good results.
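The diagonal form of the update above can be sketched as follows (same toy loss as before; the values are illustrative):
import numpy as np

eta, eps = 0.5, 1e-8
theta = np.zeros(3)
G_diag = np.zeros_like(theta)   # running sum of squared gradients (diagonal of G_t)

def grad_L(theta):
    # Gradient of the toy loss L = 0.5 * ||theta - 1||^2
    return theta - 1.0

for t in range(200):
    g = grad_L(theta)
    G_diag += g * g                               # accumulate squared gradients
    theta -= eta / np.sqrt(eps + G_diag) * g      # per-parameter scaled step

print(theta)                                      # approaches the minimum at [1, 1, 1]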
4.4 Adaptive Moment Estimation (Adam)
Adam algorithm (Kingma and Ba 2014) combines the ideas of momentum optimization and Adagrad to make more stable updates and achieve faster convergence.
Like momentum optimization, Adam uses an exponentially decaying average of the previous gradients to determine the direction of the update. This helps the model to make faster progress in the right direction and avoid oscillations. Like AdaGrad, Adam also scales the learning rate for each parameter based on the inverse square root of an exponentially decaying average of the squared gradients. This has the effect of reducing the learning rate for parameters that have been updated frequently, and increasing the learning rate for parameters that have been updated infrequently.
Adam uses exponentially decaying moving averages of the gradient and its square:
\[\begin{equation}
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t,
\end{equation}\] where \[\begin{equation}
\begin{split}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t\\
\hat{m}_t &= \frac{m_t}{1-\beta^t_1}\\
\hat{v}_t &= \frac{v_t}{1-\beta^t_2},
\end{split}
\end{equation}\] where \(g_t\) is the gradient at step \(t\), and \(\hat{m}_t\), \(\hat{v}_t\) are the bias-corrected estimates of the first and second moments of the gradient.
The authors suggest setting \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 10^{-8}\).
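A minimal sketch of the Adam update with the suggested hyperparameters (the toy loss and learning rate are illustrative assumptions):
import numpy as np

eta = 0.1                        # learning rate (illustrative)
beta_1, beta_2, eps = 0.9, 0.999, 1e-8
theta = np.zeros(3)
m = np.zeros_like(theta)         # first moment estimate
v = np.zeros_like(theta)         # second moment estimate

def grad_L(theta):
    # Gradient of the toy loss L = 0.5 * ||theta - 1||^2
    return theta - 1.0

for t in range(1, 501):
    g = grad_L(theta)
    m = beta_1 * m + (1 - beta_1) * g          # EMA of gradients
    v = beta_2 * v + (1 - beta_2) * g * g      # EMA of squared gradients
    m_hat = m / (1 - beta_1**t)                # bias correction
    v_hat = v / (1 - beta_2**t)
    theta -= eta / (np.sqrt(v_hat) + eps) * m_hat

print(theta)                                   # approaches the minimum at [1, 1, 1]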
Exercise
Based on the full gradient loop we used above, implement the SGD algorithm over the MNIST dataset. A reminder of what the code should look like:
Set batch size, learning rate, epochs and any other training hyperparameter.
Loop over training epochs:
Loop over number of batches:
Sample a random batch from the training set. For instance: batch = X_train[np.random.randint(0, X_train.shape[0], batch_size)]
Forward pass, backward, network update.
Track the test and training loss
Bonus: Implement the Adam optimizer in the previous loop.
Solution (pseudocode)
from tqdm import tqdm

# Define the batch size and calculate the number of batches
batch_size = 64
number_batches = int(X_train.shape[0] / batch_size)

# Let's do a single epoch:
n_epoch = 1

# We loop over epochs
for epoch in range(n_epoch):
    # We loop over the number of batches
    for _ in tqdm(range(number_batches)):
        # Sample randomly one batch
        idx_batch = np.random.randint(0, X_train.shape[0], batch_size)
        X_batch = X_train[idx_batch]
        Y_batch = Y_train[idx_batch]

        # Forward pass only over the batch
        Y_pred, g, h_hat, h = forward(X_batch, w_1, b_1, w_2)
        loss_train = 0.5 * np.mean((Y_pred.squeeze() - Y_batch)**2)
        loss_train_vs_epoch.append(loss_train)

        # Update parameters based on the loss of the current batch
        grad_w_1, grad_b_1, grad_w_2 = backward(X_batch, Y_batch, Y_pred, w_2, h_hat)
        w_1 -= eta / n_train * grad_w_1
        b_1 -= eta / n_train * grad_b_1
        w_2 -= eta / n_train * grad_w_2
References
Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” J. Mach. Learn. Res. 12 (July): 2121–59. https://doi.org/10.5555/1953048.2021068.