NN from scratch

Authors

Gorka Muñoz-Gil, Marcin Płodzień and Borja Requena


In the next few lessons we will dive into the world of neural networks and deep learning. We start with the basics: understanding how a very simple neural network works and the different algorithms that exist to train it.

1 Sigmoid perceptron

In this section we will implement a training algorithm for a single perceptron with a sigmoid activation function. A perceptron looks like this:

Figure code
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def plot_nn(V=5, H=1, r=0.12, spacing = 4, figsize = (6,4), alpha = 0.8):
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_aspect('equal')
    ax.axis('off')

    # x positions for layers
    xs = {'visible': 0.0, 'hidden': 1.5, 'output': 3.0}

    # vertical spacing scales with radius
    spacing = spacing * r

    # y positions depend on radius and number of neurons
    y_vis = [(i - (V-1)/2) * spacing for i in range(V)]
    y_hid = [(i - (H-1)/2) * spacing for i in range(H)]
    y_out = [0.0]

    # Draw nodes
    def draw_nodes(x, ys, face, edge):
        for y in ys:
            ax.add_patch(Circle((x, y), r, facecolor=face, edgecolor=edge, lw=2))
        # ax.text(x, max(ys)+0.6, ha='center', va='bottom', fontsize=12, weight='bold')

    draw_nodes(xs['visible'], y_vis, face='#e6f0ff', edge='#1f4acc')
    draw_nodes(xs['hidden'],  y_hid, face='#e8ffe6', edge='#2a8f2a')
    draw_nodes(xs['output'],  y_out, face='#fff0e6', edge='#c75e1a')

    

    # Connections
    for y1 in y_vis:
        for y2 in y_hid:
            ax.plot([xs['visible']+r, xs['hidden']-r], [y1, y2], lw=1.3, alpha=alpha, c = 'k', zorder = -1)

    for y2 in y_hid:
        ax.plot([xs['hidden']+r, xs['output']-r], [y2, y_out[0]], lw=1.5, alpha=alpha, c = 'k', zorder = -1)

    ax.text(xs['visible'], min(y_vis)-0.5, r'input $x$', ha='center', fontsize=10)
    ax.text(xs['hidden'],  min(y_hid)-0.5, r'hidden $z$', ha='center', fontsize=10)
    ax.text(xs['output'],  -0.5, r'output $\hat{y}$', ha='center', fontsize=10)

    ax.set_xlim(-0.6, 3.6)
    ax.set_ylim(min(y_vis[0], y_hid[0])-1, max(y_vis[-1], y_hid[-1])+1)
    plt.tight_layout()
    plt.show()

plot_nn(V=5)

We consider a data set of \(n\) tuples \((x_i, y_i)\) that we denote as \(X\in\mathbb{R}^{n\times m}\) and \(Y\in\mathbb{R}^n\). Here, \(m\) denotes the number of features in our samples. The perceptron consists of:

  1. A linear transformation \(f: \mathbb{R}^m \mapsto \mathbb{R}^h\) of the form \(z_i = x_i^T W + \mathbf{b}\,\). Here, \(W\in\mathbb{R}^{m\times h}\) are the weights, and \(\mathbf{b}\in\mathbb{R}^h\) are the biases.
  2. A nonlinear transformation \(\hat{y}_i = \sigma(z_i)\), in this case: the sigmoid function \(\sigma(z_i) = \frac{1}{1 + e^{-z_i}}\).

1.1 Sigmoid perceptron update

The training of a perceptron consists of the iterative update of its parameters, \(W\) and \(\mathbf{b}\), in order to minimize the loss function \(L\). Here, we will consider the mean-squared error:

\[\begin{equation} \begin{split} L & = \frac{1}{N}\sum_{i=1}^N L_i\\ L_i & = \frac{1}{2}(y_i - \hat{y}_i)^2, \end{split} \end{equation}\]

where \(\hat{y}_i\) is our prediction and \(y_i\) the ground truth.

We update the weights \(W\) and biases \(\mathbf{b}\) with a gradient descent procedure:

\[\begin{equation} \begin{split} W & \leftarrow W - \eta \frac{\partial L}{\partial W} = W - \frac{\eta}{N} \sum_{i=1}^N \frac{\partial L_i}{\partial W}\\ \mathbf{b} & \leftarrow \mathbf{b} - \eta \frac{\partial L}{\partial \mathbf{b}} = \mathbf{b} - \frac{\eta}{N} \sum_{i=1}^N \frac{\partial L_i}{\partial \mathbf{b}}, \end{split} \end{equation}\]

where \(\eta\) is the learning rate.

In this case, we can obtain analytical expressions for the gradients:

\[\begin{equation} \begin{split} \frac{\partial L_i}{\partial W} & = (\hat{y}_i - y_i)\frac{\partial \hat{y}_i}{\partial W}\\ \frac{\partial L_i}{\partial \mathbf{b}} & = (\hat{y}_i - y_i)\frac{\partial \hat{y}_i}{\partial \mathbf{b}}\\ \hat{y}_i & = \sigma(z_i) \\ z_i & = x_i^T W + \mathbf{b} \end{split} \end{equation}\]

With the chain rule we have:

\[\begin{equation} \begin{split} \frac{\partial \hat{y}_i}{\partial W} & = \frac{\partial \sigma(z_i)}{\partial z_i}\frac{\partial z_i}{\partial W} \\ \frac{\partial \hat{y}_i}{\partial \mathbf{b}} & = \frac{\partial \sigma(z_i)}{\partial z_i}\frac{\partial z_i}{\partial \mathbf{b}} \end{split} \end{equation}\]

Exercise

Show that the update rules for the weights \(W\) and bias \(\mathbf{b}\) are:

\[\begin{equation} \begin{split} W & \leftarrow W - \frac{\eta }{N}\sum_{i=1}^N (\hat{y}_i - y_i) \hat{y}_i(1-\hat{y}_i)x_i\\ \mathbf{b} & \leftarrow \mathbf{b} - \frac{\eta }{N}\sum_{i=1}^N (\hat{y}_i - y_i) \hat{y}_i(1-\hat{y}_i) \end{split} \end{equation}\]
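To make this rule concrete, here is a minimal NumPy sketch of gradient descent for a single sigmoid perceptron on toy data (the data, sizes and learning rate are illustrative choices of ours, not part of the lesson code):

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# Toy data: n samples with m features and targets in (0, 1) -- illustrative only
rng = np.random.default_rng(0)
n, m = 100, 4
X = rng.normal(size=(n, m))
y = sigmoid(X @ rng.normal(size=m))

# Perceptron parameters: weights w and bias b
w, b = np.zeros(m), 0.0
eta = 1.0

for _ in range(200):
    y_hat = sigmoid(X @ w + b)
    # (y_hat - y) * y_hat * (1 - y_hat) is the per-sample factor from the update rule above
    delta = (y_hat - y) * y_hat * (1 - y_hat)
    w -= eta/n * X.T @ delta
    b -= eta/n * delta.sum()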

2 Single-layer neural network

Here, we will consider a neural network with a single hidden layer and sigmoid activation functions.

Figure code
plot_nn(V=5, H = 4)

2.1 Network definition

In a neural network with a single hidden layer, we have two perceptrons: one between the input and the hidden layer, and one between the hidden layer and the output.

The input layer has the same size as the number of features in our data, i.e., \(m\) neurons. Then, the hidden layer has \(h\) neurons, and the output layer has as many neurons as classes \(c\). In a regression task, \(c=1\) as we predict a single scalar. Thus, the first weight matrix \(W_1\) has shape \(m\times h\), and the second weight matrix \(W_2\) has shape \(h\times c\). In this case, we only consider biases in the hidden layer, \(\mathbf{b}_1\), which is a vector with \(h\) entries.

2.1.1 Feed-forward pass

Now, let’s go through the feed-forward pass of the training data \(X\in\mathbb{R}^{n\times m}\) through our network.

  1. The input goes through the first linear layer \(\mathbf{h} \leftarrow XW_1 + \mathbf{b}_1\) with shapes \([n, m] \times [m, h] = [n, h]\)
  2. Then, we apply the activation function \(\hat{\mathbf{h}} \leftarrow \sigma(\mathbf{h})\) with shape \([n, h]\)
  3. Then, we apply the second linear layer \(\mathbf{g} \leftarrow \hat{\mathbf{h}}W_{2}\) with shapes \([n, h] \times [h, c] = [n, c]\)
  4. Finally, we apply the activation function \(\hat{\mathbf{y}} \leftarrow \sigma(\mathbf{g})\) with shape \([n, c]\)
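As a quick sanity check of these shapes (a sketch of ours, not part of the lesson code), we can push random arrays through the two layers:

import numpy as np

# Arbitrary sizes, chosen only to illustrate the shapes
n, m, h, c = 8, 5, 3, 1
X = np.random.rand(n, m)
W_1, b_1, W_2 = np.random.rand(m, h), np.random.rand(h), np.random.rand(h, c)

H_hat = 1/(1 + np.exp(-(X @ W_1 + b_1)))   # hidden activations, shape (n, h)
Y_hat = 1/(1 + np.exp(-(H_hat @ W_2)))     # network output, shape (n, c)
print(H_hat.shape, Y_hat.shape)            # (8, 3) (8, 1)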

2.1.2 Parameter update

We will use the MSE loss function, written in matrix representation as \(L = \frac{1}{2n}\|Y - \hat{Y}\|^2\).

The parameter update rule is \[\begin{equation} \begin{split} W_1 & = W_1 - \frac{\eta}{n}\frac{\partial L}{\partial W_1} \\ \mathbf{b}_1 & = \mathbf{b}_1 - \frac{\eta}{n}\frac{\partial L}{\partial \mathbf{b}_1} \\ W_2 & = W_2 - \frac{\eta}{n}\frac{\partial L}{\partial W_2}. \end{split} \end{equation}\]

Let us calculate gradients of the loss function with respect to \(W_1\), \(\mathbf{b}_1\) and \(W_2\) using the chain rule:

\[\begin{equation} \begin{split} \frac{\partial L}{\partial W_{2}} & = \frac{\partial L}{\partial \hat{Y}}\frac{\partial \hat{Y}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial W_{2}} \\ \frac{\partial L}{\partial W_{1}} & = \frac{\partial L}{\partial \hat{Y}}\frac{\partial \hat{Y}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \hat{\mathbf{h}}}\frac{\partial \hat{\mathbf{h}}}{\partial \mathbf{h}}\frac{\partial \mathbf{h}}{\partial W_{1}}\\ \frac{\partial L}{\partial \mathbf{b}_{1}} & = \frac{\partial L}{\partial \hat{Y}}\frac{\partial \hat{Y}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \hat{\mathbf{h}}}\frac{\partial \hat{\mathbf{h}}}{\partial \mathbf{h}}\frac{\partial \mathbf{h}}{\partial \mathbf{b}_{1}} \end{split} \end{equation}\]

We can write down every term (for compactness, we drop the \(1/n\) prefactor of the loss from the gradients here; it reappears as the \(\eta/n\) factor in the update rules):

\[\begin{equation} \begin{split} \frac{\partial L}{\partial \hat{Y}} & = \hat{Y} - Y \\ \frac{\partial \hat{Y}}{\partial \mathbf{g}} & = \hat{Y}(1-\hat{Y}) \\ \frac{\partial \mathbf{g}}{\partial W_{2}} & = \hat{\mathbf{h}} \\ \frac{\partial \mathbf{g}}{\partial \hat{\mathbf{h}}} & = W_{2} \\ \frac{\partial \hat{\mathbf{h}}}{\partial \mathbf{h}} & = \hat{\mathbf{h}}(1-\hat{\mathbf{h}}) \\ \frac{\partial \mathbf{h}}{\partial W_1} & = X\\ \frac{\partial \mathbf{h}}{\partial \mathbf{b}_1} & = \mathbb{1} \end{split} \end{equation}\]

Exercise

Show explicitly that \(Q_1\), and \(Q_2\) read: \[\begin{equation} \begin{split} Q_2 & \equiv \frac{\partial L}{\partial \hat{Y}}\frac{\partial \hat{Y}}{\partial \mathbf{g}} = (\hat{Y}-Y)\hat{Y}(1-\hat{Y}) \\ Q_1 & \equiv \frac{\partial L}{\partial \hat{Y}}\frac{\partial \hat{Y}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \hat{\mathbf{h}}}\frac{\partial \hat{\mathbf{h}}}{\partial \mathbf{h}} = Q_2 W_{2}\hat{\mathbf{h}}(1-\hat{\mathbf{h}}) \end{split} \end{equation}\]

Exercise

Show that the update rules for the weights \(W_1\), \(W_2\) and the bias \(\mathbf{b}_1\) are

\(W_2 = W_2 - \frac{\eta}{n}\hat{\mathbf{h}}^TQ_2\)

\(\mathbf{b}_1 = \mathbf{b}_1 - \frac{\eta}{n}\mathbb{1}^TQ_1\)

\(W_1 = W_1 - \frac{\eta}{n}X^TQ_1\)

where \(\mathbb{1}^TQ_1\) sums the rows of \(Q_1\) over the samples.

Hint 1: Operations in \((\hat{Y}-Y)\hat{Y}(1-\hat{Y})\) are element-wise multiplications.

Hint 2: Operations in \(\hat{\mathbf{h}}(1-\hat{\mathbf{h}})\) are element-wise multiplications.

Hint 3: The resulting weight updates must have the same dimension as the weight matrices.
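These expressions can also be verified numerically with a finite-difference check. Below is a minimal, self-contained sketch (all names and sizes are our own illustration) comparing the analytical gradient with respect to \(W_2\) against a numerical estimate:

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def loss(X, Y, W_1, b_1, W_2):
    H_hat = sigmoid(X @ W_1 + b_1)
    Y_hat = sigmoid(H_hat @ W_2)
    return 0.5*np.mean((Y_hat - Y)**2)

# Tiny random network and data
rng = np.random.default_rng(0)
n, m, h, c = 6, 4, 3, 1
X, Y = rng.normal(size=(n, m)), rng.random((n, c))
W_1, b_1, W_2 = rng.normal(size=(m, h)), rng.normal(size=h), rng.normal(size=(h, c))

# Analytical gradient with respect to W_2 (Q_2 from above, with the 1/n prefactor restored)
H_hat = sigmoid(X @ W_1 + b_1)
Y_hat = sigmoid(H_hat @ W_2)
Q_2 = (Y_hat - Y)*Y_hat*(1 - Y_hat)
grad_w2 = H_hat.T @ Q_2 / n

# Finite-difference estimate for a single entry of W_2
eps = 1e-6
W_2p = W_2.copy()
W_2p[0, 0] += eps
numerical = (loss(X, Y, W_1, b_1, W_2p) - loss(X, Y, W_1, b_1, W_2))/eps
print(grad_w2[0, 0], numerical)  # the two numbers should closely agree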

2.2 Universal approximation theorem

While the previous model may seem trivial, it has been proven to be a universal approximator. In particular, the following theorem states it (informal version):

A neural network with a single hidden layer and enough hidden neurons, using a suitable nonlinear activation function, can approximate any continuous function on a compact domain (for example, any function on \([0,1]^n\)) as closely as we want.

This means that, in principle, we could use the previous model for anything, even to replicate an LLM. But of course, this is a theoretical result, and such a model is not practical: the width would be enormous and training it could be a nightmare… We will see how to remedy this later in this course. For now, let's see how to train such a model with Python.

3 Example task: handwritten digits with the MNIST dataset

We test the concepts introduced above using the MNIST dataset. MNIST stands for Modified National Institute of Standards and Technology and the dataset consists of \(28\times28\) images of handwritten digits. Here, we will perform a regression task trying to predict the value of the digit from the image.

We will use the following architecture:

  1. Input layer with \(m = 28\times28 = 784\) neurons.
  2. Hidden layer with \(h = 21\) neurons.
  3. Output layer with \(c = 1\) neuron.
Figure code
plot_nn(V=20, H = 10, spacing = 1, figsize = (6,5), alpha = 0.3)

3.0.1 Process the data

We start by importing the MNIST dataset

import numpy as np
from torchvision import datasets
from tqdm import tqdm

train_dataset = datasets.MNIST(root='./data', train=True, download=True)
test_dataset  = datasets.MNIST(root='./data', train=False, download=True)

x_train, y_train = train_dataset.data.numpy(), train_dataset.targets.numpy()
x_test, y_test = test_dataset.data.numpy(), test_dataset.targets.numpy()

# split train and validation
np.random.seed(7)
idx_rnd = np.random.permutation(np.arange(x_train.shape[0]))
split = int(0.2*x_train.shape[0]) # Take 20% for validation
idx_val, idx_train = idx_rnd[:split], idx_rnd[split:]
x_val, y_val = x_train[idx_val], y_train[idx_val]
x_train, y_train = x_train[idx_train], y_train[idx_train]

n_train, n_val, n_test = x_train.shape[0], x_val.shape[0], x_test.shape[0]

Let’s have a look at some examples to get a better idea about the task.

Code
k = 10
for i in range(0,10):
    idx = np.where(y_train == i)[0] # find indices of i-digit
    fig, ax = plt.subplots(1, k, sharey=True)
    for j in range(0, k):
        ax[j].imshow(x_train[idx[j]])
    plt.show()

Before putting our images through the model, we first need to pre-process the data. We will:

  1. Flatten the images
  2. Normalize them \(\mathbf{x}_i \to \frac{\mathbf{x}_i - \text{mean}(\mathbf{x}_i)}{\text{std}(\mathbf{x}_i)}\)
  3. Because the output of our network comes from a sigmoid activation function, whose range is \((0, 1)\), we will bring the image labels \(y \in \{0,1,2,\dots,9\}\) into this range by dividing them by 10.
# Flatten the images
X_train = x_train.reshape(n_train, -1)
X_val = x_val.reshape(n_val, -1)
X_test = x_test.reshape(n_test, -1)

# Normalize the data
def normalize(x):
    return (x - np.mean(x, axis=1)[:, None])/np.std(x, axis=1)[:, None]

X_train = normalize(X_train)
X_val = normalize(X_val)
X_test = normalize(X_test)

# Bring the targets in range
Y_train = y_train/10.0
Y_val = y_val/10.0
Y_test = y_test/10.0
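
As a quick sanity check (ours, not from the original notebook), each flattened image should now have approximately zero mean and unit standard deviation:

print(X_train.mean(axis=1)[:3])  # close to 0 for each image
print(X_train.std(axis=1)[:3])   # close to 1 for each image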

3.0.2 Define the model

Let’s define now the neural network parameters

# NN parameters
m = X_train.shape[1] # number of input neurons
h = 21               # number of hidden neurons
c = 1                # number of output neurons
Exercise

Using the theoretical expression we found above, compute the forward and backward passes for the single layer NN.

### Your code here
def sigmoid(x):
    '''Computes the sigmoid function of an input np.array'''
    return 1/(1+np.exp(-x))

def forward(x, w_1, b_1, w_2):
    "Forward pass through our neural network."
    
    return sigmoid(g), g, h_hat, h

def backward(x, y, y_pred, w_2, h_hat):
    "Backward pass through our neural network."

    return grad_w_1, grad_b_1, grad_w_2
Solution
def forward(x, w_1, b_1, w_2):
    "Forward pass through our neural network."
    h = x @ w_1 + b_1
    h_hat = sigmoid(h)
    g = h_hat @ w_2
    return sigmoid(g), g, h_hat, h

def sigmoid(x):
    return 1/(1+np.exp(-x))

def backward(x, y, y_pred, w_2, h_hat):
    "Backward pass through our neural network."
    Q_2 = (y_pred-y[:, None])*y_pred*(1-y_pred)
    Q_1 = Q_2 @ w_2.T*h_hat*(1-h_hat)

    grad_w_1 = x.T @ Q_1
    grad_b_1 = np.sum(Q_1, axis=0)
    grad_w_2 = h_hat.T @ Q_2

    return grad_w_1, grad_b_1, grad_w_2

3.0.3 Train the model

We can now train the model!

# Training parameters
eta = 50      # learning rate
n_epoch = 500 # training epochs

# Initialize the weights randomly
np.random.seed(0)
w_1 = 1*(np.random.rand(m, h) - 0.5)  
w_2 = 1*(np.random.rand(h, c) - 0.5)  
b_1 = 1*(np.random.rand(h)   - 0.5)
loss_train_vs_epoch = []
loss_val_vs_epoch = []

for epoch in tqdm(range(n_epoch)):
    # Forward pass
    Y_pred, g, h_hat, h = forward(X_train, w_1, b_1, w_2)    
    loss_train = 0.5*np.mean((Y_pred.squeeze() - Y_train)**2)    
    loss_train_vs_epoch.append(loss_train)

    # Backward pass
    grad_w_1, grad_b_1, grad_w_2 = backward(X_train, Y_train, Y_pred, w_2, h_hat)

    # Update parameters 
    w_1 -= eta/n_train*grad_w_1 
    b_1 -= eta/n_train*grad_b_1
    w_2 -= eta/n_train*grad_w_2

    # Validate the performance
    Y_pred, _, _, _ = forward(X_val, w_1, b_1, w_2)
    loss_val = 0.5*np.mean((Y_pred.squeeze() - Y_val)**2)    
    loss_val_vs_epoch.append(loss_val)
Code
print(f"Train and validation loss of {loss_train:.5f} and {loss_val:.5f}, respectively")
Train and validation loss of 0.00625 and 0.00735, respectively
Code
plt.figure(figsize=(6, 4))
plt.plot(loss_val_vs_epoch, label = "validation")
plt.plot(loss_train_vs_epoch, label = "training")
plt.grid(alpha=0.2)
plt.title("MNIST")
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("Loss (MSE)", fontsize=14)
plt.legend()
plt.tick_params(labelsize=12);

Now, we can look at the performance over unseen data from the test set.

Code
Y_pred, _, _, _ = forward(X_test, w_1, b_1, w_2)
loss_test = 0.5*np.mean((Y_pred.squeeze() - Y_test)**2)
print(f"The test loss is {loss_test:.5f}")
The test loss is 0.00699

The model seems to generalize fairly well, as the performance is comparable to the one obtained in the training set. Indeed, looking at the training losses, we see that the model is barely overfitting as there is almost no difference between the training and validation loss. This is mainly due to the simplicity of the model that we are considering.

We can also pretend for a moment that this is a classification task. This is definitely not how you would frame a classification problem, but we can assign prediction intervals to the MNIST labels and see how we would do.
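The helper get_confusion_matrix used below is not defined in this section; a minimal implementation consistent with how it is called (rows are predicted labels, columns are true labels) could look like this, though the version used to produce the figures may differ:

def get_confusion_matrix(pred, true, n_classes=10):
    """Counts how often each true label (columns) is predicted as each label (rows)."""
    conf_mat = np.zeros((n_classes, n_classes), dtype=int)
    for p, t in zip(pred, true):
        # Clip predictions that round outside the label range (e.g. a rounded 10)
        conf_mat[min(max(p, 0), n_classes - 1), t] += 1
    return conf_mat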

# Train set prediction
pred_train, _, _, _ = forward(X_train, w_1, b_1, w_2)
pred_train = np.around(10*pred_train).astype(int).squeeze()
true_train = (10*Y_train).astype(int)
conf_mat_train = get_confusion_matrix(pred_train, true_train)
accuracy_train = (pred_train == true_train).mean()

# Test set prediction
pred_test, _, _, _ = forward(X_test, w_1, b_1, w_2)
pred_test = np.around(10*pred_test).astype(int).squeeze()
true_test = (10*Y_test).astype(int)
conf_mat_test = get_confusion_matrix(pred_test, true_test)
accuracy_test = (pred_test == true_test).mean()
Code
fig, ax = plt.subplots(1, 2, sharey=True, constrained_layout=True)
ax[0].imshow(conf_mat_train)
ax[0].set_title("Train | Accuracy : " + "{:2.2f}".format(accuracy_train*100) + "%")
ax[0].set_xlabel("True label", fontsize=14)
ax[0].set_ylabel("Predicted label", fontsize=14)
ax[0].set_xticks(np.arange(0, 10))
ax[0].set_yticks(np.arange(0, 10))

ax[1].imshow(conf_mat_test)
ax[1].set_title("Test | Accuracy : " + "{:2.2f}".format(accuracy_test*100) + "%")
ax[1].set_xlabel("True label", fontsize=14)
ax[1].set_xticks(np.arange(0, 10))
ax[1].set_yticks(np.arange(0, 10));

As we can see, the confusion matrix has a quite diagonal structure, with an accuracy far beyond what we would obtain from a random guess! We see, however, that most errors occur between consecutive classes, which is mainly due to rounding errors from the imperfect regression.

Note

We reiterate that this is not the proper way to handle a classification task. This is just an academic experiment to get familiar with the perceptron and see that neural networks are just a bunch of affine transformations composed with simple nonlinearities.

Let’s see if we can get a better understanding of the model by looking at some predictions:

Code
k = 5
for i in range(10):
    idx = np.where(true_test == i)[0] # find indices of i-digit
    fig, ax = plt.subplots(1, k, sharey=True)
    for j in range(k):
        title_string = "\n P: " + "{:01d}".format(pred_test[idx[j]])
        ax[j].set_title(title_string)
        ax[j].imshow(X_test[idx[j], :].reshape(28, 28))
    # plt.show()

In these examples, we see more clearly that, indeed, most errors occur due to categories being close to each other. For instance, all the errors in the images with 6s are either 5s or 7s. This is one of the main reasons why classification problems are not framed this way; instead, every class is treated as independent of the rest.

Exercise

Set the initial biases to zero and freeze their training. Check the change in the confusion matrix and accuracy.

Exercise

Check and compare the prediction accuracy and confusion matrix for weights and biases initialized from:

  1. A uniform distribution on \([-0.5, 0.5]\)
  2. A uniform distribution on \([0, 1]\)
  3. A normal distribution \({\cal N}(0,1)\)

3.1 Activation functions

So far, we have only used the sigmoid activation function \(\sigma(z) = \frac{1}{1+e^{-z}}\). Other common activation functions are shown in the figure below:

[Figure: overview of common activation functions]

The loss function should be chosen according to the activation function used!
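
The figure itself is not reproduced here, but as a rough sketch (our own selection, not necessarily the functions shown in the original figure), a few common activation functions can be implemented as:

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha*z)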

4 Optimization algorithms

There are many different optimization algorithms that can be used to train neural networks, and choosing the proper algorithm is essential to obtain a performant and well-trained model (Goodfellow, Bengio, and Courville 2016).

In general, optimization algorithms can be divided into two categories: first-order methods, which only use the gradient of the loss function with respect to the model’s parameters, and second-order methods, which also use the second derivative (or Hessian matrix). Second-order methods can be more computationally expensive, but they may also be more effective in certain cases.

4.1 Stochastic gradient descent (SGD)

In standard gradient descent, we compute the gradient of the cost function with respect to the parameters using the entire training dataset. In most cases, this is extremely slow, and it becomes intractable for datasets that do not fit in memory. It also doesn't allow us to update our model online, i.e., with new examples on-the-fly.

In stochastic gradient descent, we use mini-batches comprised of a few training samples, and the model's parameters are updated based on the average loss across the samples in each mini-batch. This way, SGD is able to make faster progress through the training dataset, and it can also make use of vectorized operations, which makes the training process more efficient.

4.2 Momentum

Momentum optimization is an algorithm that can be used to improve SGD. It works by adding a fraction \(\gamma\) of the previous parameter update to the current one, which helps the model make faster progress in the right direction and avoid getting stuck in local minima. This fraction is called the momentum coefficient, and it is a hyperparameter that can be adjusted according to the problem.

The momentum algorithm accumulates a history of the past gradients and continues to move in their direction:

\[\begin{equation} \begin{split} g_t &= \frac{\partial L(\theta_{t-1})}{\partial \theta}\\ v_t &= \gamma v_{t-1} - \eta g_t \\ \theta_t &= \theta_{t-1} + v_t, \end{split} \end{equation}\] where \(t\) enumerates the training epochs, \(\theta\) are the trainable parameters of the neural network, \(\gamma\) is the momentum coefficient and \(\eta\) is the learning rate.

The velocity \(v\) accumulates the gradient of the loss function \(L\); the larger \(\gamma\) is with respect to \(\eta\), the more previous gradients affect the current direction. In the standard SGD algorithm, the update size depends only on the gradient and the learning rate. With momentum, it also depends on how large and how aligned consecutive gradients are. In addition to speeding up training, momentum optimization can also help the model to generalize better to new data.
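
As a minimal sketch (our own illustration, not part of the lesson code), one SGD-with-momentum step for a parameter array could look like:

import numpy as np

def momentum_step(theta, v, grad, eta=0.1, gamma=0.9):
    """One momentum update; theta, v and grad are arrays of the same shape."""
    v = gamma*v - eta*grad   # accumulate a decaying history of past gradients
    theta = theta + v        # move along the accumulated velocity
    return theta, v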

4.3 Adaptive Gradient (Adagrad)

The Adaptive Gradient algorithm (Duchi, Hazan, and Singer 2011) is based on the idea of adapting the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters.

The AdaGrad algorithm works by accumulating the squares of the gradients for each parameter, and then scaling the learning rate for each parameter by the inverse square root of this sum. This has the effect of reducing the learning rate for parameters that have been updated frequently, and increasing the learning rate for parameters that have been updated infrequently.

The update rule for the Adagrad algorithm reads

\[\begin{equation} \begin{split} \theta_{t+1} & = \theta_t + \Delta\theta, \end{split} \end{equation}\]

\[\begin{equation} \begin{split} \Delta \theta &= - \frac{\eta}{\sqrt{\text{diag}(\epsilon\mathbb{1} + G_t)}} \odot g_t,\\ g_t &= \frac{\partial L(\theta_{t-1})}{\partial \theta},\\ G_t &= \sum_{\tau = 1}^{t} g_\tau g_\tau^T, \end{split} \end{equation}\] where \(\odot\) denotes element-wise multiplication and the square root and division are applied element-wise. Here, \(\epsilon \ll 1\) is a small regularizing constant that prevents division by zero.

Adagrad eliminates much of the need to manually tune the learning rate: one typically sets a small initial \(\eta\), which is then effectively adapted during training. The algorithm is, however, quite sensitive to the choice of the initial learning rate, and it may require careful tuning to achieve good results.
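
In practice one keeps only the diagonal of \(G_t\), i.e., the running sum of element-wise squared gradients. A minimal sketch of the update (our own illustration, not part of the lesson code):

import numpy as np

def adagrad_step(theta, sum_sq_grad, grad, eta=0.01, eps=1e-8):
    """One Adagrad update; sum_sq_grad accumulates the element-wise squared gradients."""
    sum_sq_grad = sum_sq_grad + grad**2
    theta = theta - eta/np.sqrt(sum_sq_grad + eps)*grad
    return theta, sum_sq_grad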

4.4 Adaptive Moment Estimation (Adam)

The Adam algorithm (Kingma and Ba 2014) combines the ideas of momentum optimization and Adagrad to make more stable updates and achieve faster convergence.

Like momentum optimization, Adam uses an exponentially decaying average of the previous gradients to determine the direction of the update. This helps the model to make faster progress in the right direction and avoid oscillations. Like AdaGrad, Adam also scales the learning rate for each parameter based on the inverse square root of an exponentially decaying average of the squared gradients. This has the effect of reducing the learning rate for parameters that have been updated frequently, and increasing the learning rate for parameters that have been updated infrequently.

Adam uses exponentially decaying moving averages of the gradient and of its square:

\[\begin{equation} \begin{split} g_t &= \frac{\partial L(\theta_{t-1})}{\partial \theta}\\ m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2. \end{split} \end{equation}\]

The update rule for the parameters reads:

\[\begin{equation} \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t, \end{equation}\] where \[\begin{equation} \begin{split} \hat{m}_t &= \frac{m_t}{1-\beta^t_1}\\ \hat{v}_t &= \frac{v_t}{1-\beta^t_2}, \end{split} \end{equation}\] are bias-corrected first and second gradient moments estimates.

The authors suggest setting \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 10^{-8}\).
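
Putting the pieces together, a minimal sketch of one Adam step (our own illustration; the bonus exercise below asks you to plug something like this into the training loop):

import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """One Adam update at step t (starting from t = 1); m and v are the moving averages."""
    m = beta_1*m + (1 - beta_1)*grad        # first moment: decaying average of gradients
    v = beta_2*v + (1 - beta_2)*grad**2     # second moment: decaying average of squared gradients
    m_hat = m/(1 - beta_1**t)               # bias corrections
    v_hat = v/(1 - beta_2**t)
    theta = theta - eta*m_hat/(np.sqrt(v_hat) + eps)
    return theta, m, v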

Exercise

Based on the full gradient loop we used above, implement the SGD algorithm over the MNIST dataset. A reminder of what the code should look like:

  • Set batch size, learning rate, epochs and any other training hyperparameter.
  • Loop over training epochs:
    • Loop over number of batches:
      • Sample a random batch from the training set. For instance: batch = X_train[np.random.randint(0, X_train.shape[0], batch_size)]
      • Forward pass, backward, network update.
    • Track the test and training loss

Bonus: Implement the Adam optimizer in the previous loop.

Solution (pseudocode)
# Define the batch size and calculate the number of batches
batch_size = 64
number_batches = int(X_train.shape[0]/batch_size)

# Let's do a single epoch:
n_epoch = 1

# We loop over epoch
for epoch in (range(n_epoch)):

    # We loop over the number of batches
    for _ in tqdm(range(number_batches)):

        # Sample randomly one batch 
        idx_batch = np.random.randint(0, X_train.shape[0], batch_size)
        X_batch = X_train[idx_batch]
        Y_batch = Y_train[idx_batch]    
    
        # Forward pass only over the batch
        Y_pred, g, h_hat, h = forward(X_batch, w_1, b_1, w_2)    
        loss_train = 0.5*np.mean((Y_pred.squeeze() - Y_batch)**2)    
        loss_train_vs_epoch.append(loss_train)
    
        # Update parameters based on the loss of the current batch
        grad_w_1, grad_b_1, grad_w_2 = backward(X_batch, Y_batch, Y_pred, w_2, h_hat)
        w_1 -= eta/n_train*grad_w_1 
        b_1 -= eta/n_train*grad_b_1
        w_2 -= eta/n_train*grad_w_2

References

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” J. Mach. Learn. Res. 12 (July): 2121–59. https://doi.org/10.5555/1953048.2021068.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press. https://doi.org/10.5555/3086952.
Kingma, Diederik P., and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” https://arxiv.org/abs/1412.6980.