Coding A Neural Network From Scratch Using Python

Gabriel Mongaras
34 min read · Feb 1, 2022


Continuing from the previous article, let’s code a neural network using only NumPy. The previous article showed the math behind a neural network, which we are going to code right now.

In this article, you will learn:

  • How to code the forward pass in a neural network
  • How to evaluate the performance of a neural network using Python
  • How to code a function to update the parameters of a neural network using backpropagation

Before going into the implementation of a neural network using Python, you will likely need to know how classes in Python work. Since we are going to be using NumPy, it may also be a good idea to understand how NumPy works. Additionally, like in the last article, you will likely need to know a little linear algebra and have knowledge of derivatives from Calculus.

To get started, you will need to install the following Python libraries: NumPy, scikit-learn, and Matplotlib.

When testing the code, I used Python 3.8.10, in case the code below doesn’t work on some other version. That said, any Python version that supports reasonably recent releases of these packages should be able to run the code below.
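If you do not already have these packages, a plain pip install is enough (any reasonably recent versions should work with the code in this article):

pip install numpy scikit-learn matplotlib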

To start the code off, import the libraries using the following lines of code:

import numpy as np
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('TkAgg')
from matplotlib.colors import ListedColormap
import matplotlib.gridspec as gridspec

We are going to use NumPy for all the calculations, sklearn to easily retrieve a set of data, and Matplotlib to graph the neural network’s predictions.

Fully Connected Class

To start, let’s code a fully connected layer which we will use in our neural network. A fully connected layer is just a layer of nodes where each node is connected to all nodes in the previous layer and all nodes in the next layer. This structure is what basic neural networks are made of. Below is an image of a neural network.

Titanic Prediction with Artificial Neural Network in R. | LaptrinhX
https://laptrinhx.com/titanic-prediction-with-artificial-neural-network-in-r-3087367370/

Specifically, we want to code a layer that takes in some number of inputs and gives us some number of outputs. This way, we can generalize the layer to be both a hidden layer and an output layer.

Looking at a single layer in a neural network, we notice it has many properties in common with every other fully connected layer:

  1. The layer has some number of inputs
  2. The layer has some number of nodes
  3. The layer uses some activation function which is the same across all nodes in the layer
  4. The layer has a weight connecting each input to each node in the layer
  5. Each node in the layer has a bias
  6. Each node is computed by taking the dot product of the weights and inputs plus a bias term where the result is sent through an activation function

We will use the above properties to generalize our fully connected layer. Now let’s start coding this structure.

To begin, we will call the class FullyConnected and have the constructor take in the first three properties listed above.

# A fully connected layer for the neural network to use
class FullyConnected:
    # Initializes a fully connected layer
    # Parameters:
    #   numInputs - Number of inputs from previous layer
    #   numNodes - Number of nodes in this layer
    #   activation - The activation function to use
    def __init__(self, numInputs, numNodes, activation):
        # Initialize the number of inputs, number of
        # nodes, and activation function
        self.numInputs = numInputs
        self.numNodes = numNodes
        self.activation = activation

The fully connected layer also needs to store a matrix of weights which will be used to compute the result for each node. The weight matrix is of the following shape:

number of nodes x number of inputs

For example, if a network has two inputs and three nodes, the resulting weight matrix would be of shape 3x2, where each row corresponds to the weights connecting one node to all layer inputs.

Fully Connected Layer With 2 Inputs and 3 Nodes

Additionally, each node has a bias, so the number of biases needed is equal to the number of nodes in the hidden layer.

One problem that we run into is where to get the values for the weights and biases from. To solve this, we randomly initialize them. Usually, it is good practice to keep the initial values small, which helps the model start in a state where it can learn quickly. In our case, we initialize the weights and biases by sampling from a normal distribution with np.random.normal(-1, 1, ...). Note that NumPy treats those two arguments as the mean and standard deviation, so the samples are not strictly bounded between -1 and 1, but they are small enough for the network to train.
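As a quick standalone illustration (not part of the class), here are the shapes produced for the two-input, three-node example using the same call the constructor below makes:

import numpy as np

# Illustration only: the shapes used by a layer with 2 inputs and 3 nodes
weights = np.random.normal(-1, 1, (3, 2))  # one row of weights per node
biases = np.random.normal(-1, 1, 3)        # one bias per node
print(weights.shape)  # (3, 2)
print(biases.shape)   # (3,)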

Adding these two properties to the FullyConnected class results in the following constructor:

# A fully connected layer for the neural network to use
class FullyConnected:
    # Initializes a fully connected layer
    # Parameters:
    #   numInputs - Number of inputs from previous layer
    #   numNodes - Number of nodes in this layer
    #   activation - The activation function to use
    def __init__(self, numInputs, numNodes, activation):
        # Initialize the number of inputs, number of
        # nodes, and activation function
        self.numInputs = numInputs
        self.numNodes = numNodes
        self.activation = activation

        # Initialize the weights to random values.
        # The matrix will be of size:
        #   [self.numNodes, self.numInputs]
        # The values are small random numbers drawn from
        # a normal distribution.
        self.weights = np.random.normal(-1, 1,
                            (self.numNodes, self.numInputs))

        # Initialize the biases to random values.
        # The vector will be of size [self.numNodes] as
        # there is 1 bias per node. The values are also drawn
        # from a normal distribution.
        self.biases = np.random.normal(-1, 1, self.numNodes)

Now that we have a solid constructor for our FullyConnected class, let’s add a forward method to get predictions from each of the nodes. Remember that the formula to get a prediction from all the nodes is z = Wx + b followed by a = f(z), where W is the weight matrix, b is the bias vector, and f is the activation function.

Let’s translate this formula into code. Our forward method needs to accept x, the inputs from the previous layer.

    # Given a set of inputs, returns the value of the
    # feed forward method
    # Parameters:
    #   inputs - An array of inputs of size
    #            [self.numInputs, batchSize]
    # Outputs:
    #   The output from the layer
    #   The output from the layer before going through
    #   the activation function
    def forward(self, inputs):
        # If "self.numInputs" inputs were not supplied,
        # raise an error
        if np.array(inputs).shape[0] != self.numInputs:
            raise NameError("Inputs not correct shape")

        # Get the value of each node by taking the dot of
        # each input and its weights. The result should be a
        # vector of size [self.numNodes, batchSize]
        z = np.dot(inputs.T, self.weights.T)
        # Add the biases to each node output
        z = (np.array(z) + self.biases).T

The inputs are transposed because of the way the data is laid out: the inputs arrive with shape [numInputs, batchSize] while the weights have shape [numNodes, numInputs]. Transposing both before taking the dot product lines the shapes up as [batchSize, numInputs] · [numInputs, numNodes], and transposing the result again (after adding the biases) brings the output back to [numNodes, batchSize].
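To make the shape bookkeeping concrete, here is a small standalone sketch (with made-up sizes of 2 inputs, 3 nodes, and a batch of 4) showing how the transposes line up:

import numpy as np

inputs = np.random.normal(0, 1, (2, 4))   # [numInputs, batchSize]
weights = np.random.normal(0, 1, (3, 2))  # [numNodes, numInputs]
biases = np.random.normal(0, 1, 3)        # [numNodes]

z = np.dot(inputs.T, weights.T)           # (4, 2) . (2, 3) -> (4, 3)
z = (z + biases).T                        # broadcast the biases, then -> (3, 4)
print(z.shape)                            # (3, 4) = [numNodes, batchSize]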

After computing the z value, we need to calculate the a value, or the activation value. There are many different options when choosing activation functions, but for our class, we are going to stick to the following three:

  • ReLU for hidden layers
  • Sigmoid for either hidden or output layers
  • No activation function (where a = z)

Let’s code the rest of the forward function:

    # Given a set of inputs, returns the value of the
    # feed forward method
    # Parameters:
    #   inputs - An array of inputs of size
    #            [self.numInputs, batchSize]
    # Outputs:
    #   The output from the layer
    #   The output from the layer before going through
    #   the activation function
    def forward(self, inputs):
        # If "self.numInputs" inputs were not supplied,
        # raise an error
        if np.array(inputs).shape[0] != self.numInputs:
            raise NameError("Inputs not correct shape")

        # Get the value of each node by taking the dot of
        # each input and its weights. The result should be a
        # vector of size [self.numNodes, batchSize]
        z = np.dot(inputs.T, self.weights.T)
        # Add the biases to each node output
        z = (np.array(z) + self.biases).T

        # Send each node value through the activation function if
        # an activation function is specified.
        # Return the output and z value to be stored in the cache.
        if self.activation == "relu":
            return np.where(z > 0, z, z*0.05), z
        elif self.activation == "sigmoid":
            return 1/(1 + np.exp(-1*z)), z
        else:
            return z, z

Note that the function returns two outputs: the a values and the z values. The reason we want to hold on to the z values is for the backpropagation part of the program; the derivatives may need not just the a values, but also the z values.

An important note about the ReLU function in our code is that it doesn’t quite look like a standard ReLU. In a plain ReLU, we take the max of z and 0, but in our case, we shrink the values less than or equal to 0 by multiplying them by 0.05. If you look at a plain ReLU function, you will notice that its derivative is 0 whenever the value is less than 0. A major issue with this is that it can stop the network from learning: we use the chain rule to update the weights, and since the derivative is 0, every derivative that passes through this ReLU gets multiplied by 0, so no update occurs to those weights.

To solve this issue, we use a function called Leaky ReLU. Instead of having a slope of 0 when the input is less than 0, the slope is slightly greater than 0. Since the slope is no longer 0, we can effectively use the chain rule to update parameters that rely on this ReLU value while keeping most of the behavior of the ReLU activation function.

In our case, we are going to use a slope of 0.05 for our Leaky ReLU function.

https://paperswithcode.com/method/leaky-relu
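Here is a quick standalone comparison of plain ReLU and the leaky version used in the forward method above (with the 0.05 slope for negative inputs):

import numpy as np

z = np.array([5.5, -1.5, -0.3])
relu = np.maximum(z, 0)               # plain ReLU:  [5.5, 0, 0]
leaky = np.where(z > 0, z, z * 0.05)  # Leaky ReLU:  [5.5, -0.075, -0.015]
print(relu, leaky)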

That’s all there is to the FullyConnected structure. Below is the entire code for this class:

# A fully connected layer for the neural network to use
class FullyConnected:
    # Initializes a fully connected layer
    # Parameters:
    #   numInputs - Number of inputs from previous layer
    #   numNodes - Number of nodes in this layer
    #   activation - The activation function to use
    def __init__(self, numInputs, numNodes, activation):
        # Initialize the number of inputs, number of
        # nodes, and activation function
        self.numInputs = numInputs
        self.numNodes = numNodes
        self.activation = activation

        # Initialize the weights to random values.
        # The matrix will be of size:
        #   [self.numNodes, self.numInputs]
        # The values are small random numbers drawn from
        # a normal distribution.
        self.weights = np.random.normal(-1, 1,
                            (self.numNodes, self.numInputs))

        # Initialize the biases to random values.
        # The vector will be of size [self.numNodes] as
        # there is 1 bias per node. The values are also drawn
        # from a normal distribution.
        self.biases = np.random.normal(-1, 1, self.numNodes)

    # Given a set of inputs, returns the value of the
    # feed forward method
    # Parameters:
    #   inputs - An array of inputs of size
    #            [self.numInputs, batchSize]
    # Outputs:
    #   The output from the layer
    #   The output from the layer before going through
    #   the activation function
    def forward(self, inputs):
        # If "self.numInputs" inputs were not supplied,
        # raise an error
        if np.array(inputs).shape[0] != self.numInputs:
            raise NameError("Inputs not correct shape")

        # Get the value of each node by taking the dot of
        # each input and its weights. The result should be a
        # vector of size [self.numNodes, batchSize]
        z = np.dot(inputs.T, self.weights.T)
        # Add the biases to each node output
        z = (np.array(z) + self.biases).T

        # Send each node value through the activation function if
        # an activation function is specified.
        # Return the output and z value to be stored in the cache.
        if self.activation == "relu":
            return np.where(z > 0, z, z*0.05), z
        # Note: the softmax branch needs "from scipy import special",
        # which is not in the import list above and is not used
        # anywhere else in this article.
        elif self.activation == "softmax":
            return special.softmax(z, axis=-1), z
        elif self.activation == "sigmoid":
            return 1/(1 + np.exp(-1*z)), z
        else:
            return z, z

Let’s make sure this class actually works. We will use the image found below to ensure the layer works as desired:

Example Input and Output For A Fully Connected Layer

To begin, we will create the FullyConnected layer object with 2 inputs, 3 nodes, and a ReLU activation function:

import numpy as np

# Create the layer
numInputs = 2        # Two inputs into the layer
numNodes = 3         # Three nodes in the layer
activation = "relu"  # Use ReLU for the activation function
layer = FullyConnected(numInputs, numNodes, activation)

Now, let’s change the weights and biases in the layer to match the example:

# Change the weights and biases so we can test the layer
weights = np.array([[ 0.5,  1.0],
                    [ 0.1, -0.5],
                    [-0.1,  0.1]])
biases = np.array([1, -1, 0])
layer.weights = weights
layer.biases = biases

Finally, let’s see if the FullyConnected layer gives us the correct output:

# Get the prediction from the layer
inputs = np.array([5, 2])
a, z = layer.forward(inputs)
print(f"Values before activation: {z}")
print(f"Final values: {a}")

The final code block gives the following output:

Values before activation: [ 5.5 -1.5 -0.3]
Final values: [ 5.5 -0.075 -0.015]

The z values and a values both match the expected outputs found in the example, so the layer seems to be working properly.
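The same numbers can also be checked by hand with a couple of NumPy operations (reusing the weights, biases, and inputs variables from the snippets above), which is a handy sanity check when modifying the layer:

# Manual check of the example: z = W.x + b, then Leaky ReLU
z_check = np.dot(weights, inputs) + biases
a_check = np.where(z_check > 0, z_check, z_check * 0.05)
print(z_check)  # [ 5.5 -1.5 -0.3]
print(a_check)  # [ 5.5   -0.075 -0.015]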

Neural Network Class — Constructor

Now that a fully connected neural network layer can easily be created only by specifying the number of inputs, the number of nodes, and the activation function, we can code a class to create a neural network. Let’s start by going over the properties of a neural network:

Image of neural network with three fully connected layers

Looking at the neural network in its entirety, we notice it has the following general properties:

  1. The network has some number of initial inputs greater than 0
  2. The network has some number of layers greater than 0
  3. Each layer may have a different number of nodes
  4. The activation function may differ between layers

These 4 properties make up our neural network and are the parameters into our NeuralNetwork class:

class NeuralNetwork():
    # Initializes a neural network
    # Parameters:
    #   numInputs - number of inputs
    #   numLayers - number of layers
    #   layerSizes - size of each layer:
    #                vector of size [number of layers]
    #   layerActivations - activation for each layer:
    #                      vector of size [number of layers]
    def __init__(self, numInputs, numLayers, layerSizes,
                 layerActivations):
        # Initialize the number of inputs and number of layers
        self.numInputs = numInputs
        self.numLayers = numLayers

In the code above, we take in 4 parameters:

  1. numInputs — An integer representing the number of inputs into the network (values can be 1 or greater)
  2. numLayers — An integer representing the number of fully connected layers in the network (values can be 1 or greater)
  3. layerSizes — An array of integers which has numLayers number of elements where each element corresponds to the number of nodes in that layer.
  4. layerActivations — An array of strings which has numLayers number of elements where each element corresponds to the activation function for that layer.

Since we have a class that handles each fully connected layer, we can define the network as a list of layers built from layerSizes and layerActivations. To do this, we create numLayers FullyConnected objects. Recall that a FullyConnected layer is defined by three values:

  1. numInputs (This value is given from the previous layer)
  2. numNodes (This value is given from the respective element in the layerSizes array)
  3. activation (This value is given from the respective element in the layerActivations array)

Now let’s finish off the constructor for this class:

class NeuralNetwork():
    # Initializes a neural network
    # Parameters:
    #   numInputs - number of inputs
    #   numLayers - number of layers
    #   layerSizes - size of each layer:
    #                vector of size [number of layers]
    #   layerActivations - activation for each layer:
    #                      vector of size [number of layers]
    def __init__(self, numInputs, numLayers, layerSizes,
                 layerActivations):
        # Initialize the number of inputs and number of layers
        self.numInputs = numInputs
        self.numLayers = numLayers

        # Initialize an array to hold each layer object
        self.layers = []
        # Create "numLayers" new layers
        for i in range(0, self.numLayers):
            self.layers.append(FullyConnected(numInputs,
                layerSizes[i], layerActivations[i]))
            numInputs = layerSizes[i]

We can now code the example shown above. The code to do that is as follows:

numInputs = 3           # Three inputs into the network
numLayers = 3           # Three layers in the entire network
layerSizes = [5, 4, 1]  # The first layer has 5 nodes,
                        # the second has 4 nodes,
                        # and the final has 1 node
layerActivations = ["relu", "relu", "sigmoid"]
                        # The first and second layers
                        # use relu and the final layer
                        # uses sigmoid
model = NeuralNetwork(numInputs, numLayers, layerSizes,
                      layerActivations)

That’s all we need to define a neural network structure. Of course, we can’t do anything with the NeuralNetwork structure yet, so in the next section, we will define the forward function.

Neural Network Class — Forward

Having a neural network that can’t do anything is great and all, but let’s make some predictions on some data using the network. This process is practically already done since we can rely on the FullyConnected structure to handle all forward passes.

To get a complete prediction from a neural network, the inputs are sent through all layers in the network. We can do this with a for loop that iterates over all layers and feeds the previous layer’s outputs in as the current layer’s inputs:

    # Given a set of inputs, returns the value of the feedforward
    # method by sending the values through each layer's feedforward
    # method.
    # Parameters:
    #   inputs - An array of inputs of size
    #            [self.numInputs, batchSize]
    # Outputs:
    #   inputs - An array containing an output from each output
    #            node
    def forward(self, inputs):
        # Iterate over every layer in the network and send the
        # inputs (or outputs from the previous layer) through
        # the forward pass for each layer.
        for i in range(0, self.numLayers):
            # Get the output from the layer
            inputs, z = self.layers[i].forward(inputs)
        return inputs

The forward method is easy enough. One detail that’s missing is a cache, which the backward method will need once we implement it. So, instead of saving just the output values, we want to save all the intermediate values in case the derivatives use them. The code can be rewritten as follows:

    # Given a set of inputs, returns the value of the feedforward
    # method by sending the values through each layer's feedforward
    # method.
    # Parameters:
    #   inputs - An array of inputs of size
    #            [self.numInputs, batchSize]
    # Outputs:
    #   inputs - An array containing an output from each output
    #            node.
    #   cache - A dictionary containing cached values from each
    #           layer in the feedforward method.
    def forward(self, inputs):
        # This cache holds calculated values from each layer
        # to be used in backpropagation
        cache = dict()

        # Iterate over every layer in the network and send the
        # inputs (or outputs from the previous layer) through
        # the forward pass for each layer. Add the outputs
        # from each layer to the cache along the way
        for i in range(0, self.numLayers):
            # Add the inputs to the cache
            cache["a" + str(i-1)] = inputs
            # Get the output from the layer
            inputs, z = self.layers[i].forward(inputs)
            # Add the output and z-value to the cache
            cache["a" + str(i)] = inputs
            cache["z" + str(i)] = z

        # Return the output value and the cache
        return inputs, cache

Still, the function is very straightforward. Before moving on to the backward function, we need to make sure this function works. In particular, we want to make sure we get some sort of output that looks correct. Since we know that the FullyConnected structure works, there is no need to test if the output of each layer is correct.

To test it, let’s use the example in the previous section, which has three layers. To start, let’s create the model:

# Create the model
numInputs = 3           # Three inputs into the network
numLayers = 3           # Three layers in the entire network
layerSizes = [5, 4, 1]  # The first layer has 5 nodes,
                        # the second has 4 nodes,
                        # and the final has 1 node
layerActivations = ["relu", "relu", "sigmoid"]
                        # The first and second layers
                        # use relu and the final layer
                        # uses sigmoid
model = NeuralNetwork(numInputs, numLayers, layerSizes,
                      layerActivations)

Now we want to get a prediction for the inputs 1, 2, and 3. Since forward returns both the prediction and the cache, we unpack two values:

# Test the model
a, cache = model.forward(np.array([1, 2, 3]))
print(a)

The result is a single value of 0.03195935, which seems correct (your exact number will differ since the weights and biases are initialized randomly). Sigmoid should return a value between 0 and 1, and 0.03195935 falls within that range. Also, we specified that we wanted a single output, which we got.
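Since the layers were written with a batch dimension in mind, it is also worth a quick check that a batch of inputs flows through. Here, a made-up batch of 10 random points with 3 features each should produce an output of shape (1, 10):

# Quick batch check: 10 random input points, 3 features each
batch = np.random.normal(0, 1, (3, 10))  # [numInputs, batchSize]
a_batch, cache = model.forward(batch)
print(a_batch.shape)                     # (1, 10): one sigmoid output per point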

Loss Function

Before training the network, we need a way to evaluate it. In the previous article, I explained how the Binary Cross Entropy (BCE) loss function works, which is the loss function we are going to use in our network. I won’t go into detail about it here, but for a quick summary, just remember that it evaluates how close a predicted value between 0 and 1 is to the actual label, which is either 0 or 1. The function looks like the following:

L = -(1/N) · Σᵢ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]

https://towardsdatascience.com/understanding-pytorch-loss-functions-the-maths-and-algorithms-part-2-104f19346425

In the function above, the following terms are defined as:

  • N = The number of training examples
  • yᵢ = The true label
  • ŷᵢ = The predicted label (what the network predicted)

To start, let’s create a loss function that takes in y and ŷ, where y is a matrix of actual labels from the dataset and ŷ is a matrix of the same size as y with predictions from the model.

# The loss function for the neural network. Since the network
# is using a binary classifier, the loss function is
# binary cross entropy
# Parameters:
#   - yhat: Predictions from the neural network
#   - y: The values that the predictions should be
def binaryCrossEntropy(yhat, y):
    # Make sure the inputs are numpy arrays
    yhat = np.array(yhat)
    y = np.array(y)

    # Compute the reversed values
    yhat_rev = 1 - yhat

We also compute yhat_rev, which is simply 1 - yhat. Now let’s compute the two terms of the loss function, where the leftmost term will be called log1 and the rightmost term will be called log2.

# The loss function for the neural network. Since the network
# is using a binary classifier, the loss function is
# binary cross entropy
# Parameters:
#   - yhat: Predictions from the neural network
#   - y: The values that the predictions should be
def binaryCrossEntropy(yhat, y):
    # Make sure the inputs are numpy arrays
    yhat = np.array(yhat)
    y = np.array(y)

    # Compute the reversed values
    yhat_rev = 1 - yhat

    # Compute the log terms
    log1 = y*np.log(yhat)
    log2 = (1-y)*np.log(yhat_rev)

Finally, take the sum of the two terms, average them (using 1/N), and multiply by -1.

# The loss function for the neural network. Since the network
# is using a binary classifier, the loss function is
# binary cross entropy
# Parameters:
#   - yhat: Predictions from the neural network
#   - y: The values that the predictions should be
def binaryCrossEntropy(yhat, y):
    # Make sure the inputs are numpy arrays
    yhat = np.array(yhat)
    y = np.array(y)

    # Compute the reversed values
    yhat_rev = 1 - yhat

    # Compute the log terms
    log1 = y*np.log(yhat)
    log2 = (1-y)*np.log(yhat_rev)

    # Return the loss
    return -1*(1/y.shape[1])*(np.sum(log1 + log2))

Before finishing the loss function, there is a slight problem with it. When calculating the log of 0, we find the value is undefined, and the limit is negative infinity. This issue can “kill” the network as any derivative that uses that term will also become negative infinity or infinity.
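A quick demonstration of the problem, along with the small-constant fix (the code below uses an epsilon of 0.0000001):

import numpy as np

print(np.log(0.0))              # -inf (NumPy also emits a runtime warning)
print(0 * np.log(0.0))          # nan, which would poison the loss and the gradients
print(np.log(0.0 + 0.0000001))  # about -16.1, large but finite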

To fix the issue of infinite values, we can add a really small value to yhat_rev values that are 0. Adding on a small constant value keeps infinite values from popping up while mostly keeping the meaning of that value. Adding this constant makes the final loss function look like the following:

# The loss function for the neural network. Since the network
# is using a binary classifier, the loss function is
# binary cross entropy
# Parameters:
#   - yhat: Predictions from the neural network
#   - y: The values that the predictions should be
def binaryCrossEntropy(yhat, y):
    # Make sure the inputs are numpy arrays
    yhat = np.array(yhat)
    y = np.array(y)

    # Compute the reversed values
    yhat_rev = 1 - yhat

    # Change values of 0 within yhat_rev slightly
    # to avoid the loss becoming nan or infinity.
    yhat_rev = np.where(yhat_rev==0, yhat_rev+0.0000001, yhat_rev)

    # Compute the log terms
    log1 = y*np.log(yhat)
    log2 = (1-y)*np.log(yhat_rev)

    # Return the loss
    return -1*(1/y.shape[1])*(np.sum(log1 + log2))

Note that the function above is not a part of the NeuralNetwork class.
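As a quick sanity check (with made-up numbers), the loss should be small when the predictions are close to the labels and large when they are far off. Note that y is passed as a 2D array here because the function divides by y.shape[1]:

# Small sanity check for the loss function
y = np.array([[1, 0, 1]])
good_preds = np.array([[0.9, 0.1, 0.8]])
bad_preds = np.array([[0.1, 0.9, 0.2]])
print(binaryCrossEntropy(good_preds, y))  # roughly 0.14
print(binaryCrossEntropy(bad_preds, y))   # roughly 2.07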

Now that we have the forward function to make predictions and a loss function to evaluate the network, we need to somehow train the network, which will be done in the next section.

Neural Network Class — Backward

The backward pass uses partial derivatives from a loss function to train the model. The goal of this function is to generalize the backward function so that it works with all three activation functions with any number of nodes and any number of layers. Let the derivative fun begin :)

To start the backward function, we will need the following parameters:

  • cache — A dictionary holding intermediate values in a neural network
  • X_train — The inputs from a dataset
  • y_train — The expected outputs from a dataset
  • alpha — The learning rate which tells the model how quickly we want it to learn (higher is not always better)

The parameters above will be the inputs into the function. The output of the function will be the cache which will be used to make a graph.

In case there is any confusion, here is the notation I will be using:

  • L = The loss function
  • y = The label or value we want the network to predict
  • ŷ = The prediction from the network
  • z = The set of values from a layer before the activation function is applied
  • a = (activation values) The set of values from a layer after the activation function is applied
  • b = The set of biases for a layer
  • w = The matrix of weights connecting a layer to the previous layer
  • da = derivative of the activation values
  • dz = derivative of the z values
  • dL = derivative of the loss function
  • db = derivative of the biases
  • dw = derivative of the weights
  • n = The nₜₕ layer

Note that I will use n only if values between different layers are going to be used.

Before coding the derivatives for each layer, let’s look at how to calculate the derivatives between layers real quick.

Using the three-layer network pictured earlier as an example, let:

  • a₁ be the activation value for the first layer
  • a₂ be the activation value for the second layer
  • a₃ be the activation value for the output layer
  • z₁ be the z value for the first layer
  • and so on for all other variables w, b, and z

The function for the output layer is as follows, where σ is the sigmoid activation:

ŷ = a₃ = σ(z₃),  with  z₃ = w₃·a₂ + b₃

The derivatives of the final fully connected layer w₃ and b₃ are easy enough to calculate, but what about for the second layer?

To help visualize the derivatives for layer 2, let’s look at the entire formula starting from layer 2. Note that we replace x₃ with a₂ because the output from the previous layer is the input to the current layer, so x₃ is the same as a₂ (f is layer 2’s activation function):

a₃ = σ(w₃·a₂ + b₃),  with  a₂ = f(z₂) and z₂ = w₂·a₁ + b₂

So the derivatives of w₂ and b₂ follow from the chain rule:

dL/dw₂ = (dL/da₃)·(da₃/dz₃)·(dz₃/da₂)·(da₂/dz₂)·(dz₂/dw₂)
dL/db₂ = (dL/da₃)·(da₃/dz₃)·(dz₃/da₂)·(da₂/dz₂)·(dz₂/db₂)

The same pattern applies to all layers and is what we will use in a for loop to update the weights and biases. The pattern is as follows (a tiny worked example follows this list):

  1. Calculate dL/da and store it
  2. Calculate dL/dz by using dL/da*da/dz and store it
  3. Calculate dL/dw by using dL/dz*dz/dw and store it
  4. Calculate dL/db by using dL/dz*dz/db and store it
  5. Update the weights and biases in the layer using the derivatives
  6. Repeat steps 1–5 until there are no more layers to update
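To make the pattern concrete before diving into the full method, here is a tiny standalone example (a single sigmoid node trained on one made-up data point with a learning rate of 0.25) that walks through steps 1 through 5 once:

import numpy as np

x = np.array([[0.5], [1.0]])         # one input point with 2 features
y = np.array([[1.0]])                # its label
w = np.random.normal(-1, 1, (1, 2))  # 1 node, 2 inputs
b = np.random.normal(-1, 1, 1)

z = np.dot(w, x) + b                 # forward pass
a = 1 / (1 + np.exp(-z))
da = (-y / a) + ((1 - y) / (1 - a))  # step 1: dL/da (BCE derivative)
dz = da * a * (1 - a)                # step 2: dL/dz = dL/da * da/dz
dw = np.dot(dz, x.T)                 # step 3: dL/dw = dL/dz * dz/dw
db = np.sum(dz, axis=1)              # step 4: dL/db = dL/dz * 1
w -= 0.25 * dw                       # step 5: gradient step with alpha = 0.25
b -= 0.25 * db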

Let’s start with dL/da. Remember that backpropagation starts from the loss function and computes the gradients from right to left. So, let’s begin by taking the derivative of the BCE loss function with respect to the prediction (represented as ŷ or a). For a single prediction, that derivative is:

dL/dŷ = -y/ŷ + (1 − y)/(1 − ŷ)

But what if we want the derivative of a in a hidden layer? In that case, the derivative comes from the layer after it, since a hidden layer’s activations are that next layer’s inputs. In the code below, this is done by summing the next layer’s weight gradients and reshaping the result to match the current layer’s z values.

The code translation, covering both cases of the da values, is as follows:

    # Given the cache and other parameters, this method calculates
    # the derivatives needed to perform one step of backpropagation.
    # As the derivatives are calculated, the weights and biases
    # are updated
    # Parameters:
    #   cache - Values from the forward pass which will be needed to
    #           calculate the gradients and update the weights
    #           and biases
    #   X_train - Input values used to train the model
    #   y_train - Output values used to train the model
    #   alpha - The learning rate, which is the rate at which the
    #           derivatives affect the weights and biases.
    # Outputs:
    #   cache - Values from the forward pass as well as values
    #           from the backward pass.
    def backward(self, cache, X_train, y_train, alpha):
        # Starting at the loss function, calculate the partial
        # derivative of the loss function
        cache["da" + str(self.numLayers-1)] = \
            (-y_train/cache["a" + str(self.numLayers-1)]) + \
            ((1-y_train)/(1-cache["a" + str(self.numLayers-1)]))

        # Iterate through each layer starting with the last one
        # and calculate the partial derivatives needed to find
        # the gradients for the weights and biases
        for i in reversed(range(0, self.numLayers)):
            # Calculate the derivative of the activation function in
            # terms of the loss function if the layer is a
            # hidden layer
            if i != self.numLayers-1:
                cache["da" + str(i)] = np.sum(
                    cache["dweights" + str(i+1)],
                    axis=-2).reshape(cache["z" + str(i)].shape)

Note that we are storing the derivative of the loss as the final da value in the cache. Using the cache, we can easily store and access variables for each layer in the backward pass through intuitive names like dw1 and dw2. To allow the network to have any number of layers, we will use a loop to iterate over all layers in the network from the last layer to the first layer to compute the gradients for all layers.

Now we need to code the derivative of the z value or dL/dz.

Remember that the three activation functions we are using are (Leaky) ReLU, Sigmoid, and no function. Their derivatives with respect to z are as follows:

  • Leaky ReLU: da/dz = 1 if z > 0, otherwise 0.05 (the leaky slope)
  • Sigmoid: da/dz = a(1 − a)
  • No activation: da/dz = 1

I’m not going to go over the derivation of the sigmoid derivative, but I will go over the other two.

As mentioned before, the plain ReLU activation function has a derivative of 0 when z ≤ 0 and 1 when z > 0, but we are using the Leaky ReLU function, which has a slight positive slope (0.05 in our code) when z < 0. So instead, the derivative is that slight positive slope (0.05 in our case) when z ≤ 0 and 1 when z > 0.

As for no activation function, a is simply equal to z, and the derivative of z with respect to z is 1.

Let’s add these derivatives to the code:

    # Given the cache and other parameters, this method calculates
    # the derivatives needed to perform one step of backpropagation.
    # As the derivatives are calculated, the weights and biases
    # are updated
    # Parameters:
    #   cache - Values from the forward pass which will be needed to
    #           calculate the gradients and update the weights
    #           and biases
    #   X_train - Input values used to train the model
    #   y_train - Output values used to train the model
    #   alpha - The learning rate, which is the rate at which the
    #           derivatives affect the weights and biases.
    # Outputs:
    #   cache - Values from the forward pass as well as values
    #           from the backward pass.
    def backward(self, cache, X_train, y_train, alpha):
        # Starting at the loss function, calculate the partial
        # derivative of the loss function
        cache["da" + str(self.numLayers-1)] = \
            (-y_train/cache["a" + str(self.numLayers-1)]) + \
            ((1-y_train)/(1-cache["a" + str(self.numLayers-1)]))

        # Iterate through each layer starting with the last one
        # and calculate the partial derivatives needed to find
        # the gradients for the weights and biases
        for i in reversed(range(0, self.numLayers)):
            # Calculate the derivative of the activation function in
            # terms of the loss function if the layer is a
            # hidden layer
            if i != self.numLayers-1:
                cache["da" + str(i)] = np.sum(
                    cache["dweights" + str(i+1)],
                    axis=-2).reshape(cache["z" + str(i)].shape)

            # Calculate the derivative of the activation function
            # in terms of z.
            # If the activation function is (leaky) relu, then the
            # derivative is 1 if the output of the layer before the
            # activation function (zi) is greater than 0, and the
            # leaky slope of 0.05 if that output is less than or
            # equal to 0.
            if self.layers[i].activation == "relu":
                cache["dz" + str(i)] = np.where(
                    cache["z" + str(i)] <= 0, 0.05, 1.)
            # If the activation function is sigmoid, then the
            # derivative is given by the following formula:
            #   (phat*(1-phat))
            # where phat are the outputs (or predictions) for that
            # layer
            elif self.layers[i].activation == "sigmoid":
                cache["dz" + str(i)] = cache["a" + str(i)] * \
                    (1 - cache["a" + str(i)])
            # If the activation function is nothing, then there is
            # no derivative to be taken, so it can be represented
            # as 1 for each output in the batch
            else:
                cache["dz" + str(i)] = np.ones(
                    cache["z" + str(i)].shape)

With the code above, we can now calculate da/dz, but the goal is to have dL/dz. Recall that the way to do this is by using the chain rule.

Since we already have dL/da, all we need to do is multiply dL/da by da/dz to get dL/dz. Let’s add that to the code.

    # Given the cache and other parameters, this method calculates
    # the derivatives needed to perform one step of backpropagation.
    # As the derivatives are calculated, the weights and biases
    # are updated
    # Parameters:
    #   cache - Values from the forward pass which will be needed to
    #           calculate the gradients and update the weights
    #           and biases
    #   X_train - Input values used to train the model
    #   y_train - Output values used to train the model
    #   alpha - The learning rate, which is the rate at which the
    #           derivatives affect the weights and biases.
    # Outputs:
    #   cache - Values from the forward pass as well as values
    #           from the backward pass.
    def backward(self, cache, X_train, y_train, alpha):
        # Starting at the loss function, calculate the partial
        # derivative of the loss function
        cache["da" + str(self.numLayers-1)] = \
            (-y_train/cache["a" + str(self.numLayers-1)]) + \
            ((1-y_train)/(1-cache["a" + str(self.numLayers-1)]))

        # Iterate through each layer starting with the last one
        # and calculate the partial derivatives needed to find
        # the gradients for the weights and biases
        for i in reversed(range(0, self.numLayers)):
            # Calculate the derivative of the activation function in
            # terms of the loss function if the layer is a
            # hidden layer
            if i != self.numLayers-1:
                cache["da" + str(i)] = np.sum(
                    cache["dweights" + str(i+1)],
                    axis=-2).reshape(cache["z" + str(i)].shape)

            # Calculate the derivative of the activation function
            # in terms of z.
            # If the activation function is (leaky) relu, then the
            # derivative is 1 if the output of the layer before the
            # activation function (zi) is greater than 0, and the
            # leaky slope of 0.05 if that output is less than or
            # equal to 0.
            if self.layers[i].activation == "relu":
                cache["dz" + str(i)] = np.where(
                    cache["z" + str(i)] <= 0, 0.05, 1.)
            # If the activation function is sigmoid, then the
            # derivative is given by the following formula:
            #   (phat*(1-phat))
            # where phat are the outputs (or predictions) for that
            # layer
            elif self.layers[i].activation == "sigmoid":
                cache["dz" + str(i)] = cache["a" + str(i)] * \
                    (1 - cache["a" + str(i)])
            # If the activation function is nothing, then there is
            # no derivative to be taken, so it can be represented
            # as 1 for each output in the batch
            else:
                cache["dz" + str(i)] = np.ones(
                    cache["z" + str(i)].shape)

            # Multiply the current dz value (da/dz) by this layer's
            # da value (dL/da) to continue the chain rule
            cache["dz" + str(i)] *= cache["da" + str(i)]

Next, we need to calculate the derivatives of the weights and biases. These derivatives were explained in the last article, and using the chain rule we get the following code:

    # Given the cache and other parameters, this method calculates
    # the derivatives needed to perform one step of backpropagation.
    # As the derivatives are calculated, the weights and biases
    # are updated
    # Parameters:
    #   cache - Values from the forward pass which will be needed to
    #           calculate the gradients and update the weights
    #           and biases
    #   X_train - Input values used to train the model
    #   y_train - Output values used to train the model
    #   alpha - The learning rate, which is the rate at which the
    #           derivatives affect the weights and biases.
    # Outputs:
    #   cache - Values from the forward pass as well as values
    #           from the backward pass.
    def backward(self, cache, X_train, y_train, alpha):
        # Starting at the loss function, calculate the partial
        # derivative of the loss function
        cache["da" + str(self.numLayers-1)] = \
            (-y_train/cache["a" + str(self.numLayers-1)]) + \
            ((1-y_train)/(1-cache["a" + str(self.numLayers-1)]))

        # Iterate through each layer starting with the last one
        # and calculate the partial derivatives needed to find
        # the gradients for the weights and biases
        for i in reversed(range(0, self.numLayers)):
            # Calculate the derivative of the activation function in
            # terms of the loss function if the layer is a
            # hidden layer
            if i != self.numLayers-1:
                cache["da" + str(i)] = np.sum(
                    cache["dweights" + str(i+1)],
                    axis=-2).reshape(cache["z" + str(i)].shape)

            # Calculate the derivative of the activation function
            # in terms of z.
            # If the activation function is (leaky) relu, then the
            # derivative is 1 if the output of the layer before the
            # activation function (zi) is greater than 0, and the
            # leaky slope of 0.05 if that output is less than or
            # equal to 0.
            if self.layers[i].activation == "relu":
                cache["dz" + str(i)] = np.where(
                    cache["z" + str(i)] <= 0, 0.05, 1.)
            # If the activation function is sigmoid, then the
            # derivative is given by the following formula:
            #   (phat*(1-phat))
            # where phat are the outputs (or predictions) for that
            # layer
            elif self.layers[i].activation == "sigmoid":
                cache["dz" + str(i)] = cache["a" + str(i)] * \
                    (1 - cache["a" + str(i)])
            # If the activation function is nothing, then there is
            # no derivative to be taken, so it can be represented
            # as 1 for each output in the batch
            else:
                cache["dz" + str(i)] = np.ones(
                    cache["z" + str(i)].shape)

            # Multiply the current dz value (da/dz) by this layer's
            # da value (dL/da) to continue the chain rule
            cache["dz" + str(i)] *= cache["da" + str(i)]

            # Now, calculate the derivatives for the weights
            # and biases. Remember to multiply all of these
            # values by the z value derivative to complete the
            # chain rule.
            # Get the derivative of the weights. The derivative of
            # the weights is the corresponding input values
            # (a(i-1))*(1/m) dot the previous derivative values.
            # In this case, the previous derivative values are dz:
            # dweights = a(i-1)*dz
            cache["dweights" + str(i)] = (1/X_train.shape[0]) * \
                np.array([x * cache["dz" + str(i)]
                          for x in cache["a" + str(i-1)]]).reshape(
                    cache["a" + str(i-1)].shape[0],
                    cache["dz" + str(i)].shape[0],
                    X_train.shape[0])
            # The derivative of a bias is 1*(1/m), since the bias
            # is constant, dot the previous derivative values.
            # In this case, the previous derivative values are dz:
            # dbiases = 1*dz
            cache["dbiases" + str(i)] = 1*(1/X_train.shape[0]) * \
                cache["dz" + str(i)]

            # Correct the weight and bias gradients by changing all
            # nan values to 0 and keeping the values between -1 and 1
            # to avoid exploding gradients
            cache["dweights" + str(i)] = np.nan_to_num(
                cache["dweights" + str(i)], nan=0.0)
            cache["dbiases" + str(i)] = np.nan_to_num(
                cache["dbiases" + str(i)], nan=0.0)
            cache["dweights" + str(i)] = np.clip(
                cache["dweights" + str(i)], -1, 1)
            cache["dbiases" + str(i)] = np.clip(
                cache["dbiases" + str(i)], -1, 1)

There are a few things to notice in this code block. The first is the .reshape functions in some of the calculations. The reshape function does not change the values of the data. All it does is change the shape of the data to a shape that can be multiplied without shape issues.

The next important thing to notice is that the derivatives of the weights and biases are averaged, as they are multiplied by 1/N. Since we are doing gradient descent in batches (in this case a single batch containing all the data), where calculations use many examples at a time, we need to average the derivatives over the number of examples in the batch (X_train.shape[0] in the code). Otherwise, the derivatives would be too large and would likely “kill” the model.

The final thing to notice is that the derivatives of the weights and biases are changed and clipped. If a derivative is NaN, then we want to change it to a more useful value. In our case, we just change it to a 0. We also want to clip the derivatives in case they become too large. A derivative that is too large may “kill” the model or cause the model to take a massive step in the wrong direction which essentially causes it to start learning all over again. This method of clipping the gradients is called gradient clipping and is commonly used in deep learning.
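Here is a small standalone demonstration (with made-up values) of what those two cleanup steps do to a gradient array:

import numpy as np

grads = np.array([0.3, np.nan, 57.0, -12.0])
grads = np.nan_to_num(grads, nan=0.0)  # nan -> 0
grads = np.clip(grads, -1, 1)          # clamp everything to [-1, 1]
print(grads)                           # [ 0.3  0.   1.  -1. ]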

Finally, we can update the weights and biases using the gradients with the standard gradient descent formulas:

w = w − α·(dL/dw)
b = b − α·(dL/db)

The final code for the backward function is as follows:

    # Given the cache and other parameters, this method calculates
    # the derivatives needed to perform one step of backpropagation.
    # As the derivatives are calculated, the weights and biases
    # are updated
    # Parameters:
    #   cache - Values from the forward pass which will be needed to
    #           calculate the gradients and update the weights
    #           and biases
    #   X_train - Input values used to train the model
    #   y_train - Output values used to train the model
    #   alpha - The learning rate, which is the rate at which the
    #           derivatives affect the weights and biases.
    # Outputs:
    #   cache - Values from the forward pass as well as values
    #           from the backward pass.
    def backward(self, cache, X_train, y_train, alpha):
        # Starting at the loss function, calculate the partial
        # derivative of the loss function
        cache["da" + str(self.numLayers-1)] = \
            (-y_train/cache["a" + str(self.numLayers-1)]) + \
            ((1-y_train)/(1-cache["a" + str(self.numLayers-1)]))

        # Iterate through each layer starting with the last one
        # and calculate the partial derivatives needed to find
        # the gradients for the weights and biases
        for i in reversed(range(0, self.numLayers)):
            # Calculate the derivative of the activation function in
            # terms of the loss function if the layer is a
            # hidden layer
            if i != self.numLayers-1:
                cache["da" + str(i)] = np.sum(
                    cache["dweights" + str(i+1)],
                    axis=-2).reshape(cache["z" + str(i)].shape)

            # Calculate the derivative of the activation function
            # in terms of z.
            # If the activation function is (leaky) relu, then the
            # derivative is 1 if the output of the layer before the
            # activation function (zi) is greater than 0, and the
            # leaky slope of 0.05 if that output is less than or
            # equal to 0.
            if self.layers[i].activation == "relu":
                cache["dz" + str(i)] = np.where(
                    cache["z" + str(i)] <= 0, 0.05, 1.)
            # If the activation function is sigmoid, then the
            # derivative is given by the following formula:
            #   (phat*(1-phat))
            # where phat are the outputs (or predictions) for that
            # layer
            elif self.layers[i].activation == "sigmoid":
                cache["dz" + str(i)] = cache["a" + str(i)] * \
                    (1 - cache["a" + str(i)])
            # If the activation function is nothing, then there is
            # no derivative to be taken, so it can be represented
            # as 1 for each output in the batch
            else:
                cache["dz" + str(i)] = np.ones(
                    cache["z" + str(i)].shape)

            # Multiply the current dz value (da/dz) by this layer's
            # da value (dL/da) to continue the chain rule
            cache["dz" + str(i)] *= cache["da" + str(i)]

            # Now, calculate the derivatives for the weights
            # and biases. Remember to multiply all of these
            # values by the z value derivative to complete the
            # chain rule.
            # Get the derivative of the weights. The derivative of
            # the weights is the corresponding input values
            # (a(i-1))*(1/m) dot the previous derivative values.
            # In this case, the previous derivative values are dz:
            # dweights = a(i-1)*dz
            cache["dweights" + str(i)] = (1/X_train.shape[0]) * \
                np.array([x * cache["dz" + str(i)]
                          for x in cache["a" + str(i-1)]]).reshape(
                    cache["a" + str(i-1)].shape[0],
                    cache["dz" + str(i)].shape[0],
                    X_train.shape[0])
            # The derivative of a bias is 1*(1/m), since the bias
            # is constant, dot the previous derivative values.
            # In this case, the previous derivative values are dz:
            # dbiases = 1*dz
            cache["dbiases" + str(i)] = 1*(1/X_train.shape[0]) * \
                cache["dz" + str(i)]

            # Correct the weight and bias gradients by changing all
            # nan values to 0 and keeping the values between -1 and 1
            # to avoid exploding gradients
            cache["dweights" + str(i)] = np.nan_to_num(
                cache["dweights" + str(i)], nan=0.0)
            cache["dbiases" + str(i)] = np.nan_to_num(
                cache["dbiases" + str(i)], nan=0.0)
            cache["dweights" + str(i)] = np.clip(
                cache["dweights" + str(i)], -1, 1)
            cache["dbiases" + str(i)] = np.clip(
                cache["dbiases" + str(i)], -1, 1)

            # Update the weights and biases for this layer
            self.layers[i].weights -= alpha*np.sum(
                cache["dweights" + str(i)], axis=-1).T
            self.layers[i].biases -= alpha*np.sum(
                np.sum(cache["dbiases" + str(i)], axis=-1), axis=-1)

        # Return the updated cache
        return cache

Finally, we can test out all the code we have written. I have created a function named main which has the following details:

  • The moons dataset is used to train the model
  • It uses a neural network with 2 inputs, 2 hidden layers of 16 nodes each, and an output layer with 1 node, using a ReLU function for the hidden layers and a Sigmoid function for the output layer
  • It updates the model 20,000 times
  • The learning rate alpha is 0.25
  • Every 50 model updates, a graph is shown for the real data (bottom left graph), the network's predictions on that data (bottom right graph), and the loss over all updates (top graph)
  • After 20,000 updates, the real-data and prediction graphs are shown for the test data, which is data the model has never seen
def main():
    # The learning rate used to update the weights and biases
    alpha = 0.25

    # Get the data from a dataset
    data = sklearn.datasets.make_moons(noise=0.2, n_samples=200,
                                       shuffle=False, random_state=0)

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        data[0], data[1])

    # Create the neural network with:
    # - numInputs of size 2 (an x1 and x2 value)
    # - 3 layers
    # - 2 hidden layers of size 16
    # - an output layer of size 1 (for the classification (yhat))
    # - A relu activation function for each hidden layer
    # - A sigmoid activation function for the output layer
    model = NeuralNetwork(2, 3, [16, 16, 1],
                          ["relu", "relu", "sigmoid"])

    # Holds all loss values and their corresponding step values
    # to plot the data
    loss_y = []
    loss_x = []

    # Setup the graphs and variables needed to graph the data.
    gs = gridspec.GridSpec(2, 2)
    plt.figure()
    loss_plot = plt.subplot(gs[0, :])
    real_plot = plt.subplot(gs[1, 0])
    pred_plot = plt.subplot(gs[1, 1])
    plt.ion()

    # For 20000 iterations, feed forward the training data through
    # the network and update the weights using the gradients
    # by going backwards.
    for i in range(0, 20000):
        #####################
        #Forward Propagation#
        #####################
        # Feed forward the datapoints through the network to get
        # the predictions as well as a cache that holds values
        # calculated through the hidden layers
        predictions, cache = model.forward(X_train.T)

        # Get the loss for the current predictions
        cache["loss"] = binaryCrossEntropy(predictions,
                                           np.array([y_train]))
        loss_y.append(cache["loss"])
        loss_x.append(i)
        if i%50 == 0:
            print("loss at step " + str(i) + " = " +
                  str(cache["loss"]))

        # Round the predictions to a 0 or a 1
        predictions = np.round(predictions)

        ##################
        #Back Propagation#
        ##################
        cache = model.backward(cache, X_train, y_train, alpha)

        ###################
        #Plotting the Data#
        ###################
        if i%50 == 0:
            plt.cla()
            loss_plot.clear()
            real_plot.clear()
            pred_plot.clear()
            plt.tight_layout(pad=3)
            loss_plot.set(xlabel='step', ylabel='loss value')
            loss_plot.set_title("Loss = " + str(np.around(
                cache["loss"], 5)))
            real_plot.set(xlabel='X', ylabel='y')
            real_plot.set_title("Real Data")
            pred_plot.set(xlabel='X', ylabel='y')
            pred_plot.set_title("Training Data Predictions")
            loss_plot.scatter(loss_x, loss_y)
            real_plot.scatter(X_train[:,0], X_train[:,1],
                              c=y_train, cmap=ListedColormap(
                                  ['#FF0000', '#0000FF']),
                              edgecolors='k')
            pred_plot.scatter(X_train[:,0], X_train[:,1],
                              c=predictions, cmap=ListedColormap(
                                  ['#FF0000', '#0000FF']),
                              edgecolors='k')
            plt.pause(0.0001)
            plt.show()

    # Pause to show the ending results for 10 seconds
    plt.pause(10)

    # Make the final predictions on the test set.
    final_preds, _ = model.forward(X_test.T)

    # Print the loss value for the test set
    final_loss = binaryCrossEntropy(final_preds, np.array([y_test]))
    print("Final loss: " + str(final_loss))

    # Graph the test data
    fig, (real_plot, pred_plot) = plt.subplots(2)
    plt.tight_layout(pad=3)
    real_plot.set(xlabel='X', ylabel='y')
    real_plot.set_title("Real Data (Top) v. Predictions (Bottom)")
    pred_plot.set(xlabel='X', ylabel='y')
    pred_plot.set_title("Test Loss = " + str(final_loss))
    real_plot.scatter(X_test[:,0], X_test[:,1], c=y_test,
                      cmap=ListedColormap(['#FF0000', '#0000FF']))
    pred_plot.scatter(X_test[:,0], X_test[:,1], c=final_preds,
                      cmap=ListedColormap(['#FF0000', '#0000FF']))
    plt.show()
    plt.pause(30)

# Call the main function
main()

In the main function above, the model is trying to predict whether a dot should be red or blue, essentially splitting the plane into regions. The goal is for the model's predictions (bottom right graph) to match the real data (bottom left graph). An example of the graphs during training is shown below:

I’m not going to go into too much detail about how this function works, as this article is about the neural network, not about graphing the data.

That’s all there is to coding a neural network using NumPy. It’s just a massive chain rule which updates the parameters in a way that will decrease the loss function. If you want to see the final project, you can find it on GitHub:

https://github.com/gmongaras/Visualizing_Gradient_Descent_For_BCE_Loss

