How Do Neural Networks Work?
With neural networks changing the world, I thought it would be a good idea to learn about them and to teach others how these influential systems work.
In this article, you will learn:
- How neural networks can make predictions through the forward feed method
- The importance of a loss function
- How neural networks learn through backpropagation
- How to code both the forward feed method and backpropagation in Python using a basic example
To get a basic understanding of the math behind a neural network, you will likely need to know a little linear algebra for the forward feed method and knowledge of derivatives (we will be using basic partial derivatives, but an understanding of a normal derivative should be good enough).
Forward Feed
The forward feed method is built upon dot products and nonlinear activation functions. Each neural network has a vector of inputs, some hidden layers, and an output layer. Note that the inputs may be mistaken for a layer, but since no calculations are performed on them, they are just considered a vector or matrix of data.
Each layer consists of many nodes. Notice, the first hidden layer has 4 nodes, the second hidden layer has 4 nodes, and the output layer has 1 node. Each layer can have as many nodes as you desire, but as the number of nodes increases, the computational power needed to train the network also increases. Also, notice how each node has multiple lines connecting that node to all nodes in the next layer along with all nodes in the previous layer. These lines are called weights which are the parameters we wish to tune in the neural network.
Before going into the formulas for the forward feed part of a neural network, let’s go over some notation:
- Xᵢ = The ith input into the node
- wᵢ = The ith weight going into the node
- b = The bias for that node
- z = The output of a node before the activation function is applied
- a = The output of a node after the activation function is applied
Below is a picture of a representation of a node. Don’t worry if you don’t understand what’s happening as we are going to break it down step by step.
The node above has 3 inputs (x values), meaning it also has three weights. You can imagine this node being part of some hidden layer with the previous hidden layer containing 3 nodes. Remember, a node is connected to all nodes preceding it which is why we have 3 inputs. Each input has a corresponding weight which will be used to calculate the z value, the first computation in a node. The formula below can be used to calculate the z value of the node shown above:
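Written out with the notation above for a node with three inputs, the formula would read as follows (a reconstruction from the description in this section):

```latex
z = x_1 w_1 + x_2 w_2 + x_3 w_3 + b
```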
So to calculate the z value, each input is multiplied by its corresponding weight. All these values are summed up and then added to a bias. After the z value is calculated, an activation function is applied to the total, giving us the final output from a node. Some activation functions commonly used in hidden layer nodes are ReLU, Tanh, or Sigmoid while some activation functions commonly used in the output layer are Sigmoid or Softmax. I will use ReLU for the examples in this article as it is easy to differentiate. Below is what a ReLU function looks like when graphed:
A ReLU function takes the maximum between 0 and the z value as the function specifies below:
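Using the a and z notation from earlier, that definition can be written as:

```latex
a = \mathrm{ReLU}(z) = \max(0, z)
```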
After the a value has been calculated, the computation for that node is complete and this value can be passed on to the following layers. The a calculation that we just computed is only for one node in a layer. Layers tend to have more than 1 node, so you can imagine repeating this computation for every node in a layer.
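The single-node computation described above can be sketched in a few lines of plain Python. The input, weight, and bias values here are made up for illustration and do not come from the figure:

```python
# Hypothetical values for a node with 3 inputs (assumed for this sketch)
x = [1.0, 2.0, 3.0]     # inputs x1, x2, x3
w = [0.5, -1.0, 0.25]   # weights w1, w2, w3
b = 0.5                 # bias

# z: multiply each input by its weight, sum the products, then add the bias
z = sum(x_i * w_i for x_i, w_i in zip(x, w)) + b

# a: apply the ReLU activation to z
a = max(0.0, z)

print(f"z = {z}, a = {a}")
```

With these particular values z comes out negative, so ReLU clips the output of the node to 0.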
That’s all there is to the forward feed part of a neural network. The network is just built of layers which are made of nodes, and each node follows the formulas specified above.
For better computation speeds, dot products are used to make the z calculation for an entire layer as opposed to a single node as per the formula below:
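Based on the explanation in the next paragraph, the layer-wide formula would look like this, where x is the input vector, W is the weight matrix with one row per node, and b is the vector of biases (these symbol names are assumptions):

```latex
\mathbf{z} = \mathbf{x} \cdot W^{T} + \mathbf{b},
\qquad
\mathbf{a} = \mathrm{ReLU}(\mathbf{z}) = \max(0, \mathbf{z})
```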
So, we can calculate the z values for a layer from the dot product of the inputs and the transposed weights (the T superscript means the weight matrix is transposed) plus the bias for each node. Then a ReLU function can be applied to those z values. Below is an example that you can code in Python using numpy:
import numpy as np

# In this example, the previous layer has 4 nodes, which is
# why the input has 4 values.
# The layer we are calculating has 3 nodes, and each node in
# this layer takes in 4 inputs. This is why the
# weights array is a 3x4 array (3 nodes, 4 inputs).
# There are 3 biases, 1 for each node in this layer.
inputs = np.array([1, 2, 3, 4])
weights = np.array([[1, 0.5, -0.5, 1],
                    [0.5, -1, 1, -0.5],
                    [0, 1, -1, 0.5]])
biases = np.array([10, 9, 8])

# Calculate the z values
z = np.dot(inputs, weights.T) + biases

# Calculate the activation values
a = np.maximum(0, z)
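To see that the dot product is just a faster way of doing the per-node calculation, here is a small sketch that computes the z values both ways with the same numbers as the example above. The variable names z_per_node, z_layer, and a_layer are chosen here for illustration:

```python
import numpy as np

inputs = np.array([1, 2, 3, 4])
weights = np.array([[1, 0.5, -0.5, 1],
                    [0.5, -1, 1, -0.5],
                    [0, 1, -1, 0.5]])
biases = np.array([10, 9, 8])

# Node by node: one weighted sum per row of the weights array, plus that node's bias
z_per_node = np.array([
    sum(x_i * w_i for x_i, w_i in zip(inputs, node_weights)) + node_bias
    for node_weights, node_bias in zip(weights, biases)
])

# Whole layer at once with a dot product
z_layer = np.dot(inputs, weights.T) + biases
a_layer = np.maximum(0, z_layer)

print(z_per_node)  # one z value per node
print(z_layer)     # same values, computed in a single step
print(a_layer)     # ReLU leaves these unchanged because all three are positive
```

Both approaches give z values of 14.5, 8.5, and 9 for the three nodes.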
That’s how the forward feed method works. The following section talks about how to analyze the performance of the network.
Loss Functions
The performance of a network is measured by sending the outputs of the network through a loss (or cost) function. The loss function evaluates how well the network did when making predictions on data. This function is essential for the backpropagation process, which we will go over in the next section.
A loss function can be any function you want it to be, but a suitable loss function will return a low value (one which is closer to 0) the closer the neural network is to reaching your goal and a high value if the network was far from reaching the desired goal. Note: for most loss functions, the best possible loss value is 0, while the worst possible value is unbounded (-inf or inf). For example, suppose I wanted to classify an image as being a cat or a dog. This task is a binary task where we could represent a 1 as being a cat and a 0 as being a dog. So, we want a neural network with a single output node which gives us a value between 0 and 1. If the output value is greater than or equal to 0.5, we can say the network thought the image was a cat, and if the output value is less than 0.5, we can say the network thought the image was a dog.
To find the loss of this network, we will send the output (which is between 0 and 1) through the BCE (Binary Cross Entropy) Loss function, which is shown below:
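Pieced together from the three parts discussed in the rest of this section, the full function would be the following (the name BCE on the left-hand side and the 1-through-N indexing are assumptions):

```latex
\mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N}
\left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right]
```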
First, let’s go over the notation for this function:
- N = The number of training examples
- yᵢ = The true label
- ŷᵢ = The predicted label (what the network predicted)
There are three main parts to this loss function, and we will break each part down. The first part is shown below:
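Going by how this term is written out later in this section, the first part is:

```latex
y_i \cdot \log(\hat{y}_i)
```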
This term multiplies the true label by the log of the predicted label. So let’s say we wanted to predict that an image was a cat. Then yᵢ would be 1 since we represent a cat as a 1 in the example described above. So if our network predicted that the image was a cat with 100% certainty, then ŷᵢ would also be 1. Since the log of 1 is 0, this term will become 0.
If our network predicted that the image was a cat with a value of 0.75, then ŷᵢ is 0.75. The natural log of 0.75 is about -0.29, so this term becomes -0.29.
If our network predicted that the image was a dog with an output of 0, then ŷᵢ is 0. The log of 0 is negative infinity, so this term becomes negative infinity.
Notice that this term becomes more negative as the predicted value, ŷᵢ, moves further from the true label of 1. In other words, the size of this term reflects how far the prediction is from the actual value.
If yᵢ is 0, meaning we want the network to predict that the image is a dog, then this term is always 0.
The next term is as follows:
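Going by how this term is written out later in this section, the second part is:

```latex
(1 - y_i) \cdot \log(1 - \hat{y}_i)
```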
This term looks very similar to the last one, but the y values are subtracted from 1. In the previous example, yᵢ was 1 since we wanted the network to predict that the image was a cat. For this term, if yᵢ is 1, then the term will always be 0.
If yᵢ is 0, then this term will follow how the previous term worked when yᵢ was 1 but in the opposite direction. So if ŷᵢ is 0, then the log value is 0 since the log of 1-0 is 0.
If ŷᵢ is 0.75, then the log value is log(0.25), which is about -1.39.
If ŷᵢ is 1, then the log value is log(0), which is negative infinity.
As you can see, when the true label, yᵢ, is 0, the term decreases as the predicted value gets further from 0.
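The behavior of both terms can be checked with a short sketch. The function names first_term and second_term are made up here, and the predictions are kept strictly between 0 and 1 because the log of 0 is not a finite number:

```python
import math

def first_term(y, y_hat):
    # y * log(y_hat): only contributes when the true label y is 1
    return y * math.log(y_hat)

def second_term(y, y_hat):
    # (1 - y) * log(1 - y_hat): only contributes when the true label y is 0
    return (1 - y) * math.log(1 - y_hat)

for y_hat in (0.99, 0.75, 0.5, 0.25, 0.01):
    print(f"y_hat={y_hat:<5}"
          f"  first term (y=1): {first_term(1, y_hat):8.4f}"
          f"  second term (y=0): {second_term(0, y_hat):8.4f}")
```

The first term becomes more negative as ŷᵢ moves toward 0, while the second becomes more negative as ŷᵢ moves toward 1.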
Combining the two terms, when the true label (yᵢ) is 1, the first term (yᵢ ∙ log(ŷᵢ)) is active. In this case, as ŷᵢ gets closer to 0 (closer to the wrong label), the sum of the two terms gets closer to negative infinity. When the true label (yᵢ) is 0, the second term ((1-yᵢ) ∙ log(1-ŷᵢ)) is active. In this case, as ŷᵢ gets closer to 1 (closer to the wrong label), the sum of the two terms also gets closer to negative infinity. The range of values these terms can give is [-∞, 0], where 0 means ŷᵢ=yᵢ and -∞ means ŷᵢ=(1-yᵢ).
So these terms are basically saying as the predicted value, ŷᵢ, gets further from the actual value, yᵢ, the loss gets closer to negative infinity.
The final term is as follows:
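Based on the description that follows (a sum over the training examples, an average, and a negative sign), the final part would be the following, with the examples indexed 1 through N as an assumed convention:

```latex
-\frac{1}{N} \sum_{i=1}^{N}
```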
When training a neural network, you don’t want to give the network a single example and have it learn from that one example. You want the network to learn from several examples. This is why there is a summation over i, which runs over all N training examples. Then, after summing all the loss values up, we take the average (which is why the 1/N term is there).
Remember that the sum of the two terms lies in the range [-∞, 0], where -∞ means the network performed as badly as possible and 0 means the network performed as well as possible. The problem is that this is unintuitive: more negative values look like a smaller loss, when in reality the network is performing terribly. Remember that 0 is the “low” value in a loss function while -∞ and ∞ are the “high” values. So, the negative sign flips the range around to [0, ∞], making it much more intuitive. After applying the negative sign, a larger positive loss means the network is performing worse, and a loss close to 0 means it is performing really well. The negative sign doesn’t change what the network is asked to learn, but it turns the problem into one of minimizing the loss (without it, we would have to maximize the value instead), and it makes the loss function a lot more intuitive and easier to debug.
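Putting the three parts together, here is a minimal sketch of the BCE loss in numpy. The function name bce_loss and the eps clipping are assumptions added for this sketch. The clipping keeps predictions away from exactly 0 and 1 so the log never returns infinity:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float)
    # Keep predictions strictly inside (0, 1) so log() stays finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    # The two terms, one of which is zero for every example
    terms = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    # Average over the N examples and flip the sign
    return -np.mean(terms)

labels = [1, 0, 1, 0]                  # 1 = cat, 0 = dog
good_preds = [0.9, 0.1, 0.8, 0.2]      # mostly right
bad_preds = [0.1, 0.9, 0.2, 0.8]       # mostly wrong

print(bce_loss(labels, good_preds))    # small positive loss
print(bce_loss(labels, bad_preds))     # much larger loss
```

The predictions that sit close to the true labels give a loss near 0, and the predictions that sit close to the wrong labels give a noticeably larger one.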
The BCE loss function is one of many. In fact, any function can be a loss function, but that function may not be representative of what you want the network to learn. In the next section, we will go over a basic overview of how backpropagation works.
Backpropagation
Backpropagation is the process of updating the weights in a neural network slightly to help it perform better and hopefully learn the desired task. Instead of going from left to right as we did in the forward feed process, this process goes from right to left through the network and comes after the forward feed method has been completed.
Going back to the loss function that was described earlier, the goal of backpropagation is to minimize the loss as much as possible, and hopefully, it will find a global minimum on that function. This procedure is done through the use of the chain rule with partial derivatives.
Let’s start with a visualization of a basic neural network with 2 inputs and 1 output:
The diagram above may look a bit strange, but that’s because the z and a calculations are broken up into different nodes. For simplicity, the loss is equal to the a value, but the same math applies to other loss functions. The forward feed has already been calculated, and the loss ends up being 4.
The goal of backpropagation is to take the partial derivative of the loss with respect to each weight and the bias, and then update the weights and bias using those derivatives. We do this by using the chain rule. To start, let’s take the partial derivative of the loss with respect to the loss:
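In symbols (writing the loss simply as Loss, a notational choice made here):

```latex
\frac{\partial Loss}{\partial Loss} = 1
```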
Of course, this value equals 1, and since the derivative of the loss with respect to the loss is always 1, this value is not going to be used when deriving other variables. Continuing backward, let’s calculate the partial derivative of the loss with respect to a.
Yet again, this is 1. Since the loss function is just a, the partial derivative of the loss function with respect to a is just 1. Unlike the previous derivative of the loss, this value is important even though it is just 1. Imagine the derivative of the BCE loss with respect to ŷᵢ. That value is not just 1 and affects all other derivatives.
Now let’s compute the derivative of the activation function which is ReLU. The derivative of ReLU is very easy as it’s 0 when z<0 and 1 when z>0. When z is 0, the derivative is undefined, so generally, we say the derivative at that point is 0. Below is the updated graph with the derivative of the Loss with respect to the z value:
Since the value of z is 4, which is greater than 0, the derivative of the activation a with respect to z is 1. But that’s just the derivative of the activation function. What we want is the derivative of the loss with respect to the z value. To do this, we use the chain rule. All we have to do for the chain rule is multiply the previously calculated derivative by the new one to get the derivative of the loss with respect to the z value. The final value is still 1, but that is because both the derivative of the loss with respect to a and the derivative of the activation function are 1.
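Written as equations, and treating the derivative at z = 0 as 0 as described earlier, this chain rule step looks like:

```latex
\frac{\partial a}{\partial z} =
\begin{cases}
1 & \text{if } z > 0 \\
0 & \text{if } z \le 0
\end{cases}
\qquad
\frac{\partial Loss}{\partial z}
= \frac{\partial Loss}{\partial a} \cdot \frac{\partial a}{\partial z}
= 1 \cdot 1 = 1 \quad (\text{since } z = 4 > 0)
```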
Now let’s calculate the derivative of the weights and biases. Recall the formula of the z value in this example is as follows:
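For the two-input node in this example, the formula would be (a reconstruction that matches the forward feed line in the Python code at the end of this section):

```latex
z = x_1 w_1 + x_2 w_2 + b
```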
Below are the derivatives for all values in this function:
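Differentiating z = x₁w₁ + x₂w₂ + b with respect to each value in turn gives the following (these match the dx1, dw1, dx2, dw2, and db lines in the Python code at the end of this section):

```latex
\frac{\partial z}{\partial x_1} = w_1, \quad
\frac{\partial z}{\partial w_1} = x_1, \quad
\frac{\partial z}{\partial x_2} = w_2, \quad
\frac{\partial z}{\partial w_2} = x_2, \quad
\frac{\partial z}{\partial b} = 1
```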
One important note is that there is no reason to update the x values, since they are the inputs to the network and change for each iteration. The derivative with respect to x matters when computing derivatives between layers, which is something we will go over in the next article. On the other hand, the derivatives with respect to the weights and bias are what we want to use to update the network. Using the chain rule, we get the following derivatives of the loss function with respect to the x, w, and b values:
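Each one is the local derivative of z multiplied by the derivative of the loss with respect to z, which is 1 here. Plugging in the parameter values used in the Python code at the end of this section (x₁ = 2, x₂ = -4, w₁ = 0.5, w₂ = -0.5, b = 1):

```latex
\frac{\partial Loss}{\partial w_1} = \frac{\partial z}{\partial w_1} \cdot \frac{\partial Loss}{\partial z} = x_1 \cdot 1 = 2
\qquad
\frac{\partial Loss}{\partial w_2} = x_2 \cdot 1 = -4
\qquad
\frac{\partial Loss}{\partial b} = 1 \cdot 1 = 1

\frac{\partial Loss}{\partial x_1} = w_1 \cdot 1 = 0.5
\qquad
\frac{\partial Loss}{\partial x_2} = w_2 \cdot 1 = -0.5
```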
Let’s update the diagram with these new derivatives:
Now that we have the derivatives of the loss with respect to the weights and bias, we can update the weights and bias. Since we want to minimize the loss, we subtract each derivative from the corresponding weight or bias. In practice, the derivatives are first multiplied by a constant named α (alpha), commonly called the learning rate, which is used to make the update more stable. α is usually some value greater than 0 and at most 1. This way, we reduce the amount by which the weights and biases are updated. In this case, let’s use α = 0.1. Below are the formulas used to update each parameter:
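With α = 0.1 and the parameter values from the Python code at the end of this section (w₁ = 0.5, w₂ = -0.5, b = 1, x₁ = 2, x₂ = -4), the updates work out as follows (a reconstruction from the description above):

```latex
w_1 \leftarrow w_1 - \alpha \frac{\partial Loss}{\partial w_1} = 0.5 - 0.1 \cdot 2 = 0.3

w_2 \leftarrow w_2 - \alpha \frac{\partial Loss}{\partial w_2} = -0.5 - 0.1 \cdot (-4) = -0.1

b \leftarrow b - \alpha \frac{\partial Loss}{\partial b} = 1 - 0.1 \cdot 1 = 0.9
```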
So let’s update the diagram with the new weights and biases:
You’re probably wondering why we did all that work for a little update to the weights and biases. This update may look small, but remember we wanted to minimize the loss. Let’s see what the loss is now:
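Running the forward feed again with the same inputs (x₁ = 2, x₂ = -4, as in the Python code at the end of this section) and the parameters after one update with α = 0.1 (w₁ = 0.3, w₂ = -0.1, b = 0.9) gives:

```latex
z = x_1 w_1 + x_2 w_2 + b = 2 \cdot 0.3 + (-4) \cdot (-0.1) + 0.9 = 1.9

Loss = a = \max(0, 1.9) = 1.9
```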
Before the update, the loss was 4, but after the update, the loss is now 1.9. If we do another update starting from the new weights and bias, we will find that the loss decreases again, and it keeps decreasing with further updates until it reaches its minimum. This is why backpropagation is so useful and why it is so good at optimizing a loss function.
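The single update described above can be checked with a short sketch. The helper names forward and update_step are made up for this example, and the parameter values are the ones used in the Python code at the end of this section:

```python
def forward(x1, x2, w1, w2, b):
    # Forward feed: weighted sum, then ReLU. The loss is simply a in this example.
    z = x1 * w1 + x2 * w2 + b
    a = max(0, z)
    return z, a

def update_step(x1, x2, w1, w2, b, alpha):
    z, _ = forward(x1, x2, w1, w2, b)
    dLoss_da = 1                   # the loss is a, so dLoss/da = 1
    da_dz = 1 if z > 0 else 0      # ReLU derivative
    dLoss_dz = dLoss_da * da_dz    # chain rule
    # dz/dw1 = x1, dz/dw2 = x2, dz/db = 1
    dLoss_dw1 = x1 * dLoss_dz
    dLoss_dw2 = x2 * dLoss_dz
    dLoss_db = 1 * dLoss_dz
    # Move each parameter a small step against its derivative
    return (w1 - alpha * dLoss_dw1,
            w2 - alpha * dLoss_dw2,
            b - alpha * dLoss_db)

x1, x2 = 2, -4
w1, w2, b = 0.5, -0.5, 1
alpha = 0.1

_, loss_before = forward(x1, x2, w1, w2, b)
w1, w2, b = update_step(x1, x2, w1, w2, b, alpha)
_, loss_after = forward(x1, x2, w1, w2, b)

print(f"Loss before: {loss_before}, loss after: {loss_after:.2f}")
```

The loss starts at 4 and drops to roughly 1.9 after the update, up to floating-point rounding.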
The calculations we just did are for a single node in a layer, but the same math applies to all other nodes in all other layers.
Below is some Python code that updates the network 100 times. Notice how the loss decreases toward 0, which is the lowest this loss value can be.
import numpy as np

# The parameters
x1 = 2
x2 = -4
w1 = 0.5
w2 = -0.5
b = 1
alpha = 0.1

# Loop 100 times, which is the number of times
# an update will happen to the network
for i in range(0, 100):
    # Forward feed
    z = x1*w1 + x2*w2 + b
    a = max(0, z)
    Loss = a
    print(f"Loss at step #{i+1} : {Loss}")

    # Backpropagation
    dLoss = 1  # dLoss/dLoss (always 1, not used below)
    da = 1     # dLoss/da, since Loss = a
    dz = (1 if z > 0 else 0)  # da/dz, the ReLU derivative
    dLoss_dz = dz*da

    # Derivatives of z with respect to each value
    dx1 = w1
    dw1 = x1
    dx2 = w2
    dw2 = x2
    db = 1

    # Chain rule: derivatives of the loss with respect to each value
    dLoss_dx1 = dx1*dz*da
    dLoss_dw1 = dw1*dz*da
    dLoss_dx2 = dx2*dz*da
    dLoss_dw2 = dw2*dz*da
    dLoss_db = db*dz*da

    # Update the parameters using their derivatives
    w1 = w1 - alpha*dLoss_dw1
    w2 = w2 - alpha*dLoss_dw2
    b = b - alpha*dLoss_db
That’s the basic idea behind a neural network (specifically a multilayer perceptron (MLP)). There are many other types of neural networks, like a convolutional neural network (CNN), which deals with images, or a recurrent neural network (RNN), which deals with text, but the model we just went over is the basis of almost all other neural network models.
In the next article, we will code a neural network that splits points in a 2-D space into two colors using some training data with numpy.