Combining Individual Neurons Into A Feedforward Neural Network
How to implement a modular feedforward neural network and let it learn with back propagation
If you have been following along, we have laid a lot of groundwork for understanding neural networks by modeling a single neuron. We even wrote the code to do so, which will be extended into a robust neural network library you can use to prototype and learn.
The next step is to take this mental model of a single neuron and see how we can chain neurons together into a neural network. Defining our neurons and error functions in a modular way actually makes it pretty easy to extend them and learn on larger data.
Let’s zoom out and look at a neural network with one hidden layer again.
This particular neural network has 2 inputs, 2 bias terms, 4 hidden units, and 2 outputs. What benefit does all this extra complexity give us over a simple linear model?
Well, the fact of the matter is that not all datasets in the real world are perfectly linear. In fact, most are not. There are usually complex relationships between variables in our world that cannot be modeled by the sum of their parts.
Let’s take a silly example to build up our intuition about why neural nets can solve more complex problems.
This example involves a mythical creature you may recognize as a minotaur.
Minotaurs have quite the interesting quality of being half human and half horse. Fun fact - they are also known to live in labyrinths - and are not real. Likelihood of seeing one in the wild aside, let’s say we wanted to teach our simple linear neuron about this mythical creature. The task at hand: we want to be able to distinguish between a human, a horse, and a minotaur, given some information about them.
The problem laid out on a graph looks like this:
There are two input variables here. The x-axis represents if the creature has a human body. At x=1, the creature has a human body, and at x=0, the creature does not. For simplicity purposes, there is no in-between. The y-axis represents whether the creature has horse legs or not. y=1 means the creature has horse legs, and y=0 means they do not.
This leaves us with 3 points on the graph, representing whether the creature is a human, a horse, or a minotaur. This may be a very simple dataset, but the three classes are not linearly separable, meaning we cannot draw a single straight line that separates the three classes. There is a non-linear relationship between our input variables.
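In code, the dataset might be sketched like this (a minimal Python sketch, with a hypothetical encoding matching the axes above):

```python
# A sketch of the dataset, assuming the 0/1 encoding described above:
# each example is (has_human_body, has_horse_legs) plus a class label.
data = [
    ((1, 0), "human"),     # human body, no horse legs
    ((0, 1), "horse"),     # horse legs, no human body
    ((1, 1), "minotaur"),  # both features on at once
]

# A single line w0*x + w1*y + b = 0 splits the plane into only two regions,
# so no single linear boundary can assign all three labels correctly.
```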
We could try to draw a line to separate some of the data - for example, one that decides whether the input is a minotaur or not. This is fine, but it would not be able to distinguish between horse and human.
Or we could choose a line that decides between human or not human, but cannot decide between horse or minotaur.
Or finally a line that decides between horse or not horse, but cannot distinguish between a human or a minotaur.
There is not a single straight line that could separate all three.
Let’s extend our linear model from before to handle multiple inputs, and see why this may be.
Here we have added an extra input and an extra weight, and added them to the summation. This is a more common form of the linear regression we saw earlier. Linear regression generalizes to many input parameters by adding another weight for each new input.
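As a sketch (Python, with hypothetical names rather than the library code from earlier posts), a multi-input linear neuron is just a weighted sum plus a bias:

```python
def linear_neuron(inputs, weights, bias):
    """Compute w_0*x_0 + w_1*x_1 + ... + w_n*x_n + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Two inputs, two weights, one bias term
print(linear_neuron([1, 2], [3, 4], 5))  # 3*1 + 4*2 + 5 = 16
```

Adding another input just means adding another weight to the sum, which is exactly how linear regression generalizes.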
We call each input to the model a “feature”, because they describe some fact about the data. In the minotaur case, we need the model to know whether both features are on at the same time or not. We need to know if the creature has both a human body and horse legs, or just one of either.
This linear model cannot assign a weight to “having both features on at the same time” because the features are separate, and each have their own separate weight. It would be ideal if we could learn a new feature that indicates whether both input features are present or not, then assign a weight to this feature to tell us if it is important or not. We as machine learning engineers do not want to decide whether features are important, we should leave this up to the model.
This is where our hidden layer comes into play. A hidden layer is a set of neurons between the input and output that are connected and run through a non-linear function. The hidden layer can learn the dependencies between the input features, and assign weights to these dependencies. Let's make a new model that reflects this.
In our new model, we can see that there are two more neurons at play here, h_0 and h_1. There are also many new weights in-between the neurons (6 total to be exact).
This model has the ability to combine information from both of the input neurons, pass them through the hidden layer, and use its last set of weights to make the final decision. Now the hidden layer can see if both inputs are on at the same time, aggregate this information, and pass it on to the next layer.
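As a rough sketch (Python, hypothetical names; the real modular versions come later in this post), the whole forward pass of this model chains the two linear layers through the hidden non-linearity:

```python
def forward(x, W1, b1, W2, b2):
    """Linear layer -> non-linearity -> linear layer, for a single output."""
    # First linear layer: one weighted sum per hidden neuron
    o = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    # Non-linearity in the hidden layer (here ReLU, introduced below)
    h = [max(0.0, v) for v in o]
    # Second linear layer: combine the hidden values into one prediction
    return sum(w * hv for w, hv in zip(W2, h)) + b2
```

For example, `forward([1.0, 1.0], [[1.0, 1.0], [1.0, -1.0]], [-1.5, 0.0], [2.0, 1.0], 0.0)` returns 1.0.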
In order to implement this type of architecture, let’s jump back to the different components we have learned about in previous posts. Trust me, they will come in handy again. I didn’t make you go through all that tedious math for nothing ;)
Remember Lisa and Edgar? They had some simple jobs that we laid out in a previous post. Lisa’s job was to take the linear combination of inputs and sum them up. Edgar’s job was to calculate an error.
We need to add a new character to the mix for our more complex model to work. Let’s call him "Alex the Activation Function".
Each of our characters is going to have two jobs. First, calculate how to pass signal forward to get the right answer. Then, calculate how to send signal backward to correct mistakes.
You can think of these characters as modules in our program that we can chain together. In the end, these three modules will be all we need to construct arbitrary size and depth neural networks.
Let’s start with Lisa again. In our simple example, we had one input, and one output.
In a more complex version of Lisa, we are going to change her to have an arbitrary number of inputs that compute an arbitrary number of outputs.
She will take a set of inputs, x_0 - x_n, and take their weighted sum to compute a set of outputs O_0 - O_m.
In this case she has 2 inputs, 2 outputs, and a bias. The number of inputs and outputs is the same here, but this does not have to be the case. I could have 2 inputs and 5 outputs, or 6 inputs and 3 outputs. It depends on how many input features you have, and how many features you think the model needs to learn in an intermediate calculation. Coming up with the exact size of a hidden layer is not an exact science. Often we guess based on how complex the data is, and validate our guess at the end. More on picking the size of hidden layers in a later post.
Lisa’s new calculation ends up being just a matrix multiplication of the input and the weights. Assuming x is the input matrix and w is a weight matrix, the output O would be calculated by:
If you are not familiar with how matrix math works, to get a single output O_ij you simply multiply each value in the ith row by the corresponding value in the jth column, and sum up all the values. This is the same linear function we had with a single variable, just extended to work with more inputs and outputs.
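For example, with NumPy (hypothetical numbers, just to show the shape of the computation):

```python
import numpy as np

x = np.array([[1.0, 2.0]])     # 1 example with 2 input features
w = np.array([[0.5, -1.0],     # weight matrix mapping 2 inputs to 2 outputs
              [0.25, 0.75]])

# O[i, j] = sum over k of x[i, k] * w[k, j]: row i of x dot column j of w
O = x @ w
print(O)  # [[1.0, 0.5]]
```

Each output is exactly the weighted sum Lisa has always computed, just done for every input/output pair at once.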
Next let’s introduce our new character, Alex Activation. He is going to take the output of Lisa, and decide how much signal gets passed through to the next set of neurons.
There are a few different forms Alex could take, but to keep the math easy and intuitive, we are going to have him be what is called a “Rectified Linear Unit”, or ReLU for short. The ReLU function is just a thresholding function that is 0 if the input is negative, and linear if the input is positive.
Alex will compute this function for each one of Lisa’s outputs, making it so either the signal is passed through unchanged, or zero signal is passed through. You can think of this as a switch on a set of train tracks. It looks at the value coming through: if it is positive, it lets it through, and if it is negative, it does not. The weights from the previous neurons will be trained so that information gets routed accordingly.
You may remember that these modules also need to calculate their derivatives in order to pass adjustments with respect to the error back through the network. The derivative of the ReLU function is pretty straightforward.
It is simply 1 if the input is greater than zero, and 0 if it is less than zero (at exactly zero the derivative is undefined, and in practice we just pick 0). Think about the slope of our function and the rules for derivatives, and this should make sense.
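In code, Alex’s two jobs might look like this (a minimal Python sketch):

```python
def relu(x):
    """Forward: pass positive signal through unchanged, block the rest."""
    return x if x > 0 else 0.0

def relu_derivative(x):
    """Backward: the slope of ReLU, 1 for positive inputs and 0 otherwise."""
    return 1.0 if x > 0 else 0.0

print(relu(2.5), relu(-1.0))                        # 2.5 0.0
print(relu_derivative(2.5), relu_derivative(-1.0))  # 1.0 0.0
```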
Our last module, Edgar the Error Function, will actually stay exactly the same as we had him before. He takes the output from Lisa and computes the error.
The biggest difference from our simple model earlier is how we are going to connect all these modules. We are going to stack them one by one until we get to Edgar.
Above is a more verbose picture of our neural network from before, with our characters hovering over their appropriate modules.
The math for the forward pass is pretty straightforward. I am going to lay down a lot of equations for the rest of the post, but refer to the diagram above to see where all the variables come from. For the first layer, each output O_i will be calculated as a weighted sum of the inputs.
For the hidden layer, the values will be calculated via our non-linear activation function:
The output y_pred will be another linear transform:
The error function will take in the target value y_real, and compute the squared error:
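Putting the whole forward pass in one place, the equations can be written as follows (I am writing w_ij for the first-layer weights, b_i for the biases, and v_i, c for the second-layer weights and bias; the diagram may use slightly different symbols):

```latex
O_i = \sum_j w_{ij} x_j + b_i                      % first linear layer (Lisa)
h_i = \mathrm{ReLU}(O_i) = \max(0,\, O_i)          % hidden activation (Alex)
y_{\text{pred}} = \sum_i v_i h_i + c               % second linear layer
E = (y_{\text{pred}} - y_{\text{real}})^2          % squared error (Edgar)
```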
We then back propagate the error through to update the weights. The first derivative is the error with respect to the predicted output:
Then we need to know what the influence of each weight in the second linear layer was on the prediction. We know that the derivative of a linear function is just equal to the value it multiplies.
Since these are functions of functions, chained together, we need to use the chain rule. This means we simply multiply these partial derivatives by the error derivative to get the gradient for the set of weights in the second linear layer.
This gradient will be multiplied by the learning rate and then subtracted from our existing weights to get our new set of weights for the last layer.
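As a sketch of the second-layer updates (writing v_i for the second-layer weights and η for the learning rate, names I am introducing here):

```latex
\frac{\partial E}{\partial y_{\text{pred}}} = 2\,(y_{\text{pred}} - y_{\text{real}})
\frac{\partial y_{\text{pred}}}{\partial v_i} = h_i
\frac{\partial E}{\partial v_i} = \frac{\partial E}{\partial y_{\text{pred}}} \cdot h_i
v_i \leftarrow v_i - \eta\, \frac{\partial E}{\partial v_i}
```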
I know this is turning into a wall of equations, which makes most people’s eyes glaze over... So here is a photo of a puppy in a cup to revive our energy.
Isn’t he so cute? He thinks you can make it through all this math and make it to the end. Thanks little guy! But in terms of the math, we are just about where we left off in the last post about linear models. This leaves us about halfway through our graph.
We have calculated the new weights in the second linear layer, and are ready to calculate the derivatives for the hidden layer.
There are no weights to update in the hidden layer, but we do need to apply the chain rule before passing the derivative of the error with respect to O_i to the next layer.
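In symbols, the hidden layer’s part of the backward pass might be written as (with h_i = ReLU(O_i), as in the forward pass):

```latex
\frac{\partial h_i}{\partial O_i} =
  \begin{cases} 1 & \text{if } O_i > 0 \\ 0 & \text{otherwise} \end{cases}
\frac{\partial E}{\partial O_i} = \frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial O_i}
```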
This first layer has 6 different weights which we will need to calculate derivatives for and update. Hopefully you have figured out the pattern. We calculate the derivative of the output with respect to the input, then multiply by the gradient from the previous layer to get the gradient with respect to the error. The gradient from each layer propagates back and can update the weights accordingly. This is why we call it back propagation.
Each partial derivative in the first linear layer is simple: it is just the input.
To get the gradient with respect to the error, we again use the chain rule.
And again we update the weights given the learning rate and this gradient (note this is for each of the 6 weights; I just didn’t want to write it down 6 times, so I used “i” instead).
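One way to write the first-layer updates (with w_ij as the weight connecting input x_j to output O_i, and η as the learning rate, names I am introducing here):

```latex
\frac{\partial O_i}{\partial w_{ij}} = x_j
\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial O_i} \cdot x_j
w_{ij} \leftarrow w_{ij} - \eta\, \frac{\partial E}{\partial w_{ij}}
```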
There are a lot of equations here, but the pattern is consistent. I know you have been dying to see the minotaur problem solved, but at this point, it will be easier to write the code than to work out an example with a real dataset by hand.
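To make the pattern concrete, here is a rough sketch of one full forward and backward step in plain Python (hypothetical names and an arbitrary learning rate; the modular library version is the subject of the next post):

```python
lr = 0.1  # learning rate (an arbitrary choice for this sketch)

def step(x, y_real, W1, b1, W2, b2):
    """One gradient step on a single example for a 2-layer network."""
    # Forward pass: Lisa -> Alex -> Lisa -> Edgar
    O = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, o) for o in O]                       # ReLU hidden layer
    y_pred = sum(w * hv for w, hv in zip(W2, h)) + b2  # second linear layer
    error = (y_pred - y_real) ** 2                     # squared error

    # Backward pass: apply the chain rule layer by layer
    dE_dy = 2 * (y_pred - y_real)                      # error w.r.t. prediction
    dE_dW2 = [dE_dy * hv for hv in h]                  # second-layer weight grads
    dE_dh = [dE_dy * w for w in W2]                    # gradient into hidden layer
    dE_dO = [g if o > 0 else 0.0 for g, o in zip(dE_dh, O)]  # through ReLU
    dE_dW1 = [[g * xi for xi in x] for g in dE_dO]     # first-layer weight grads

    # Gradient descent updates
    W2 = [w - lr * g for w, g in zip(W2, dE_dW2)]
    b2 = b2 - lr * dE_dy
    W1 = [[w - lr * g for w, g in zip(row, grads)]
          for row, grads in zip(W1, dE_dW1)]
    b1 = [b - lr * g for b, g in zip(b1, dE_dO)]
    return error, W1, b1, W2, b2
```

Calling `step` repeatedly on one example should drive the error down, which is a quick sanity check that the gradients point the right way.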
In the next post we will take the minotaur problem from above, and modify our existing codebase to solve it with a small neural network. The equations here will be a good guide, but as long as each module knows its forward equation and backward derivative, we will see the chain rule work its magic.