Writing The Code For A Feedforward Neural Network
The art of modular back propagation in C++
In this post we are going to take the code we wrote from our linear regression experiments (which you can find here) and extend it to work for more generalized neural networks. As explained in the last post we are going to solve the non-linear problem of classifying minotaurs, humans, and horses, with a feedforward neural network.
As a refresher on what the data will look like, take a look at this graph:
Here we have 3 points - human, horse, and minotaur - and 2 features - "has human body?" and "has horse legs?". The problem is, we simply cannot separate these 3 points with a linear model, given these features. Knowing that something is a minotaur requires the knowledge that both input features are on, which is not encoded in our feature set.
Sure, we could add more features to make the classes separable with a linear model, but this seems like a waste since we already have all the information we need in the two features we already have. A better solution is to pick a model that can derive higher level features, and learn the distinction itself.
This is where our feedforward neural network comes in. We are going to code up a very small one, but in a modular way that can be extended to larger neural nets in future posts. Our small but effective neural net will look like this:
We've already laid out the algorithm for this little guy in the last post, so let's hop right on into the code. If you did not write the code last time, or want a fresh starting point, here is the code on github you can download as a starting point https://github.com/curiousinspiration/Linear-Regression-With-Single-Neuron
The first thing we are going to do is refactor a little bit. We currently have all our classes defined in our main.cpp file, which is fine for small examples, but will get a little messy as we extend our codebase. Let's add a little directory structure, and a CMakeLists.txt file to get us going.
Create your directory structure as follows:
mkdir -p include/neural
mkdir -p src
mkdir -p tests
mkdir -p tools/feedforward_neural_net
We will put all of our headers inside include/neural, our implementations in src, unit tests in tests, and finally tools that use our library in, you guessed it, tools.
We are going to get rid of build.sh and replace it with a more robust CMakeLists.txt file. Create this file in the same directory as you ran the commands before. Your directory structure should look like this:
Now let's edit our CMakeLists.txt file and add the following.
This CMake file will look in our src directory for all the .cpp files, compile them into a library, then link the library to a tool called feedforward_neural_net. It requires a few extra dependencies, mainly GLog and GTest from Google. If you want to know more about why I use this directory structure, or how to install GLog / GTest, check out this blog post.
Assuming you know the basics about CMake, or have read the blog post I just listed, let's start ripping out code from our old main.cpp file, into a nice reusable library. The first module we are going to add is our LinearLayer from before. Right now the linear layer only takes one input, and computes one output, but let's make sure we can compile and run it before we get ahead of ourselves.
Create a new directory in our include/neural directory called layers, and add the file linear_layer.h
Inside this file, we are going to take the definition of our linear layer from the linear regression example, and move it to this header file.
You will notice the only change we have made here from before is adding a namespace around the class definition. It is always a good idea in C++ to wrap your classes in a namespace, to avoid name conflicts at compile time. You will also have to add the std:: prefix to our vectors, since it is not a good idea to pull namespaces into header files.
Next we are going to add the .cpp implementation, again, stealing it from our old main.cpp file.
We wrapped our implementation in our namespace, and we should be good to go. Let's create a new main.cpp file that uses our new library.
This should be enough to test out our CMakeLists.txt file, and start running our code. To build the project:
Give cmake the path to your gtest build directories:
cmake -D GTEST_INCLUDE_DIR=~/Code/3rdParty/googletest-release-1.8.0/install/include/ \
      -D GTEST_LIB_DIR=~/Code/3rdParty/googletest-release-1.8.0/install/lib/ ..
And now you should have an executable you can run, that will give you a prediction given a single input.
We did not initialize GLog in our main.cpp file, which causes everything to be written to stderr, but this is fine for now. If you look on the second line of the output it will tell you the prediction, as well as some timestamps and which file the log came from. Good start!
Now that we can compile and run our new project, let's port over the rest of our code, starting with our Error module.
Create a new directory to hold headers for error functions called loss, and add the error function header file
In neural networks the term "error", "loss", and "cost" are often interchangeable in the literature. Most popular neural network libraries use the term loss, so we are going to follow this precedent. As our library matures, you will see there are many different loss functions one can use, which is why we are making a separate directory to organize them into.
Now add the code from earlier for our squared error loss function.
Then add the forward pass to our main.cpp to make sure it still compiles and runs:
Now we should see the calculated error printed to our console as well.
Great! We have separated out our modules into a library we can link to different tools and tests. But we have not yet made any progress on extending our code to multiple variables, or a neural network. Let's start by modifying our LinearLayer class to take multiple inputs and produce multiple outputs.
This is how we modified Linear Lisa in the last post.
Let's take this knowledge and turn it into code. It is always good to write unit tests for new functionality before adding it to your main loop, so we are going to work through the rest of this code in a test driven development fashion. Some people roll their eyes at this style of development, saying it is extra work and takes too long, but I usually find it saves time in the long run. I think of it as writing a bunch of little "main" functions that I would have written anyway to print what is happening to the console.
GTest is our testing framework of choice for these tests, so let's add its main file, and our first test.
Inside tests/main.cpp we will add the standard GTest main function which will scan for all our tests and make an executable out of them.
Before we write any tests, let's create a simple typedef for a Matrix, which is simply a vector of vectors of floats.
Create a file called typedefs.h, and place it in the base include/neural directory.
Pretty simple. We prefix Matrix with a T, as TMatrix, to indicate that it is a typedef, and that if you would like to know the actual type, you should look in typedefs.h.
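For reference, the typedef itself can be as small as this (a sketch, assuming the row-major vector-of-vectors representation used throughout this post):

```cpp
// include/neural/typedefs.h (sketch)
#pragma once

#include <vector>

namespace neural {

// TMatrix: a row-major matrix. The outer vector holds the rows,
// and each inner vector holds the values in that row.
typedef std::vector<std::vector<float>> TMatrix;

} // namespace neural
```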
Now onto our first test. I like to write the test first, before changing the interface to our class, as it is a way to decide what we would like the interface to look like as an external user, not as someone who is writing the LinearLayer class. Add a file in tests called tests/linear_layer_test.cpp
This test should be pretty straightforward. We want to be able to compute a forward pass on the input data, given some weights. The constructor for our linear layer will allow us to pass in our initial weights, which it will use to multiply against the input to create the output. We are ignoring the bias term for now to get the most basic example up and running, but will add it back in later.
Our code will no longer compile, because we have defined a new interface for our linear layer. This interface should still work for our old example, and is extended to work with a variety of data. The next step is to modify our header and cpp file to conform to this new interface, and properly compile. Go back to the header and modify it as so.
Instead of using floats, and vectors of floats everywhere, we have changed all of our inputs, outputs, and member variables to matrices. For our single neuron example, these matrices will be 1x1, but now we will be able to operate with more variables. Onto the cpp.
In this step we are just getting the code to compile and not yet pass the unit test, so all we have done here is fill out the stubs with no implementations.
Finally go into our tools/feedforward_neural_net/main.cpp file and delete the code we had previously. We are going to have to rewrite this for our new interface after we test it out.
Now compile the code, and run the test
Guess what? Our test failed miserably! But that's okay :) this is exactly what we expected. We just wanted to get the code compiled and running again after the interface change.
In order to make this test pass, we are going to have to understand fully how matrix multiplication works. If you already understand it, feel free to skip to the implementation below. If you don't know, or want a refresher, let's start with a little 2x3 * 3x2 example.
Assuming "i" represents an index into a row, and "j" represents an index into a column, to get the (i, j) element when multiplying two matrices, you walk down the row i, and column j, and take the product (multiply) each pair of elements, then sum up all the products, to get the new value at (i,j).
I have highlighted the first row of matrix A and the first column of matrix B as an example of how to calculate element C[0,0] in the result matrix C. We first multiply the element in A[0,0] of the first matrix by the element B[0,0] in the second matrix to get (5*4). Then we move along the respective row and column to find the next values A[0,1] * B[1,0] = (3*9). Finally the last element in each row / column A[0,2] * B[2,0] = (-1*0). We now sum up all these products to get the final result (5*4) + (3*9) + (-1*0) = 47.
The same process applies to calculate C[1,0], except now we would take row 1 from matrix A and column 0 from matrix B. You can see the full calculations at the bottom of each box in the result matrix and try to follow along.
Notice that the number of columns in matrix A must match the number of rows in matrix B, and that the resulting matrix will take on the number of rows in matrix A and the number of columns in matrix B. For me it is easier to see visually when you look at the dimensions of the matrix.
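To make this concrete, here is a standalone multiply function that follows the row-times-column walk described above, applied to the 2x3 * 3x2 example. The first row of A and first column of B match the walkthrough; the remaining values are just illustrative, and the helper name is mine:

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<float>> TMatrix;

// Multiply an (n x m) matrix by an (m x p) matrix to get an (n x p) result.
TMatrix MatMul(const TMatrix& a, const TMatrix& b)
{
    std::size_t n = a.size();       // rows of A
    std::size_t m = b.size();       // rows of B, which must equal cols of A
    std::size_t p = b.at(0).size(); // cols of B
    TMatrix c(n, std::vector<float>(p, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
    {
        for (std::size_t j = 0; j < p; ++j)
        {
            // Walk down row i of A and column j of B, summing the products
            for (std::size_t k = 0; k < m; ++k)
            {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    return c;
}
```

With A's first row {5, 3, -1} and B's first column {4, 9, 0}, C[0][0] comes out to (5*4) + (3*9) + (-1*0) = 47, matching the walkthrough above.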
Now the values at the bottom of our test should make a little more sense. I laid out our calculations in a comment above, and test for each of these values in our output matrix.
Now let's try to implement our LinearLayer::Forward method to make this test pass, using the algorithm laid out above.
In our implementation, first we make sure that the sizes line up for a valid matrix multiplication, and throw an error if they do not. We are using stringstream to add the integer values into our error message, so we need to add the <sstream> include at the top:
Next we initialize our return matrix with the outer sizes of our input matrices. You may notice the convention of using "l_", "a_", and "m_" at this point. This is to distinguish between local variables, arguments, and member variables, which helps me stay away from some simple bugs.
Finally we perform the matrix multiplication. If you are not sure how this all works, see if you can walk through this code and line it up with the algorithm we described earlier.
If we build and run our code again, we will see beautiful green color saying our code has passed! Woohoo!
Pat yourself on the back, you have your first passing unit test for our neural net library. And it is an important one too. Matrices and matrix multiplication are a big part of what makes neural networks perform their magic. People often joke that there is no way deep neural nets can take over the world, since all they are doing is simple matrix multiplications. While this is a bit of a simplification, we have just implemented a core function, and can now use it to take over, well, not the world, but at least to solve some simple problems.
You may be wondering about our bias neuron from before; we kind of left it out of the equation so that we could focus on matrix multiplication. Let's add it back in, and see how it works with our new matrix paradigm.
If you remember, the bias term adds a new weight, which is always multiplied by the input 1.0. In fact, if you have more than 1 output, it adds 1 weight per output that it multiplies by 1. In matrix terms, this means if we want a bias, we can simply add a new row of weights to our weights matrix, and a new column of 1's to our input.
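As a small sketch of this trick (the helper name is my choice): appending a constant 1.0 column to the input means the matching extra row of weights gets multiplied by 1.0 for every example, which behaves exactly like a per-output bias term.

```cpp
#include <vector>

typedef std::vector<std::vector<float>> TMatrix;

// The "bias trick": append a constant 1.0 column to every input row.
// The matching extra row in the weights matrix then acts as a bias.
void AddBiasColumn(TMatrix& input)
{
    for (std::vector<float>& row : input)
    {
        row.push_back(1.0f);
    }
}
```

For example, the input row {2, 3} becomes {2, 3, 1}; multiplying it against a weights column {w0, w1, b} gives 2*w0 + 3*w1 + b, a linear combination plus a learnable bias.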
Andrej Karpathy explains it well in his intro to linear classification for CS231 here. This course has a lot of good content outside this "bias-trick", and I highly recommend it for people interested in the field.
Photo creds: http://cs231n.github.io/linear-classify/
If you think about how matrix multiplication works, this should make sense. The new x_i = 1 in the input will be multiplied by each one of the new weights we have in our model, letting the model learn a bias term for each one of the linear combinations.
We are organizing our inputs and weights transposed relative to the diagram above, but the same concept applies. Let's use a boolean flag in our constructor to indicate whether or not we want this layer to have a bias, and implement the test for it.
We have changed hasBias to be true, and changed the expected output to all be increased by 1 because of the extra row and column. This test will fail until we update our LinearLayer class, so let's not delay and get right to it.
First, let's add a member variable to hold onto if we have a bias.
Then in the constructor, if we do have a bias, add an extra row of 1s to the weights.
Finally, in the forward pass, we will again check if we have a bias, and if so, add an extra column to the input.
We made a local copy of the input so that we can modify it, then added an extra column of 1s if there should be a bias. Don't forget to update the code below with the new local variable l_input instead of the argument a_input. Run the set of tests again, and tada! 2 tests up and running. We are cruising now!
Let's move on to the next step in our forward pass, so we can finish it all the way out before taking on the backwards pass. Hopefully you remember from the last post that the linear module (Lisa) passes her output through a non-linear activation (we called him Alex).
The activation function we chose was a Rectified Linear Unit, or ReLU. This will be a pretty simple module to implement. The ReLU function looks like the following:
For our forward pass, we are simply going to apply this function to each of our values in our matrix, and pass it on to the next module. Let's write a test for what we expect this to look like. Go ahead and open a file called tests/relu_layer_test.cpp
Pretty straightforward: if the value is greater than 0, let it through; if not, zero it out. Onto the header and implementation. We will add our header at include/neural/layers/relu_layer.h.
This is a similar interface to our linear layer, and not by coincidence. We will see by the end of this exercise that all of the modules we need follow a relatively similar interface. The main difference is some modules have trainable weights, and others do not.
Now for the implementation. Add src/relu_layer.cpp.
We include the header for our new layer, and the implementation simply applies the function from above to each element of the input matrix.
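In spirit, the forward pass is just this (sketched as a free function rather than the class method, using the same TMatrix shape):

```cpp
#include <algorithm>
#include <vector>

typedef std::vector<std::vector<float>> TMatrix;

// ReLU forward: pass positive values through unchanged, zero out the rest.
TMatrix ReLUForward(const TMatrix& input)
{
    TMatrix output = input;
    for (std::vector<float>& row : output)
    {
        for (float& value : row)
        {
            value = std::max(0.0f, value);
        }
    }
    return output;
}
```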
Compile and run to see our third test pass!
Now we should have all the tooling we need to see our forward pass in action. Let's head back to tools/feedforward_neural_net/main.cpp and chain our modules together to see the output.
First we define our 3 layers: a linear layer to start, a non-linear activation layer (ReLU), and a linear layer at the end to make the final prediction. This ends up being all the code needed for a forward pass through the model we defined before.
We passed in 4 random weights for the first linear layer which represent the edges between the first Lisa and Alex.
The linear layer under the hood has created our bias terms and bias weights, since the default parameter for the linear layer constructor is to do so.
Then the second layer is defined as a ReLU layer, which has no weights, but will run the outputs of the first layer through a non-linear activation function.
The third layer is another linear layer, this time with two random weights and an internal bias.
Lastly we have the error module which looks at the target output, and the predicted output, and calculates error.
We then define one training example, with two inputs, and one target output value.
And finally we run the data through our network to get the predicted output.
When you build and run this new main function, you should see a random guess, with a relatively high error.
This means we are about halfway done! We have all the interfaces and classes defined that we need. Now we simply have to implement the backward pass by having each module calculate its gradient in the Backward() function, then apply the gradients to update the weights.
Let's start with defining the gradient for our error function, since it is the first module that will be run in the backward pass. If you remember from our last post, the error function is defined as:
Meaning the derivative of the error function:
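Assuming the squared error from the last post takes the form error = (output - target)^2, the derivative with respect to the output is 2 * (output - target). A sketch of both passes:

```cpp
// Sketch of the squared error forward and backward passes.
// Assumes error = (output - target)^2, so dError/dOutput = 2 * (output - target).
float SquaredErrorForward(float output, float target)
{
    float diff = output - target;
    return diff * diff;
}

float SquaredErrorBackward(float output, float target)
{
    return 2.0f * (output - target);
}
```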
Since we do not yet have a test module for our SquaredError class, let's create one to make sure the forward and backward passes work as intended. Open up tests/squared_error_test.cpp
These tests simply follow the equations laid out above. The first one calculates the error given an output and a target. The second one calculates the derivative of the error with respect to the input prediction that was made.
We already implemented these forward and backward passes when we did the simple linear regression before, so compiling and running this code should just work. One trick when using gtest is that you can run a subset of the tests with the --gtest_filter parameter. It takes a regex of the test names you want to run.
Great, the first step of backprop is finished. Now onto our linear and relu layers.
We have already declared the interface for the backward pass in our LinearLayer class, we just have to implement it now.
Each layer is going to have the same interface to make building larger networks easier. They will take their original inputs and the gradient from the next layer as input, while returning their own gradient of the output with respect to the input. When it gets to our main function, we will be able to chain these backwards calls from each layer together, passing the output from one, into the input of the next.
Not only will we want to keep track of the gradient of the output with respect to the input, but we will also need the gradient of the output with respect to the weights. The gradient with respect to the input gets passed to the previous layer in the chain, because the previous layer needs to know how changing its output affects the layers after it. The gradient with respect to the current layer's weights will be used to update those weights, so the layer can change its own output.
Keeping track of all these gradients can be a little confusing, but the pattern is the same for all layers, and once you get the hang of it, it is not that bad.
First let's tackle calculating the gradient of the output with respect to the input for the LinearLayer. Since it is just a linear transformation, this gradient will just be equal to the weights, which were the coefficients each input was multiplied by. We need to remember this is a chain of functions though, so we also need to apply the chain rule. All this means is we will multiply the weights by the gradientInput from the layer after it. Luckily we have this as input, making the backward calculation pretty simple.
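A sketch of just this calculation as a free function (the names and exact signatures are mine; the class method wraps the same math):

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<float>> TMatrix;

// Gradient of a linear layer's output with respect to its input, with the
// chain rule applied: gradWrtInput = gradInput * transpose(weights).
// weights is (numInputs x numOutputs); gradInput is (batch x numOutputs).
TMatrix GradWrtInput(const TMatrix& gradInput, const TMatrix& weights)
{
    TMatrix grad(gradInput.size(), std::vector<float>(weights.size(), 0.0f));
    for (std::size_t i = 0; i < gradInput.size(); ++i)
    {
        for (std::size_t j = 0; j < weights.size(); ++j)
        {
            // weights[j][k] plays the role of transpose(weights)[k][j]
            for (std::size_t k = 0; k < gradInput[i].size(); ++k)
            {
                grad[i][j] += gradInput[i][k] * weights[j][k];
            }
        }
    }
    return grad;
}
```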
Since this layer does a few operations on matrices, I have decided to make a new class called MatrixMath where we can put all the functions that operate on matrices. It will only need a couple of methods, namely Multiply(), Transpose(), AddCol(), and RemoveCol(). The header is as follows in include/neural/math/matrix_math.h
The implementation for Multiply() is just stolen from our initial LinearLayer::Forward() function, with a few variables renamed.
And Transpose() simply swaps the rows and columns of the matrix, like in the following gif.
Swapping the rows and columns in code is pretty straightforward.
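A sketch of what that looks like:

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<float>> TMatrix;

// Transpose: element (i, j) of the input becomes element (j, i) of the output,
// so an (n x m) matrix becomes (m x n).
TMatrix Transpose(const TMatrix& input)
{
    TMatrix output(input.at(0).size(), std::vector<float>(input.size()));
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        for (std::size_t j = 0; j < input[i].size(); ++j)
        {
            output[j][i] = input[i][j];
        }
    }
    return output;
}
```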
AddCol() goes through the matrix and adds a column at the far right with the value specified. This is used for when we have a bias term and need to use the bias-trick.
Lastly we need RemoveCol() which is used during backprop to get rid of the internal bias weights.
I've written tests for these modules in the github repo for this project, and you are free to check them out there.
This allows us to simplify the forward pass in our LinearLayer class, as well as reuse the code for Backward(). Here is our new simplified forward pass:
We then need to add some code to Backward() to handle the case where we have a bias term. We do not need to pass back derivatives for the bias inputs, because they were never actually passed into the module and are always one. We simply remove the last column of the gradWrtOutput if there is a bias term.
You may wonder why there is a Transpose() in the matrix multiplication. You will quickly notice when you have forgotten one, because the dimensions of the matrices will not line up. In our case, in the last layer, the gradient input is 1x1 from the error module, and our weights are 3x1. You cannot multiply a 1x1 matrix by a 3x1 matrix, because the inner dimensions do not match. You must transpose the weights to be 1x3, and then the matrix multiplication lines up.
With Backward() implemented, we now have the ability to pass the gradient for this layer to the layer before it, but what about the gradient used to update this layer's weights? We need the gradient of the output with respect to the weights. Since this is again a linear transform, it will be equal to the input. Using the chain rule, we multiply the input by the gradient from the next layer.
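Sketched as a free function, the weight gradient is the transposed input times the incoming gradient (names here are mine):

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<float>> TMatrix;

// Gradient of the output with respect to the weights, with the chain rule
// applied: weightGrad = transpose(input) * gradInput.
// input is (batch x numInputs); gradInput is (batch x numOutputs).
TMatrix GradWrtWeights(const TMatrix& input, const TMatrix& gradInput)
{
    TMatrix grad(input.at(0).size(),
                 std::vector<float>(gradInput.at(0).size(), 0.0f));
    for (std::size_t i = 0; i < input.at(0).size(); ++i)
    {
        for (std::size_t j = 0; j < gradInput.at(0).size(); ++j)
        {
            for (std::size_t k = 0; k < input.size(); ++k)
            {
                grad[i][j] += input[k][i] * gradInput[k][j];
            }
        }
    }
    return grad;
}
```

Notice the result has the same shape as the weights matrix (numInputs x numOutputs), which is exactly what we need to update them.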
This is all good, but our interface currently does not allow us to return this gradient to update the weights. If you remember from the last training loop we implemented, we accumulated this gradient for the weights until all the training examples were seen, then we updated our weights at the end of the epoch. Let's add some methods to our LinearLayer class to allow for this behavior.
First we add a member variable m_weightGrads to hold onto our weight gradients as we call Backward(). This leaves us with the full implementation of Backward as:
Then at the end of each epoch we can call UpdateWeights() to use the cached gradients to update our weights.
And finally to calculate the average gradient, we implement CalcAvgWeightGrad().
This closes out our LinearLayer class! Let's move onto the backward pass for the ReLULayer.
The gradient for ReLU is pretty straightforward to implement. If the input was greater than 0 then the gradient is 1, and if the input was less than zero it is 0.
Using the chain rule, we will multiply this 1 or 0 by the input gradient from the next layer, which means we can just take the gradient input if the input is > 0 and zero it out otherwise.
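Sketched as a free function:

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<float>> TMatrix;

// ReLU backward: pass the incoming gradient through where the original
// input was positive, and zero it out everywhere else.
TMatrix ReLUBackward(const TMatrix& input, const TMatrix& gradInput)
{
    TMatrix grad = gradInput;
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        for (std::size_t j = 0; j < input[i].size(); ++j)
        {
            if (input[i][j] <= 0.0f)
            {
                grad[i][j] = 0.0f;
            }
        }
    }
    return grad;
}
```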
This should be all we need to head back to our main function and write our full training loop!
First let's define the super small dataset we are trying to model. If you recall, it looks like this:
Turning it into code, we define it at the top of our main function.
Next we define our model like before...
Then for a predetermined number of epochs, we will loop over the training data.
Run our forward pass through our 3 layers.
Calculate and accumulate error so that we can average it.
Then run our backward pass accumulating the gradients and passing them along.
After each epoch, we can calculate the average error, and update our weights in the linear layers with the gradients they accumulated.
Compiling and running our code, we see after 100 epochs through our data, our average error drops to 0.00043 and all of our predictions are rather close.
Pretty cool! The way we have organized our code into "layers" allows us to easily extend our networks to be arbitrarily deep. Hopefully you noticed the interfaces for the layers' Forward() and Backward() feed right into each other, and are the same for both types of layers. This makes it easy to chain all our function calls in a loop, and to keep our layers in a standard library container of pointers, all inheriting from the same base class.
Note, there are still some inefficiencies in our library, but I was optimizing this post for clarity, not speed. This code will be very slow on larger datasets due to memcpys and our naive matrix multiplication code. You would also probably run out of memory caching all of the gradients for an entire epoch of gradient descent. But for this small dataset, and to demonstrate how to start designing a neural network library, this is a great starting point!
In the following posts, we will design a more performant version of this library, reusing a lot of the code from here. If you want the full code from this post, it is posted in this github repo. If you feel comfortable with this code and example, come follow me to the next post where we go over what a more fully fleshed out API looks like.