Optimizing Our Neural Network With Blas And Openmp
How to speed up our neural network with paralellization and open source libraries.
In the last post we downloaded the MNIST dataset and wrote a data loader to turn the images and labels into tensors. We then fed these images through a feed forward neural network with a linear regression objective function to try to guess what hand written digit was present. If you want a fresh start with the code, feel free to grab it from github.
When we run this starter code, it is apparent that it takes way too long for the network to process each example. In fact, it processes just over 1 image per second. At this rate, it would take over a day to see all 60,000 examples. This is far too slow for an experiment on a dataset of this size! By the end of our optimizations, we should be able to get an iteration for a single image down to ~0.001 seconds. That's over 1000x speed up! Not only that, but we will add in a method called mini batching that will allow us to leverage BLAS even farther.
In the last post we added some logging and figured out the bottle neck was the matrix multiplication step in our linear layers. Matrix multiplication is an operation that is of computational complexity O(n^3), and a well known reason for why neural networks can be slow.
Luckily there are some robust and fast libraries out there that make speeding up this operation relatively simple. A commonly used one is called BLAS which stands for "Basic Linear Algebra Subprograms". BLAS has done a lot of the work optimizing and parallelizing many linear algebra problems to run fast on many CPU architectures.
Before we add the BLAS implementation, let's get some baseline stats on how fast our code is running by putting some logging in our LinearLayer module. Edit the LinearLayer::Forward method to log before and after the matrix multiplication. Glog will put timestamps with each log statement so we can see how long they take.
When we run the code, we see it takes about 0.14s to run the first layers matrix multiplication.
Now let's add BLAS to see the performance boost. On OSX you can simply use homebrew to install the libary:
brew install openblas
(If you are on other platforms a good old google search should do the trick)
You will then need to add the proper libraries and headers in your CMakeLists.txt file. This leaves us with a new CMake file that should look like:
Then you will the src/tensor_math.cpp file to use the library call instead of our own implementation.
Here you see the c-style function "cblas_sgemm". It takes in raw pointers to our data arrays, and some variables about their size. "C" is our output array, that we grab from a tensor we created with the correct shape. If you want info on more of the parameters Apple has some limited documentation here.
You will also need to include the cblas header.
On OSX brew will install the library in /usr/local/opt/openblas, so we will update our cmake build by running cmake -D BLAS_INCLUDE_DIR=/usr/local/opt/openblas/include -D BLAS_LIB_DIR=/usr/local/opt/openblas/lib .. from the build directory.
If you now run the new binary you will see that the new implementation is almost instant.
The 1x785 * 785x300 matrix multiplication now runs in less than 0.001 seconds on my machine. This is a great start, but you may notice that there is still a slow section of our code between iterations.
It takes 0.3 seconds between the time we calculate error from the forward pass, to the time we start the next iteration. This means there is some other bottle neck in our backward pass. Let's add some more logging to the linear layer to see where exactly this may be.
When we look our backward pass, we see that we are doing more matrix multiplications, but we also are doing a transpose operation. We will have to break this code up and add more logging to see if this is something we need to optimize.
We see that the transpose in our smaller layer is reasonably fast, but in the larger layer with a 785*300 weight matrix it takes 0.07 seconds. I bet we can speed this up.
Let's add another library that we will use extensively for paralellization called OpenMP. This library makes it very easy to break up operations in a loop into a multi-threaded implementation. If every operation is independent in your loop, ie. they do not rely on one being completed before another, OpenMP can split up the work into as many CPU cores as you have on your machine.
Again, it is easy to install on OSX assuming you have homebrew:
brew install libomp
We then have to update our CMakeLists.txt file again, adding some code to find the OpenMP package.
As well as adding this library to the LIBS section.
Now let's head over to src/tensor_math.cpp again and update our Transpose() function.
When using OpenMP with our tensors, we will find that it is most optimal to loop over the raw data matrix if possible, instead of using two for loops for each row and column. For our Transpose() function it will look like this:
In just about the same amount of code, we will see an order of magnitude level speed up. Compile and run to see the difference.
Now our transpose operation takes ~0.001s! But there is more work to do. Check out the time between "gradient computation" and the next iteration "ITER (0,1)". There is still 0.26 seconds of precious time being taken.
I'll save you the step of adding logging to figure out where the next point of optimization is, and just tell you that it is when we are updating our weights. We can again simply turn our two sets of for loops into single for loops that are parallelized and see the difference.
If we get rid of all of our debug logging, we can see that our new iteration time is now about 0.002 seconds!
By the 800th image seen (which now only takes 1.6 seconds to get to), we are even classifying some of the training images correctly.
This is fantastic progress. We dropped our iteration time by multiple orders of magnitude, and added two libraries in BLAS and OpenMP that we will use where ever possible to keep our code optimized.
At this point we could stare at the logs and hope we start seeing some images classified correctly, but how would we know exactly how well the net is performing? How would we know when to stop training? This is exactly why the MNIST dataset gives us a test set as well as a training set. In the next post we are going to use our same data loader to load the test set and write a module that will help us compute our accuracy on the test set to see how well our net has learned the task. If you want the full code from this post, feel free to grab it here.