Classifying Handwritten Digits With A Feed Forward Neural Network
How to write a data loader for the MNIST data set and classify using a neural network
So far in our series of posts we have been using very small datasets to illustrate the intuition behind neural networks. We have built our way up to a modular neural network library that can learn non-linear functions from tensor data. Now it is time to see the power of the code we have written so far, and apply it to a real-world dataset.
The typical "hello world" dataset for machine learning is called MNIST. It is a collection of handwritten digits that has been split into a training set of 60,000 images and a test set of 10,000 images.
The dataset consists of 28x28 grayscale images of the digits 0-9 that people have written, and has a nice variety of shapes and styles. If you scan a check with your bank, it is likely that they are using similar technology to what we are going to build to recognize the amount of money on the check (although the bank's technology would be more advanced, since it also has to segment the check into 28x28 regions that contain the individual digits).
In this post, we are going to download this dataset and write a data loader to parse the images and turn them into tensors. Then we will feed the tensors as training data to our neural network, and see how well it performs on new images it has never seen before.
The data is split into 4 files on the MNIST website.
You can go directly to the website, or download each of them here:
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
We are going to use the neural network library we have been slowly developing as a starting point for this task. If you have not been following along or want a fresh start with your codebase you can grab the starter code here.
In the last post we went over the important data structure called "the tensor". In this post, we are going to fill this data structure with the grayscale images from the MNIST dataset, and see if we can use our regression objective function to classify the images into 1 of 10 categories.
Each image in the dataset is represented by a grid of 28x28 (or a total of 784) pixel values between 0-255. 0 means background (white), 255 means foreground (black).
The data files you downloaded above are a binary representation of this data. You will not be able to simply open them in your favorite image previewer. In fact the train-images-idx3-ubyte file contains 60,000 of these images for training, and the t10k-images-idx3-ubyte file contains 10,000 images for testing. The other two files (train-labels-idx1-ubyte and t10k-labels-idx1-ubyte) contain the labels for each one of the images.
The MNIST website has a description of how the data is organized.
With the data format known, we can hop right into writing our data loader. Let's create a new header, implementation, and test for loading the MNIST data.
#!/bin/bash
mkdir include/neural/data
touch include/neural/data/mnist_dataloader.h
touch src/mnist_dataloader.cpp
touch tests/mnist_dataloader_test.cpp
The data loader's job is going to be to read our input files from disk, and convert them to tensors we can feed into our network. Let's define some unit tests to see what we want the API to look like.
First, we know that we expect the training data to have 60,000 examples and the test data to have 10,000 examples.
We will have 2 parameters for our constructor: the data path and whether or not it is loading training data. In these tests we are simply checking that we read the correct data length for the training and test sets. You may notice the hard coded data path "../data/mnist/". If you haven't downloaded the data yet, go ahead and download it, unzip it, and put it in a directory called data at the top level of your repo. We will be running the ./tests binary from the build/ directory, which is why we include the ".." in the path.
The directory structure should look like the following:
Next we want to verify we can correctly load the first couple examples.
The first example in the dataset is an image of a 5, and the third example is an image of a 4. It would be hard to verify all the pixels in the image are correct because there are 784 values to check. This is why we check two separate input values, because if they are both correct, it probably means our offset math in the function is correct. At minimum we should check if the sizes of all the tensors are correct.
To make these tests compile, let's define our header file with the constructor and methods.
We will default to the data being training data in the constructor. Then we pass in two mutable tensor pointers to our DataAt() function as return values. The DataAt() method will also return a boolean to indicate whether we were able to get the requested example.
Now let's implement the constructor to read the proper values from our input path.
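As a rough sketch of the path selection such a constructor needs, the helper below picks the image and label files for the requested split. The names MnistPaths and SelectMnistPaths are hypothetical and only for illustration; the file names are the unzipped names of the four downloads, and the data path is assumed to end with a "/".

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type: the two files that one split of MNIST needs.
struct MnistPaths {
  std::string images;
  std::string labels;
};

// Pick the image and label files for the training or test split.
// Assumption: `data_path` ends with a separator, e.g. "../data/mnist/".
MnistPaths SelectMnistPaths(const std::string& data_path,
                            bool is_training = true) {
  assert(!data_path.empty());  // sketch-level precondition
  const std::string prefix = is_training ? "train" : "t10k";
  return MnistPaths{data_path + prefix + "-images-idx3-ubyte",
                    data_path + prefix + "-labels-idx1-ubyte"};
}
```

Calling SelectMnistPaths("../data/mnist/") gives the training files, and passing false as the second argument gives the t10k test files.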
All the constructor is doing is defining where we are going to read the data from, and initializing some member variables we will need to read the data later. There are a few helper functions here to test if files exist, and read sizes from the files. Let's go ahead and define these helpers in the header.
There are a few methods here, but each one will be succinct and help us get the job done. Let's start with testing if a file exists.
p_FileExists() simply tries to open the file. If the file can be opened it returns true, otherwise it returns false.
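One way to write such a check is shown below as a free function called FileExists (a stand-in name, not necessarily the member function from this post). It is only a sketch, and it treats "can be opened for reading" as "exists".

```cpp
#include <fstream>
#include <string>

// Returns true if `path` can be opened for reading, false otherwise.
// Note: a file that exists but is not readable also yields false.
bool FileExists(const std::string& path) {
  std::ifstream file(path, std::ios::binary);
  return file.is_open();
}
```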
Next let's recall how the data is formatted.
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]           [description]
0000     32 bit integer  0x00000803(2051)  magic number
0004     32 bit integer  60000             number of images
0008     32 bit integer  28                number of rows
0012     32 bit integer  28                number of columns
We are going to have a generic function to read any of these integer values in the header of our file.
We first skip n bytes at the start of the file depending on which index we want to read. Then we read the value and reverse its byte order. The MNIST files store integers most significant byte first (big-endian), while the Intel processor in my Mac is little-endian. Most PCs these days use a little-endian processor, but you can check yours to see if you need the p_ReverseInt() function call. If you are on OSX, you can click the Apple logo in the upper left and it should tell you the specs.
Now we can implement our functions for reading the number of images, the image width, and image height using our p_ReadIntAt() function.
This should be good enough to get our first two tests passing. Now onto the DataAt() function.
In this function we are opening the files and skipping to the data we are interested in. Then we read the bytes, and load them into our tensor. Each pixel is stored in a uint8_t, or a single byte, meaning it will be a value between 0-255. We cast it to a float before adding it to our input data.
Neural networks learn much better if their input values are between -1.0 and 1.0. We can discuss why this is in a later post, but for now, let's add a function to scale our range of [0,255] to [-1,1].
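A linear rescaling is enough here: divide by 255 to get [0,1], then stretch and shift to [-1,1]. The function name ScalePixel below is just a placeholder for this sketch.

```cpp
#include <cstdint>

// Map a raw pixel in [0, 255] linearly onto [-1, 1]: 0 -> -1, 255 -> +1.
float ScalePixel(std::uint8_t pixel) {
  return (static_cast<float>(pixel) / 255.0f) * 2.0f - 1.0f;
}
```

With the MNIST convention of 0 for background, every background pixel becomes -1.0 and the darkest ink becomes 1.0.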
Then update our DataAt() function to use this scaled value.
Great! Now our first couple unit tests should pass.
Let's head over to our main function to update it to use our fancy new data loader. Open up tools/feedforward_neural_net/main.cpp and add the include for our data loader.
We can then replace our dummy dataset by instantiating our data loader.
Next let's define our model with the correct input and output sizes.
We will use a 2 layer neural network with 300 hidden units to mimic one of the examples on the MNIST website that achieved a 4.7% error rate. This means this neural network should be able to correctly classify the handwritten digits 95.3% of the time. Notice we use our Tensor::Random method to generate the initial weights for the layers. We specify that the weights should be small random numbers between -0.01 and 0.01.
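To make the initialization concrete without depending on our Tensor class, here is a sketch that fills a flat std::vector<float> with uniform random values in a given range. RandomWeights is a hypothetical stand-in and is not the Tensor::Random implementation. The shapes in the usage note assume 784 inputs, 300 hidden units, and a single regression output.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Return rows * cols weights drawn uniformly from the range [low, high).
// A fixed seed keeps runs reproducible while debugging.
std::vector<float> RandomWeights(std::size_t rows, std::size_t cols,
                                 float low, float high, unsigned seed) {
  assert(low < high);
  std::mt19937 generator(seed);
  std::uniform_real_distribution<float> distribution(low, high);
  std::vector<float> weights(rows * cols);
  for (float& weight : weights) {
    weight = distribution(generator);
  }
  return weights;
}
```

For the network described above, that would be RandomWeights(784, 300, -0.01f, 0.01f, seed) for the first layer and RandomWeights(300, 1, -0.01f, 0.01f, seed) for the second.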
Now let's change the training loop to use our new data loader instead of the hard coded dummy data.
This is pretty similar to our training loop from before, but it uses the data loader to grab tensors at each iteration. There are also 60,000 training examples here, so waiting until the very end to update the weights would take up too much memory, seeing as we store the gradients on each call to "Backward". It would probably be too noisy of a signal to learn off of anyway since we are averaging the gradients during the update. In this case, we will update our weights after each example is seen.
If you run this code, you will see it starts off guessing a very large number, nowhere near our target outputs of 0-9, but around the 5th iteration it starts guessing more reasonable numbers. Note: this may not be the exact same output for you, since we started with different random weights.
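To isolate the idea of updating after every example, here is a toy sketch that has nothing to do with our network classes: it fits a single weight w in y = w * x with a squared error loss and applies the gradient immediately after each example. The function name and the toy data are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fit y = w * x, updating w right after every single example
// instead of accumulating gradients over the whole dataset.
float TrainPerExample(const std::vector<float>& xs,
                      const std::vector<float>& ys, float learning_rate,
                      int epochs) {
  assert(xs.size() == ys.size());
  float w = 0.0f;
  for (int epoch = 0; epoch < epochs; ++epoch) {
    for (std::size_t i = 0; i < xs.size(); ++i) {
      const float prediction = w * xs[i];
      // Gradient of (w * x - y)^2 with respect to w.
      const float gradient = 2.0f * (prediction - ys[i]) * xs[i];
      w -= learning_rate * gradient;  // update before the next example
    }
  }
  return w;
}
```

On data generated by y = 2x, the weight moves toward 2 within a few passes, and only one example's gradient is ever held in memory at a time.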
Congratulations! We are now learning off some real data! There are currently a few problems with our implementation, but don't worry we will knock them out one by one in the next couple posts to achieve greater than 90% accuracy.
The biggest problem right now (in my opinion) is that our current implementation is very slow. It takes anywhere from 0.5-1.0 seconds to learn off of a single image depending on your CPU. Assuming 0.7 seconds, this means it will take 0.7*60000=42000 seconds, 42000/60=700 minutes, or 700/60=11.67 hours to go through all the examples once. I don't know about you, but I am not willing to wait half a day to see our results after 1 epoch. It often takes 10 to 100 epochs to see results over 90% accuracy, which at this speed means anywhere from about 5 days to 7 weeks of training. Then imagine if you have a bug in your code. This is way too long to wait for an experiment to run.
If we put a few timing statements into the code to find out where exactly the bottleneck is, we will find out it is in our linear layers, specifically the matrix multiplication and transpose operations. This is a well known bottleneck of neural networks, and is something all neural network frameworks put time into optimizing.
Luckily, matrix multiplication is an operation that is easily parallelizable. Each output in a matrix multiplication is independent of the other outputs, so we can compute them in parallel. We could parallelize this operation ourselves, but it will probably still not be fast enough to train most neural networks. Fortunately, there are libraries out there that take advantage of specialized CPU and GPU architectures to do this computation for you fast. We will start with a CPU implementation that uses the BLAS library, which has implementations of many specialized linear algebra operations.
Follow me to the next post to see how we can add the BLAS library to our project to make our matrix multiplications blazing fast on most CPUs. If you want the full code for this example, you can find it here.
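To see why the outputs are independent, here is a naive matrix multiplication over flat row-major vectors. It is a sketch for illustration and not the Tensor code from our library; the function name and the row-major layout are assumptions. Each output element only reads one row of A and one column of B, so no iteration of the two outer loops depends on another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major matrices: A is m x k, B is k x n, C is m x n.
std::vector<float> MatMul(const std::vector<float>& a,
                          const std::vector<float>& b, std::size_t m,
                          std::size_t k, std::size_t n) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      // c[i][j] depends only on row i of A and column j of B,
      // so every (i, j) pair could be computed by a different worker.
      float sum = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        sum += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = sum;
    }
  }
  return c;
}
```

This loop does m * n * k multiply-adds, which for a 784 by 300 layer is 235,200 per example in the forward pass alone, and that is the work an optimized library speeds up.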