Evaluating MNIST Accuracy On The Test Set
How to quantitatively know how well your neural network is generalizing.
If you have been following along, we have optimized our code and sped up our neural network to be fast enough to predict the number in images of handwritten digits from the MNIST dataset. To get our code up to this point, feel free to grab it from GitHub.
Given a 28x28 black and white image of a number, like the one above, we are using a regression technique to predict the number contained inside. For example, we would like to predict a number close to "2" for the image above.
The problem now is, we do not know how well our neural network is actually performing. All we see in the logs is a number reported from our squared error function. We would like a more concrete metric to see how well we are doing. In this case, we are going to use accuracy (number of images correctly classified / total number of images seen).
You may remember that the MNIST dataset comes with a training dataset, and a test dataset. These datasets contain different sets of images. We will compute the accuracy on the test data set so that we know that the neural network is generalizing to new data, and not just memorizing the examples we have shown it so far.
We have already set up our data loader to toggle between loading the train set and the test set given a parameter in the constructor, so let's add a new data loader to our main function.
We then will want a function to compute the accuracy on the test set given our model's layers and the test data loader.
This function is very similar to our training loop, but does not perform back propagation. It simply passes an example from the test set through the layers until we get an output. This output will be a floating point value (without a limit really, we just hope it is between 0-9). We round the output to the nearest integer, and compare it to the target label. If the rounded value equals the target, we count that as correct. At the end we simply divide the number correct by the total number.
We can then use this function at the end of our training loop to see how well our model is performing on the test set.
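To make the round-and-compare step concrete, here is a minimal sketch. It is not the CalcAccuracy() from this series: the name RoundedAccuracy and the use of plain std::vector inputs are assumptions made so the example stands alone, without our layers or data loader.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the series' code base): returns the
// fraction of raw network outputs that round to the correct integer label.
float RoundedAccuracy(const std::vector<float>& predictions,
                      const std::vector<int>& targets) {
  assert(predictions.size() == targets.size());
  if (predictions.empty()) {
    return 0.0f;  // Nothing seen yet, so report zero rather than divide by zero.
  }
  std::size_t num_correct = 0;
  for (std::size_t i = 0; i < predictions.size(); ++i) {
    // std::lround rounds to the nearest integer (halfway cases away from zero).
    if (std::lround(predictions[i]) == targets[i]) {
      ++num_correct;
    }
  }
  return static_cast<float>(num_correct) /
         static_cast<float>(predictions.size());
}
```

In the real function, each prediction would come from a forward pass through the layers rather than from a pre-filled vector.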
Since it takes a couple minutes to get through all the data, I also decided to calculate the accuracy every 10000 examples. Make sure you lower your learning rate for this example to 0.0001 (which I just found to work better than higher ones from tinkering), then build and run our binary.
You will see the very first time it runs the CalcAccuracy() computation, we get absolutely none correct.
This is to be expected: we would have only seen one training example at this point, and the output was -19.50! This isn't remotely close to our target values between 0-9. Note: your exact prediction values may be slightly different seeing as we initialized our nets with different random weights. In fact it will be different every time you run the code.
Never fear though: after only 10,000 examples our network is doing better than random, predicting the correct label 19% of the time!
I am printing out a prediction every 1000 examples just for debugging, and we can see that some of the predictions are getting pretty close to the targets, while others are between 0-9 but still pretty far away.
You may think to yourself, 19% is still very bad, and you are correct. When watching neural networks learn, at the beginning they will often start at very low accuracies, but it is always good to note how well they would be doing if we "randomly guessed". In our case, our output value is unbounded, meaning it could literally predict any floating point number a computer could compute. The fact that it went from 0% to 19% after 10,000 images is encouraging.
Another way to look at it is that even if the net learned "ok, I am going to guess a random number between 0-9, since those are the only values I see for output", we would still only be at 10% with the random guesses. The fact that we are at 19% means that we are doing much better than random guesses on the images (but again, still pretty terrible).
Let's let it train for a little longer to see how well we do. I advise piping the output to a log file that you can refer back to later to see how the accuracy progresses, since it will take a few minutes to get through our data.
./feedforward_neural_net &> log.txt
After the first epoch, we can see we are up to 30% accuracy on the test set!
If you run this code for ~10 epochs, I've seen the accuracy get up to around 50%. This is much better than random, but clearly not state of the art.
If you think about the value our network is outputting, it may become apparent why we are getting stuck. Many numbers have patterns that look like other numbers, but are not close in our output value. For example look at these 1's and 7's.
If you guess a 7 for that first 1, our error metric would tell us you are wayyy off, and the gradient would adjust your weights so that next time you are much closer to a 1. In fact, out of all the other digits, 7 is probably the second most likely guess for that image.
3's and 8's look pretty much identical on the left half of the image. Some of the 5's are just squiggles. 9 is an upside down 6. It is easy to see how only letting our network predict one value is a little harsh.
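A quick bit of arithmetic shows how harsh this is. The sketch below assumes a plain squared difference for the error; the squared error function in this series may scale it differently (for example by one half), and the name SquaredError is made up for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for our squared error: (output - target)^2.
// Assumption: no extra scaling factor is applied.
float SquaredError(float output, float target) {
  const float diff = output - target;
  return diff * diff;
}
```

Under that assumption, guessing 7 for a 1 costs (7 - 1)^2 = 36, while guessing 2 for the same 1 costs only (2 - 1)^2 = 1, even though a 7 looks far more like a 1 than a 2 does.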
A lot of debugging why a neural network is not learning usually comes down to the data you are presenting it and the objective function (assuming your implementation of all the layers is correct). It takes some experimentation to figure out what configuration will work the best. This is why we implemented our CalcAccuracy() function to tell us if we are making changes that are really substantial or not.
Let's test a hypothesis of how we can make our neural network more robust. Right now our second linear layer is 300*1, meaning it will only predict one output. What if we simply predicted 10 outputs instead? In terms of defining our layers, all we would have to do is change the output value from 1 to 10:
This changes our network from one with a single output value to a network with ten output values.
This way if the net is unsure if it is a 1 or a 7, it can predict relatively high numbers for both output slots, and then we can use the gradient of the error function to down-weight the output that is incorrect and up-weight the output that is correct.
This also means we will have to change our targets to have 10 values. We simply use a 1-hot encoding, where the correct answer is a 1, and the rest are zeros.
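As a rough sketch of what 1-hot encoding means in code, assuming a plain std::vector<float> in place of our Tensor class (the function name OneHot is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds a vector of num_classes zeros with a single 1.0
// at the index of the correct label.
std::vector<float> OneHot(std::size_t label, std::size_t num_classes) {
  assert(label < num_classes);
  std::vector<float> encoded(num_classes, 0.0f);
  encoded[label] = 1.0f;
  return encoded;
}
```

For MNIST, OneHot(2, 10) would give a 1 in slot 2 and zeros in the other nine slots.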
Being able to adjust each output individually will help us fine tune our network even if it is unsure about multiple answers.
Let's change the data loader to load the output values into 1x10 tensors instead of 1x1.
We will also have to change our loss function to handle a variable number of outputs (in this case 10) instead of just 1.
I decided to make a new class for this, since it will have a slightly different interface. I also call it "mean squared error", since we will get an average of all the errors to produce a single floating point output value for the loss. Mean squared error is a more common term and is abbreviated as MSE in many machine learning frameworks.
Create the header in our include/neural/loss directory called mean_squared_error_loss.h
Our Forward() method will still return a float, it will simply average the errors for each output. The Backward() method now returns a tensor of gradients, one for each output, so that we can adjust their values individually.
The implementation for Forward() will now loop over all the inputs and targets, sum them up and divide.
Then Backward() will calculate the derivative for each one of the input-target pairs, giving us a 1x10 gradient.
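Here is a minimal sketch of both computations, using free functions over std::vector<float> rather than the class and Tensor type from this series. The names MseForward and MseBackward are made up, and the gradient includes the 2/n factor that falls out of differentiating the mean; the actual implementation may fold that constant away.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Forward(): the mean of the squared differences.
float MseForward(const std::vector<float>& outputs,
                 const std::vector<float>& targets) {
  assert(outputs.size() == targets.size() && !outputs.empty());
  float sum = 0.0f;
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    const float diff = outputs[i] - targets[i];
    sum += diff * diff;
  }
  return sum / static_cast<float>(outputs.size());
}

// Hypothetical stand-in for Backward(): one gradient per output.
// d/d(output_i) of mean((output - target)^2) = 2 * (output_i - target_i) / n.
std::vector<float> MseBackward(const std::vector<float>& outputs,
                               const std::vector<float>& targets) {
  assert(outputs.size() == targets.size() && !outputs.empty());
  const float n = static_cast<float>(outputs.size());
  std::vector<float> gradients(outputs.size());
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    gradients[i] = 2.0f * (outputs[i] - targets[i]) / n;
  }
  return gradients;
}
```

Each gradient carries the sign of its own error, which is what lets us push one output up and another down in the same update.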
We now need to update our training and test loop to use the new loss function. I've highlighted the changes with comments that use ** to draw your attention to them.
Besides including our new header, you will notice the MaxIdx() function on our tensors. Our output and target tensors are 1x10 vectors, so in order to find which label was correct, we need to know the index of the element with the maximum value.
For example, we could have the two tensors output and target:
In order to compare them, we want to know the index with the maximum value, to see if our network is predicting a high value for the correct 1-hot encoded index.
Let's implement this function in our Tensor class now. Open up the Tensor header file and add the public method:
Then implement it by keeping track of the maximum value and current maximum index as we iterate through all the data.
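The idea can be sketched as a free function over a std::vector<float>. This is an assumption-laden stand-in, not the Tensor method itself: the real version would iterate over the tensor's own data, and returning the first index on ties is my choice here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free-function version of MaxIdx(): returns the index of the
// largest element, keeping the first one if there is a tie.
std::size_t MaxIdx(const std::vector<float>& data) {
  assert(!data.empty());
  std::size_t max_idx = 0;
  float max_val = data[0];
  for (std::size_t i = 1; i < data.size(); ++i) {
    if (data[i] > max_val) {
      max_val = data[i];
      max_idx = i;
    }
  }
  return max_idx;
}
```

A prediction then counts as correct when MaxIdx(output) == MaxIdx(target). Note that this works even when every output is negative, since only the relative ordering matters.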
This should be all we need to test out our hypothesis that 10 outputs compared to 10 targets is better than a single output. Run your new implementation to see how we do!
You will see that the very first time we test the accuracy on the test set we already are at ~8%. This is actually about the worst we can do now, since we are taking the max value of 10 outputs. We are essentially randomly guessing numbers between 0-9, meaning we will be right about 10% of the time. Maybe the data is skewed towards certain classes, maybe we got unlucky with our random initialization, who knows. It is important to remember this "random" baseline so we are not surprised by our initial accuracy calculation.
After running for 10,000 examples we see our accuracy on the test set has already rocketed up to 72%!
This is wonderful news! We haven't even made it a quarter of the way through the training set and we have already greatly surpassed the accuracy of our last architecture. It should make sense intuitively why this is. Not only can we wiggle individual outputs in the correct direction, but our network has 10x more parameters in the last layer it can optimize. It has the computational power that our single output had last time for each one of its ten new outputs.
By the end of the 10 epochs in my experiment, the network had achieved 89% accuracy on the test set, and 93% accuracy on the last 10,000 examples of the training set.
Not too shabby for our second experiment! There is still plenty of room for improvement, but our neural network has gone from failing out of "Hand Written Digits 101" to an A- with just a few tweaks.
You may be thinking it is still rather odd that the network can predict arbitrary linear outputs for each number. In fact, if you looked at our very first prediction, with randomly initialized weights, you would see that all the predicted values are negative.
We should never allow our network to predict negative values seeing as this isn't possible in our training data. As the net learns a little more, the predictions get a little more reasonable.
It predicted the highest value for the correct label this time, but the numbers still seem a little arbitrary. There are still negative values, and there is not really a rhyme or reason for what 0.51 means.
In the next post we will use a "softmax" function, as well as a new loss function called "cross entropy loss" to further improve on our existing neural network architecture. This will put our outputs in a more intuitive range, as well as improve our accuracy. If you want the full code for this post, you can find it here.