In which we pit ourselves against eight GPUs

At the end of the last post we were training CIFAR10 to 94% test accuracy in 256s. This compares to an initial baseline of 341s and to a somewhat unrealistic target of 40s based on 100% compute efficiency on a single V100 GPU. Today we shall be aiming at an intermediate goal – overtaking the winning DAWNBench entry, which used 8 GPUs and trained in 174s. We shall continue with a single GPU since we are a long way from using all the FLOPs.

We can get a rough timing profile of our current setup by selectively removing parts of the computation and running the remainder. For example, we can preload random training data onto the GPU to remove data loading and transfer times. We can also remove the optimizer step and the ReLU and batch norm layers to leave just the convolutions. If we do this, we get the following rough breakdown of timings across a range of batch sizes:

A few things stand out. First, a large chunk of time is being spent on batch norm computations. Secondly, the main convolutional backbone (including pooling layers and pointwise additions) is taking significantly longer than the roughly one second per epoch predicted at 100% compute efficiency. Thirdly, the optimizer and dataloader steps don’t seem to be a major bottleneck and are not an immediate focus for optimization.
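The subtractive profiling described above can be sketched in a few lines: time a sequence of progressively stripped-down configurations and attribute each drop in time to the component that was removed. The function name, the configuration names and the numbers below are all placeholders for illustration, not measurements from this post.

```python
def breakdown(timings, steps, full="full"):
    """Attribute time to components by differencing nested configurations.

    `timings` maps a configuration name to its measured time in seconds.
    `steps` lists (config, removed_component) pairs, ordered from the full
    pipeline down to the most stripped-down run; each configuration is the
    previous one with one more component removed.
    """
    costs = {}
    previous = timings[full]
    for config, removed in steps:
        costs[removed] = previous - timings[config]
        previous = timings[config]
    # Whatever the last configuration still costs is the remaining core work.
    costs["remainder"] = previous
    return costs


# Placeholder timings in seconds, purely illustrative.
example = {
    "full": 10.0,
    "no_dataloading": 9.0,
    "no_optimizer": 8.5,
    "no_batchnorm_relu": 4.0,
}
steps = [
    ("no_dataloading", "data loading"),
    ("no_optimizer", "optimizer step"),
    ("no_batchnorm_relu", "batch norm + ReLU"),
]
costs = breakdown(example, steps)
for name, seconds in costs.items():
    print(f"{name}: {seconds:.1f}s")
```

Differencing like this assumes the costs are additive, which is only approximately true when GPU work overlaps with data loading, so the result is best read as a rough guide.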

With the help of resident GPU expert Graham Hazel, we looked at some profiles and quickly found the problem with batch norms – the default method of converting a model to half precision in PyTorch (as of version 0.4) triggers a slow code path which doesn’t use the optimized CuDNN routine. If we convert batch norm weights back to single precision then the fast code is triggered and things look much healthier:

With this improvement in place the time for a 35 epoch training run to 94% accuracy drops to 186s, closing in on our target!

There are many things that we could try to get over the line and bring training below 174s. Further optimizations of the GPU code are available. For instance, activation data is currently stored in NCHW format, whilst the fast CuDNN convolution routines for TensorCores expect data in NHWC order. As described here, the forward and backward computations perform transposes before and after each convolution, accounting for a significant proportion of the overall run time. Since native NHWC computation is not supported in PyTorch 0.4 and doesn’t seem to have mature support in other frameworks, we will leave this for now and perhaps revisit in a later post.
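To make the layout difference concrete, here is a small pure-Python sketch of how the same element is addressed in a flat buffer under each ordering, and of the transpose between them. The function names are hypothetical, and real frameworks do this on the GPU rather than with Python loops.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) when channels vary slowest after batch."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index of the same element when channels vary fastest."""
    return ((n * H + h) * W + w) * C + c


def nchw_to_nhwc(flat, N, C, H, W):
    """Reorder a flat NCHW buffer into NHWC order; every element is moved once."""
    out = [None] * len(flat)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = offset_nchw(n, c, h, w, C, H, W)
                    dst = offset_nhwc(n, c, h, w, C, H, W)
                    out[dst] = flat[src]
    return out


# One image, two channels, one row of three pixels.
# In NCHW order channel 0 is [0, 1, 2] and channel 1 is [3, 4, 5].
nchw = [0, 1, 2, 3, 4, 5]
nhwc = nchw_to_nhwc(nchw, N=1, C=2, H=1, W=3)
print(nhwc)
```

In NHWC order the two channel values of each pixel sit next to each other. A transpose has to read and write the whole activation tensor, which is why doing one before and after every convolution adds up.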

Cutting training to 30 epochs would lead to a 161s finish, easily beating our current target, but simply accelerating the baseline learning rate schedule leads to 0/5 training runs reaching 94% accuracy.

A simple regularisation scheme that has been shown to be effective on CIFAR10 is so-called Cutout regularisation which consists of zeroing out a random subset of each training image. We try this for random 8×8 square subsets of the training images, in addition to our standard data augmentation of padding, clipping and randomly flipping left-right.
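As a minimal sketch of the idea, the function below zeroes one random 8×8 square in a single-channel image stored as a list of rows. It assumes the square is placed so that it lies fully inside the image, which may differ from the placement rule used in the experiments here. The name `cutout` and its parameters are illustrative.

```python
import random


def cutout(image, size=8, rng=random):
    """Return a copy of `image` with one random size x size square zeroed.

    `image` is a list of rows (H x W) of pixel values.
    Assumption: the square is sampled uniformly among positions that lie
    fully inside the image.
    """
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    out = [row[:] for row in image]  # copy so the original stays untouched
    for y in range(top, top + size):
        for x in range(left, left + size):
            out[y][x] = 0
    return out


rng = random.Random(0)  # seeded so the example is repeatable
image = [[1] * 32 for _ in range(32)]
masked = cutout(image, size=8, rng=rng)
print(sum(row.count(0) for row in masked))  # number of zeroed pixels
```

For a colour image the same square would be zeroed in every channel, and a fresh position would be drawn each time an image is presented to the network.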

Results on the baseline 35 epoch training schedule are promising with 5/5 runs reaching 94% accuracy and the median run reaching 94.3%, a small improvement over the baseline. A bit of manual optimization of the learning rate schedule (pushing the peak learning rate earlier and replacing the decay phase with a simple linear decay since the final epochs of overfitting don’t seem to help with the extra regularisation in place) brings the median run to 94.5%.
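A schedule of the shape described above (a linear warm-up to a peak followed by a single linear decay) can be written as a piecewise linear function of the epoch. The knot positions and the peak value in this sketch are placeholders, not the values used for these runs, and `piecewise_linear` is a hypothetical helper name.

```python
def piecewise_linear(knots, values):
    """Return a function that linearly interpolates between (knot, value) pairs.

    `knots` must be increasing; outside the knot range the end values are held.
    """
    def schedule(epoch):
        if epoch <= knots[0]:
            return values[0]
        segments = zip(zip(knots, values), zip(knots[1:], values[1:]))
        for (x0, y0), (x1, y1) in segments:
            if epoch <= x1:
                return y0 + (y1 - y0) * (epoch - x0) / (x1 - x0)
        return values[-1]
    return schedule


# Placeholder knots: warm up to the peak at epoch 5, decay linearly to zero at epoch 35.
lr = piecewise_linear([0, 5, 35], [0.0, 0.4, 0.0])
for epoch in (0, 2.5, 5, 20, 35):
    print(epoch, lr(epoch))
```

Moving the middle knot to the left pushes the peak earlier, and shortening the run is just a matter of moving the final knot.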

If we accelerate the learning rate schedule to 30 epochs, 4/5 runs reach 94% with a median of 94.13%. We can push the batch size higher to 768 and 4/5 reach 94% with a median of 94.06%. The timings for 30 epoch runs are 161s at batch size 512 and 154s at batch size 768, comfortably beating our target and setting what may be a new speed record for the task of training CIFAR10 to 94% test accuracy, all on a single GPU! For reference, the new 30 epoch learning rate schedule is plotted below. Other hyperparameters (momentum=0.9, weight decay=5e-4) are kept at their values from the original training setup.

Having achieved the target we set at the beginning of the post, it’s time to wrap up for today. The code to reproduce these timings can be found here. Our new record, such as it is, should be easy enough to improve upon. First, we remain below 25% compute efficiency on a single GPU and there are known optimizations that could improve this. Secondly, it should be possible to reduce the number of training epochs using techniques such as Mixup regularisation and AdamW training. We have not explored parameter averaging to speed up final convergence, and if we are prepared to do more work at inference time there is the possibility of using test time augmentation to reduce training time further. There are rumours of sub-20 epoch training runs, albeit for a larger network, using a combination of these techniques, and it would be interesting to explore these avenues further.
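The compute efficiency figure can be checked with some quick arithmetic. The sketch below assumes that the 40s target at 100% efficiency refers to a 35 epoch run and that the ideal time scales linearly with the number of epochs; neither assumption is stated explicitly in this post.

```python
TARGET_SECONDS = 40.0  # ideal time at 100% compute efficiency, from the post
TARGET_EPOCHS = 35     # assumption: the 40s target corresponds to 35 epochs


def efficiency(run_seconds, run_epochs):
    """Fraction of ideal throughput achieved by a run of the given length."""
    ideal_seconds = TARGET_SECONDS * run_epochs / TARGET_EPOCHS
    return ideal_seconds / run_seconds


# (seconds, epochs) for the runs quoted in this post.
runs = {
    "35 epochs after batch norm fix": (186, 35),
    "30 epochs, batch size 512": (161, 30),
    "30 epochs, batch size 768": (154, 30),
}
for name, (seconds, epochs) in runs.items():
    print(f"{name}: {efficiency(seconds, epochs):.1%}")
```

Under these assumptions all three runs come out at roughly 21 to 22%, consistent with the claim that we remain below 25%.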

However, we are going to leave these avenues unexplored for now and instead take a look at the architecture of the network that we’ve been using up till now. We will find that this is an unexpectedly rich seam for optimization.

In Part 4 we simplify the network architecture. Training gets (a lot) faster.