In this series of posts, we investigate how to train residual networks on the CIFAR10 image classification dataset, and how to do so efficiently on a single GPU.

To track progress, we report the time taken to train a network from scratch to 94% test accuracy. This benchmark comes from the recent DAWNBench competition. At the end of the competition, the state of the art was 341s on a single GPU and 174s on eight GPUs. By the fourth post, we will be training in under 100s on a single GPU, comfortably beating the winning multi-GPU time, with plenty of room for improvement. Code to reproduce this result is available here.
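As a rough illustration of the metric, the sketch below accumulates training time epoch by epoch and stops once test accuracy reaches the target. The names `time_to_accuracy`, `train_epoch`, `evaluate` and `clock` are hypothetical placeholders, not part of the series' code, and excluding evaluation time from the total is an assumption of this sketch rather than a statement of the competition rules.

```python
import time
from typing import Callable, Optional


def time_to_accuracy(
    train_epoch: Callable[[], None],   # hypothetical: runs one epoch of training
    evaluate: Callable[[], float],     # hypothetical: returns test accuracy in [0, 1]
    target: float = 0.94,
    max_epochs: int = 100,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds spent training until evaluate() >= target, else None."""
    elapsed = 0.0
    for _ in range(max_epochs):
        start = clock()
        train_epoch()
        # Assumption: only time spent inside train_epoch is counted.
        elapsed += clock() - start
        if evaluate() >= target:
            return elapsed
    return None  # target not reached within max_epochs
```

Passing the clock as a parameter keeps the function easy to check with a fake timer. On a real GPU run, `train_epoch` would also need to wait for outstanding device work to finish before returning so that the measured time is meaningful.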

Later in the series, we try to gain insight into the training dynamics and extract lessons for other settings.


  1. Baseline: We analyse a baseline and remove a bottleneck in the data loading. (training time: 297s)
  2. Mini-batches: We increase the size of mini-batches. Things go faster and don’t break. We investigate how this can be. (training time: 256s)
  3. Regularisation: We remove a speed bump in the code and add some regularisation. Our single GPU is faster than the eight-GPU competition winner. (training time: 154s)
  4. Architecture: We search for more efficient network architectures and find a 9-layer network that trains well. (training time: 79s)
  5. Hyperparameters: We develop some heuristics to aid with hyperparameter tuning.
  6. Weight decay: We investigate how weight decay controls the learning rate dynamics.
  7. Batch norm: We learn that batch normalisation protects against covariate shift after all.
  8. Bag of tricks: We uncover many ways to speed things up further when we find ourselves displaced from the top of the leaderboard. (final training time: 26s)