We care about understanding the machine learning models we produce and making them as small and accurate as possible. We like to give our understanding back to the community by explaining techniques we’ve developed on tiny datasets.
The open source notebooks that form part of this series officially held the #1 spots of two Stanford machine learning league tables for over six months until April 2019. Today, all models that rank above those notebooks are derived from this original work.
In the final post of the series we come full circle, speeding up our single-GPU training implementation to take on a field of multi-GPU competitors. We roll out a bag of standard and not-so-standard tricks to reduce training time to 34s, or 26s with test-time augmentation.
We investigate how batch normalisation helps optimisation (spoiler: it involves internal covariate shift…). Along the way we meet some bad initialisations, degenerate networks and spiky Hessians.
We learn more about the influence of weight decay on training and uncover an unexpected relation to LARS.
We develop some heuristics for hyperparameter tuning.
In which we try out some different networks and discover that we’ve been working too hard. So far, we’ve been training a fixed network architecture, taken from the fastest single-GPU DAWNBench entry on CIFAR10. With some simple changes, we’ve reduced the time taken to reach 94% test accuracy from 341s to 154s. Today we’re going
We identify a performance bottleneck and add regularisation to reduce the training time further to 154s.
We investigate the effects of mini-batch size on training and use larger batches to reduce training time to 256s.
We establish a baseline for training a Residual network to 94% test accuracy on CIFAR10, which takes 297s on a single V100 GPU.
The introduction to a series of posts investigating how to train Residual networks efficiently on the CIFAR10 image classification dataset. By the fourth post, we can train to the 94% accuracy threshold of the DAWNBench competition in 79 seconds on a single V100 GPU.
Are GPUs a good target for speech synthesis? Is Baidu’s GPU implementation of WaveNet the best you can do on a GPU? We run some tests, discuss latency and find out.