We care about understanding the machine learning models we produce and making them as small and accurate as possible. We like to give our understanding back to the community by explaining techniques we’ve developed on tiny datasets.
The open source notebooks that form part of this series held the #1 spots on two Stanford machine learning leaderboards for over six months, until April 2019. All models that now rank above those notebooks are derivations of this original work.
In this white paper we survey a wide variety of model compression techniques that are amenable to deployment on a range of hardware platforms. In particular, we compare different model sparsity methods and levels, and seven widely used precisions as targets for quantisation.
When we were looking for a great application to run on the Intel Stratix 10 NX FPGA, we turned our attention to WaveNet, a neural network model that we know to be extremely difficult to implement on existing compute platforms (see our previous blog post: https://myrtle.ai/learn/wavenet/). Two years on, and armed with a new AI-optimised FPGA…
We jointly published this white paper with Intel, describing the way in which a WaveNet vocoder model can be compressed to optimize the use of the AI Tensor blocks and HBM memory on the Intel® Stratix® 10 NX FPGA.
This white paper explains how we exploited the sparsity inherent in typical RNNs and used quantisation to compress an ASR model by as much as 95% with minimal loss of accuracy.
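As a rough illustration of the two techniques named above, magnitude pruning and symmetric int8 quantisation can be sketched in a few lines of NumPy. This is a minimal sketch under illustrative assumptions, not the white paper's actual pipeline: the 90% pruning level, the per-tensor scale, and the random weight matrix are all stand-ins chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in RNN weight matrix

# Magnitude pruning: zero out the 90% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(w), 0.90)
w_sparse = np.where(np.abs(w) >= threshold, w, 0.0)

# Symmetric per-tensor int8 quantisation of the surviving weights.
scale = np.abs(w_sparse).max() / 127.0
w_int8 = np.round(w_sparse / scale).astype(np.int8)

# Dequantise to measure the reconstruction error introduced by rounding.
w_restored = w_int8.astype(np.float32) * scale
sparsity = (w_sparse == 0).mean()
err = np.abs(w_restored - w_sparse).max()
print(f"sparsity: {sparsity:.0%}, max quantisation error: {err:.4f}")
```

Combining the two is what drives the large compression ratio: the zeroed weights need not be stored at all, and each surviving weight shrinks from 32 bits to 8.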
In the final post of the series we come full circle, speeding up our single-GPU training implementation to take on a field of multi-GPU competitors. We roll out a bag of standard and not-so-standard tricks to reduce training time to 34s, or 26s with test-time augmentation.
We investigate how batch normalisation helps optimisation (spoiler: it involves internal covariate shift…). Along the way we meet some bad initialisations, degenerate networks and spiky Hessians.
We learn more about the influence of weight decay on training and uncover an unexpected relation to LARS.
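For readers unfamiliar with LARS (layer-wise adaptive rate scaling), the update rule below is a minimal single-layer sketch showing where weight decay enters the layer-wise learning rate; the trust coefficient and hyperparameter values are illustrative defaults, not settings from the post.

```python
import numpy as np

def lars_step(w, grad, lr=0.1, weight_decay=5e-4, trust_coef=0.001):
    """One LARS update for a single layer (illustrative sketch).

    The layer-wise learning rate is scaled by ||w|| / ||grad + wd*w||,
    which is where weight decay interacts with the step size.
    """
    update = grad + weight_decay * w
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    local_lr = trust_coef * w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * local_lr * update

rng = np.random.default_rng(0)
w = rng.standard_normal(100)
g = rng.standard_normal(100)
w_new = lars_step(w, g)
```

Because the weight-decay term appears inside the norm that sets the layer-wise rate, changing the decay coefficient changes the effective step size as well as the shrinkage, which is the kind of coupling the post explores.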
We develop some heuristics for hyperparameter tuning.
In which we try out some different networks and discover that we’ve been working too hard. So far, we’ve been training a fixed network architecture, taken from the fastest single-GPU DAWNBench entry on CIFAR10. With some simple changes, we’ve reduced the time taken to reach 94% test accuracy from 341s to 154s. Today we’re going…
We identify a performance bottleneck and add regularisation to reduce the training time further to 154s.
We investigate the effects of mini-batch size on training and use larger batches to reduce training time to 256s.
We establish a baseline for training a Residual network to 94% test accuracy on CIFAR10, which takes 297s on a single V100 GPU.
The introduction to a series of posts investigating how to train Residual networks efficiently on the CIFAR10 image classification dataset. By the fourth post, we can train to the 94% accuracy threshold of the DAWNBench competition in 79 seconds on a single V100 GPU.
Are GPUs a good target for speech synthesis? Is Baidu’s GPU implementation of WaveNet the best you can do on a GPU? We run some tests, discuss latency, and find out.