Why the world needs MLCommons
A common thread in the DNA of technical research and development teams is a strong desire to make a meaningful contribution to the future of their chosen discipline. The ML experts and software and hardware developers here at Myrtle.ai are certainly no exception. They recognize that even the best contributions may never see the light of day unless the industry works together to realize the enormous potential of ML for the benefit of society.
In the early days of ML, teams in organizations around the world claimed performance advantages over their competitors based on whichever metric suited their specific choice of software and hardware for implementing a model. MLPerf has gone a long way towards making those comparisons meaningful, with its rigorous selection of fair and useful benchmarks. We recognized the value of that initiative early on and have been instrumental in its development as co-chair of the speech working group and owner of the speech transcription inference benchmark.
One thing that became clear to us was the lack of data available to organizations that want to develop effective speech models. There wasn’t enough openly available English-language speech data, let alone data for the many other languages in use around the world. Witness what happened in the field of image classification when ImageNet was released: it drove a revolution in that field, and we now need to see the same in others.
Alongside fair benchmarks and open datasets, the third cornerstone required to bring the industry together, and hence accelerate innovation in ML, is a shared infrastructure for portability and reproducibility. In recognition of these three needs, a new open engineering consortium named MLCommons has now been created to develop globally accepted metrics and benchmarks, datasets and best practices. We’re proud to be a founding member of MLCommons, working alongside organizations that range from other agile innovators to global giants such as Intel, Microsoft and NVIDIA. One of the first projects to come out of MLCommons will be The People’s Speech open dataset. We’ve been part of the team behind it, which will make over 80,000 hours of transcribed speech, in diverse languages and by diverse speakers, widely available. We’re proud to be contributing to a future in which R&D teams can fulfil their potential and improve the future for us all.