Optimized Solutions for Challenging Machine Learning Applications
To meet the surge in demand for AI, technologies must scale efficiently, satisfying strict latency and performance requirements while keeping total cost of ownership and total power consumption low.
Inefficient solutions are creating challenges today:
- Development teams are reducing model sizes in order to meet strict latency and performance requirements. This reduces the accuracy and therefore the quality of the service, which can directly impact revenue.
- The hardware already available in a business’s infrastructure is underutilized because of inefficiencies, so the business must significantly over-provision, which increases cost.
Using our patented technologies and the powerful techniques outlined below, we can accelerate RNNs and other DNNs with sparse layers, achieving maximum throughput and ultra-low latency, enabling hyper-scale inference in both data center and edge or embedded applications.
The best way for businesses to overcome these challenges and avoid their consequences is to co-design the algorithms, hardware, and software.
We jointly optimize the algorithms, hardware, and software, making the final solution significantly more efficient than one produced by improving any single area alone.
Quantization & Sparsity
Quantization and sparsity are two techniques we employ to compress models for deployment.
Quantization is a widely adopted co-design technique, supported by all the major frameworks, that reduces the number of bits used to represent each neural network parameter and activation during inference. With support from our software stack and hardware, this co-design technique:
- Decreases storage and bandwidth requirements by a factor of 4 (as happens, for example, when 32-bit values are replaced with 8-bit values)
- Increases performance
- Reduces energy consumption
- Has little to no impact on the accuracy of the final model
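The storage saving can be illustrated with a minimal sketch of symmetric 8-bit quantization. This is only an illustration under stated assumptions, not a description of our software stack: the function names, the per-tensor scaling scheme, and the sample weights are all hypothetical, and the source bit width is assumed to be 32.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Hypothetical helper for illustration; real toolchains offer richer schemes.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto 127; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp into the int8 range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the 8-bit representation."""
    return [v * scale for v in q]


# Made-up sample weights, assumed to be stored as 32-bit floats originally.
weights = [0.82, -1.27, 0.003, 0.45, -0.09, 1.10, -0.66, 0.0]
fp32 = array("f", weights)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = len(fp32) * fp32.itemsize
int8_bytes = len(q) * q.itemsize
compression = fp32_bytes / int8_bytes
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"storage: {fp32_bytes} bytes -> {int8_bytes} bytes ({compression:.0f}x smaller)")
print(f"largest rounding error: {max_error:.5f} (step size {scale:.5f})")
```

In this sketch the rounding error of each weight is bounded by half a quantization step, which is one reason the accuracy impact can stay small when the value range is well covered.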
Sparsity is another co-design technique that is becoming more widely adopted. During Sparsity Aware Training, weights of zero or near-zero value are gradually pruned, leaving a “sparse” model. Depending on the network, this can remove up to 95% of the total number of parameters with little to no loss in accuracy. This technique, also supported by our hardware and software stacks:
- Reduces the number of compute and memory operations by up to 95%
- Reduces the storage and bandwidth requirements by more than 10x
- Reduces the memory access energy consumption by more than 100x as the parameters can now be stored on-chip
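As a rough sketch of how pruning translates into fewer operations, the example below zeroes the smallest-magnitude weights in stages and then counts the multiply-accumulates that a zero-skipping dot product still has to perform. It is an assumption-laden illustration: the helper names, the staged targets, and the random data are hypothetical, and the retraining that would normally happen between pruning stages is omitted.

```python
import random


def prune_smallest(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = round(len(weights) * sparsity)
    # Indices sorted by magnitude, smallest first (already-zero weights lead).
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sparse_dot(weights, inputs):
    """Dot product that skips zero weights and counts the operations it used."""
    total, macs = 0.0, 0
    for w, x in zip(weights, inputs):
        if w != 0.0:
            total += w * x
            macs += 1
    return total, macs


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
inputs = [random.gauss(0.0, 1.0) for _ in range(1000)]
dense_macs = len(weights)

# Raise the sparsity target gradually; a real flow would retrain between stages.
for target in (0.50, 0.80, 0.95):
    weights = prune_smallest(weights, target)

_, macs = sparse_dot(weights, inputs)
reduction = 1 - macs / dense_macs

print(f"multiply-accumulates: {dense_macs} dense -> {macs} sparse")
print(f"operations removed: {reduction:.0%}")
```

A practical sparse format also has to store the positions of the remaining weights, so the storage saving is typically somewhat smaller than the reduction in operation count.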
Due to the fast-changing nature of neural networks, we believe that re-programmable silicon in the form of FPGAs or FPGA-based accelerator cards for data centers will be the heart of optimized inferencing for many applications in the future. We’re in good company with this view; every server Microsoft deploys into Azure data centers contains the re-programmable silicon we program for exactly this reason. Whether in the cloud or in on-premises data centers, these cards allow machine learning models to be continually redesigned and deployed. Embedded FPGAs in edge applications can also be upgraded in the field when new, improved models are developed, thus future-proofing the system.
We abstract the hardware design so that software engineers can harness reconfigurable technology for machine learning, mapping their algorithms onto a mixture of compute resources to achieve levels of performance, energy efficiency and execution scenarios that were previously out of reach.
In some very high-volume applications, it may be expedient to migrate the design to an ASIC after an initial prototyping phase using an FPGA. The Myrtle.ai team can support this approach and provide IP for the ASIC.
Data Center & Edge Applications
FPGA-based accelerator cards are now ubiquitous in the cloud and are being installed in on-premises data centers across the globe. This enables us to deliver the benefits of our machine learning inference solutions rapidly and at scale.
Our ability to massively reduce hardware and energy costs has also enabled us to drive machine learning to the edge, where these costs become critical. As a global edge workload benchmark owner with MLPerf, we have the insight to produce optimized solutions for multiple embedded applications on FPGAs or even ASICs.
Whether you need a solution for a data center or edge application, you can evaluate the competitive advantage Myrtle.ai can bring to your business by contacting us today.