To keep pace with the huge increase in demand for AI, technologies must scale efficiently, meeting strict latency and performance requirements while keeping total cost of ownership and total power consumption low.
Inefficient solutions are creating challenges today.
Using our patented technologies and the techniques outlined below, we accelerate RNNs and other DNNs with sparse layers, achieving maximum throughput and ultra-low latency and enabling hyper-scale inference in data center, edge, and embedded applications.
The best way for businesses to overcome these challenges and avoid these costs is to co-design the algorithms, hardware, and software.
We jointly optimize the algorithms, hardware, and software, making the final solution significantly more efficient than optimizing any one area alone.
Quantization and sparsity are two such techniques we employ to compress models for deployment.
Quantization is a widely adopted co-design technique, supported by all the major frameworks, that reduces the number of bits used to represent each neural network parameter and activation during inference. This co-design technique is fully supported by our software stack and hardware.
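To make the idea concrete, here is a minimal sketch of post-training affine quantization to 8 bits in NumPy. This is an illustration of the general technique only, not Myrtle.ai's stack or any framework's API; the function names and the uniform asymmetric scheme are assumptions for the example.

```python
import numpy as np

def quantize_uint8(w):
    """Affine (asymmetric) quantization of float32 weights to 8 bits.

    Maps the observed range [min, max] of `w` onto the integers 0..255
    using a scale and zero-point, cutting storage 4x versus float32.
    """
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float values."""
    return (q.astype(np.float32) - zero_point) * scale

# Demonstrate on a random weight matrix: the round-trip error per
# weight is bounded by roughly one quantization step (the scale).
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, s, z = quantize_uint8(w)
w_hat = dequantize(q, s, z)
err = float(np.abs(w - w_hat).max())
```

Real deployments typically refine this with per-channel scales and calibration data, but the storage and bandwidth saving, 8 bits per weight instead of 32, is already visible in this sketch.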
Sparsity is another co-design technique that is becoming more widely adopted. During sparsity-aware training, weights of zero or near-zero value are gradually pruned, leaving a “sparse” model. Depending on the network, this can remove up to 95% of the total number of parameters with little to no loss in accuracy. This technique is also supported by our hardware and software stacks.
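The pruning step at the heart of this process can be sketched as one-shot magnitude pruning in NumPy. This simplified illustration is an assumption for exposition: sparsity-aware training as described above prunes gradually over many training steps rather than in a single pass, and the 95% target here echoes the figure quoted in the text.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights.

    Finds the magnitude threshold below which the requested fraction of
    weights falls, then masks those weights to exactly zero.
    """
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

# Prune 95% of a random weight matrix, matching the headline figure
# for how many parameters pruning can remove on some networks.
rng = np.random.default_rng(1)
w = rng.normal(size=(512, 512)).astype(np.float32)
w_sparse = magnitude_prune(w, 0.95)
frac_zero = float((w_sparse == 0).mean())
```

A model pruned this way can then be stored in a compressed sparse format, and hardware that skips the zero weights, as described in this paper, does proportionally less compute per inference.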
This white paper describes these techniques and the compelling benefits they deliver over CPUs and GPUs.
Due to the fast-changing nature of neural networks, we believe that re-programmable silicon in the form of FPGAs or FPGA-based accelerator cards for data centers will be the heart of optimized inferencing for many applications in the future. We’re in good company with this view; every server Microsoft deploys into Azure data centers contains the re-programmable silicon we program for exactly this reason. Whether in the cloud or on-premise data centers, these cards allow machine learning models to be continually redesigned and deployed. Embedded FPGAs in edge applications can also be upgraded in the field when new, improved models are developed, thus future-proofing the system.
We abstract the hardware design to enable software engineers to harness reconfigurable technology in machine learning, mapping their algorithms onto a mixture of compute resources and achieving previously impossible levels of performance and energy efficiency across a range of execution scenarios.
In some very high-volume applications, it may be expedient to migrate the design to an ASIC after an initial prototyping phase using an FPGA. The Myrtle.ai team can support this approach and provide IP for the ASIC.
FPGA-based accelerator cards are now ubiquitous in the cloud and are being installed in on-premise data centers across the globe. This enables us to deliver the benefits of our machine learning inference solutions rapidly and at scale.
Our ability to massively reduce hardware and energy costs has also enabled us to drive machine learning to the edge, where these costs become critical. As a global edge-workload benchmark owner with MLPerf, we have the insight to produce optimized solutions for multiple embedded applications on FPGAs or even ASICs.
Whether you need a solution for a data center or edge application, you can evaluate the competitive advantage Myrtle.ai can bring to your business by contacting us today.