Efficient Hyperscale Inference
To keep pace with the rapid growth in demand for AI, inference technologies must scale efficiently, meeting strict latency and performance requirements while keeping total cost of ownership and total power consumption low.
Inefficient solutions are creating challenges today:
- Development teams are reducing model sizes to meet strict latency and performance requirements. Smaller models are less accurate, which lowers the quality of the service and can directly impact revenue.
- Inefficiencies leave the available hardware in a business’s infrastructure underutilized, forcing the business to over-provision significantly and increasing cost.
Achieving Efficient Inference
We optimize ML inference for efficient hyperscale deployment using our patented MAU Accelerator™ technologies and proven design techniques such as:
- Heterogeneous compute employing algorithm, hardware & software co-design
- Quantization to suit the targeted hardware platform
- Exploitation of sparsity in the model
Combined, these can achieve the following with little to no impact on the accuracy of the final model:
- Reduce the number of compute and memory operations by up to 95%
- Reduce memory storage and bandwidth requirements by more than 10x
- Reduce memory access energy consumption by more than 100x
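As a rough illustration of how sparsity and quantization combine, the sketch below applies magnitude pruning and symmetric 8-bit quantization to a toy weight matrix, then counts the multiply-accumulate (MAC) operations that remain. It is a minimal sketch in plain Python, not the MAU Accelerator implementation: the function names, the random weights, the 95% sparsity target and the int8 format are all assumptions chosen for illustration.

```python
import random


def prune_by_magnitude(weights, sparsity):
    """Zero the smallest-magnitude entries so that `sparsity` of them are zero."""
    rows, cols = len(weights), len(weights[0])
    # Rank every (row, col) position by the magnitude of its weight.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: abs(weights[rc[0]][rc[1]]),
    )
    pruned = [row[:] for row in weights]
    for r, c in order[: int(rows * cols * sparsity)]:
        pruned[r][c] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-matrix quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [[round(w / scale) for w in row] for row in weights]
    return q, scale


def sparse_matvec(q, scale, x):
    """Multiply the quantized matrix by x, skipping zero weights.

    Returns the output vector and the number of MACs actually performed.
    """
    y, macs = [], 0
    for row in q:
        acc = 0.0
        for w, xi in zip(row, x):
            if w != 0:
                acc += w * xi
                macs += 1
        y.append(acc * scale)
    return y, macs


random.seed(0)
# Hypothetical 100x100 layer with normally distributed weights.
W = [[random.gauss(0, 1) for _ in range(100)] for _ in range(100)]
x = [random.gauss(0, 1) for _ in range(100)]

q, scale = quantize_int8(prune_by_magnitude(W, sparsity=0.95))
y, macs = sparse_matvec(q, scale, x)
dense_macs = len(W) * len(W[0])
print(f"MACs performed: {macs} of {dense_macs}")
```

In a real deployment only the nonzero 8-bit values and their positions would need to be stored and fetched, which is where the storage, bandwidth and memory-access savings come from. The accuracy impact depends on the model and on how pruning is applied during training, which this toy example does not attempt to show.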
The MAU Accelerator™ can accelerate RNNs and other DNNs with sparse layers, simultaneously achieving maximum throughput and ultra-low latency for hyperscale inference in data center applications. This enables higher-quality models to be deployed, providing better services and customer experiences while delivering significant savings in infrastructure costs and energy consumption.
Maximum throughput with ultra-low latency
- Deterministic low tail latency
- Improved latency-bounded throughput
- Reduced infrastructure costs
- Higher-quality models under a given latency bound
- Reduced energy consumption
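Latency-bounded throughput can be read as the highest throughput a system sustains while its tail latency stays within a given bound. The sketch below illustrates that definition in plain Python. The nearest-rank 99th percentile, the `best_under_bound` helper and all of the sample numbers are hypothetical; they are not measurements of the MAU Accelerator.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def best_under_bound(runs, bound_ms, pct=99):
    """Pick the run with the highest throughput whose tail latency is within bound_ms.

    `runs` is a list of (throughput, latency_samples) pairs.
    Returns (throughput, tail_latency), or None if no run meets the bound.
    """
    tails = [(throughput, percentile(latencies, pct)) for throughput, latencies in runs]
    within = [(throughput, tail) for throughput, tail in tails if tail <= bound_ms]
    return max(within, default=None)


# Hypothetical benchmark runs: (throughput in requests/s, latency samples in ms).
runs = [
    (1000, [4, 5, 5, 6, 7]),
    (2000, [6, 7, 8, 9, 12]),
    (4000, [9, 11, 14, 20, 35]),
]

result = best_under_bound(runs, bound_ms=15)
print(result)
```

In this made-up data the fastest configuration is rejected because its tail latency exceeds the bound, even though most of its requests finish in time. That is why a deterministic, low tail latency translates directly into more usable throughput.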
Low-latency inference acceleration for real-time, memory-bound workloads, including:
- Speech transcription
- Natural language processing
- Speech synthesis
- Time series analysis
- Payment & trading fraud detection
- Recommendation systems
Rapid & Easy Deployment
The MAU Accelerator runs on data center servers equipped with accelerator cards from Intel, Xilinx and BittWare/Molex. These accelerator cards are available today, both in the cloud and for on-premises data centers, facilitating rapid implementation at scale. Neural network models created in popular frameworks that support ONNX, such as TensorFlow, PyTorch or MXNet, can easily be deployed on the MAU Accelerator.
Application Example 1: Speech Synthesis
The MAU Accelerator can be used to deliver high-fidelity speech synthesis at very high throughput, running WaveNet on a BittWare 520NX Accelerator Card.
- Best-in-class vocoder model for near-human-quality speech synthesis
- Low, deterministic tail latency
- 16x throughput advantage over a GPU solution
- Significant CapEx and energy savings
Application Example 2: Automatic Speech Recognition
The MAU Accelerator can be used to achieve very high throughput at ultra-low latency, running speech transcription on an Intel PAC D5005 or Xilinx Alveo U250 Accelerator Card.
- 165x higher performance than a CPU-only solution
- 2.1x higher performance per watt than a GPU solution
- 29x lower latency than a GPU solution
Application Example 3: Natural Language Processing
The MAU Accelerator can be used to significantly reduce the server infrastructure required for an NLP workload when run on an Intel PAC D5005 or Xilinx Alveo U250 Accelerator Card.
- 2.2x lower cost than a CPU-only solution
- 7.7x smaller carbon footprint than a CPU-only solution