Jeevan Singh Bhoot
1. Introduction
Large language models (LLMs), such as GPT [1] and Llama [2, 3], have revolutionised natural language processing with their remarkable performance, but their computational demands present significant challenges. Efficient model inference is crucial for practical deployment, and quantization is a key technique for reducing the cost and latency of inference. While weight quantization techniques such as GPTQ [4], QuIP# [5], and AQLM [6] have gained popularity for reducing memory usage and accelerating inference of LLMs, less attention has been given to quantizing activations.
At Myrtle.ai, we are working to enable efficient LLM inference. In this blog post, we show that block floating point formats, which are natively supported on a wide range of new silicon devices, are effective alternative numerical formats for LLMs. Specifically, we show that it is possible to quantize both weights and activations in Llama3 [3] to block floating point 16 with minimal to zero quality degradation, providing an 8x speedup for the compute-bound prompt-processing phase compared to FP16/bfloat16 on AI-optimized FPGAs.
1.1 Weight and Activation Quantization Difficulty
Weight quantization of LLMs is well understood because the distribution of weights is approximately uniform. Activation quantization, however, is challenging: outliers in activations can be up to 100x larger than most values [7], causing significant quantization errors.
LLM.int8() [8] achieves 8-bit activation quantization by employing a mixed-precision scheme, with outliers kept in FP16 and other values quantized to INT8, which is difficult to implement efficiently on specialised hardware. SmoothQuant [7] offers INT8 activation quantization by smoothing the activation distribution and shifting the quantization difficulty from activations to weights. This process adds complexity by requiring offline transformations, additional smoothing factors, and a calibration dataset.
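For intuition, SmoothQuant's smoothing amounts to a mathematically equivalent per-channel rescaling: activations are divided by per-channel factors while the matching input channels of the weight matrix are scaled up by the same factors, so activation outliers shrink but the matrix product is unchanged. A minimal sketch with toy values (the smoothing exponent alpha = 0.5 follows the default suggested in [7]):

```python
import torch

torch.manual_seed(0)

# Toy activation matrix X (tokens x channels) with one outlier channel,
# and weight matrix W (channels x out_features).
X = torch.tensor([[0.2, 0.1, 50.0],
                  [0.3, 0.2, 80.0]])
W = torch.randn(3, 4)

# Per-channel smoothing factors: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
alpha = 0.5
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)

X_smooth = X / s           # activation outliers are damped
W_smooth = W * s[:, None]  # quantization difficulty shifts into the weights

# The product is unchanged (up to floating point error), so the rescaling
# can be folded into the weights offline before quantization.
assert torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3)
```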
NVIDIA’s Hopper and Ada Lovelace architecture GPUs, such as the H100, natively support FP8 [9]. NVIDIA’s TensorRT-LLM library and vLLM use this to enable 8-bit weight and 8-bit activation (W8A8) quantization. Both libraries add overhead by calculating static scaling factors for the weights and dynamic scaling factors for the activations to preserve accuracy, which limits latency improvements [10].
Furthermore, Meta applies FP8 quantization to Llama3 [3] using its own kernels. However, it does not quantize any parameters in the self-attention layers, nor the first and last transformer blocks. As with TensorRT-LLM and vLLM, Meta’s FP8 quantization relies on dynamic scaling factors to limit degradation in model accuracy.
1.2 Block Floating Point Formats
Figure 1: Block floating point configurations compared with other standard number formats.
Block floating point 16 (BFP16) and block floating point 12 (BFP12), shown in Figure 1, have been demonstrated in previous work [11, 12] to be suitable number formats for deep neural networks such as CNNs, RNNs, and small transformers (e.g. BERT). Achronix Speedster7t, Intel Stratix 10 NX, and Intel Agilex 5 series FPGAs natively support block floating point operations, making these formats highly attractive for LLM inference on such devices. Given that earlier work has highlighted the non-trivial nature of quantizing LLMs, it was essential to verify that these formats remain effective for larger models such as Llama3-70B.
Here’s how these formats work (a code sketch follows the list):
- BFP16:
  - 1 sign bit and 7 mantissa bits.
  - Shared 8-bit exponent across a block of 8 values.
  - Effective bit count: 9 bits per number.
- BFP12:
  - 1 sign bit and 3 mantissa bits.
  - Shared 8-bit exponent across a block of 16 values.
  - Effective bit count: 4½ bits per number.
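To make the arithmetic concrete, below is a minimal PyTorch sketch of BFP quantize/dequantize under one reasonable reading of the format: each block shares the exponent of its largest magnitude, and individual values keep a sign bit plus the mantissa bits, rounded to nearest. It is an illustrative simulation only, not the kernel that runs on the FPGA.

```python
import torch

def bfp_quantize(x: torch.Tensor, mantissa_bits: int = 7, block_size: int = 8) -> torch.Tensor:
    """Simulate block floating point by quantizing then dequantizing x (RTN).

    Each block of `block_size` values along a flattened view shares one
    exponent, chosen from the block's largest magnitude; every value keeps
    a sign bit plus `mantissa_bits` of magnitude.
    """
    assert x.numel() % block_size == 0
    orig_shape = x.shape
    blocks = x.reshape(-1, block_size)

    # Shared exponent: exponent of the largest magnitude in each block.
    max_abs = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=torch.finfo(blocks.dtype).tiny)
    shared_exp = torch.floor(torch.log2(max_abs))

    # Power-of-two scale so the largest value uses the full mantissa range.
    scale = torch.exp2(shared_exp - (mantissa_bits - 1))
    qmax = 2 ** mantissa_bits - 1  # 127 for BFP16, 7 for BFP12
    q = torch.clamp(torch.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(orig_shape)

# Effective storage per value:
#   BFP16: (8  * (1 + 7) + 8) / 8  = 9   bits
#   BFP12: (16 * (1 + 3) + 8) / 16 = 4.5 bits
x = torch.randn(4, 8)
x_bfp16 = bfp_quantize(x, mantissa_bits=7, block_size=8)
x_bfp12 = bfp_quantize(x, mantissa_bits=3, block_size=16)
```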
The block floating point scheme provides a better dynamic range than fixed-point arithmetic and achieves accuracy close to traditional floating point, all while maintaining the efficiency of integer processing. This approach offers advantages over pure INT8 quantization in terms of precision, though it does not reach the same level of accuracy as FP16.
2. Methodology
We simulated block floating point in PyTorch and evaluated our models using the EleutherAI LM evaluation harness on the six benchmarks constituting the HuggingFace Open LLM Leaderboard v1: AI2 Reasoning Challenge (25-shot), HellaSwag (10-shot), MMLU (5-shot), TruthfulQA (0-shot), Winogrande (5-shot), and GSM8K (5-shot).
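To give a flavour of how such a simulation can be plugged into a model before evaluation, the sketch below wraps the linear layers of a HuggingFace Llama checkpoint with the `bfp_quantize` helper from Section 1.2. It is a simplified illustration, not our exact harness integration: it only covers the nn.Linear projections and omits the attention score matmuls and the precision exceptions described in Section 2.2, and the model name is just an example.

```python
import torch
from torch import nn
from transformers import AutoModelForCausalLM

class BFPLinear(nn.Module):
    """Wrap an nn.Linear, simulating BFP16 for both its weights and activations."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        # Weights are quantized once, offline: pure RTN, no calibration data.
        with torch.no_grad():
            self.linear.weight.copy_(bfp_quantize(self.linear.weight))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activations are quantized on the fly before the matmul.
        return self.linear(bfp_quantize(x))

def wrap_linears(module: nn.Module) -> None:
    """Recursively replace every nn.Linear with its BFP-simulated wrapper."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, BFPLinear(child))
        else:
            wrap_linears(child)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
wrap_linears(model)
# The wrapped model is then passed to the evaluation harness exactly as the
# unquantized baseline would be.
```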
2.1 BFP16 Weights
We report the average score across the 6 benchmarks for BFP16 weight quantization in Table 1.
| Weight Precision | 8B | 70B |
| --- | --- | --- |
| bfloat16 | 68.11 | 79.16 |
| BFP16 | 67.84 | 78.94 |
Table 1: Average score on HuggingFace Open LLM Leaderboard v1 with BFP16 weight quantization for Llama3-Instruct.
The results show that using BFP16 weights results in just a 0.22% absolute degradation compared to the unquantized baseline for Llama3-70B. This is an entirely post-training quantization (PTQ) method: it requires no calibration data and uses a simple round-to-nearest (RTN) approach. The scheme reduces the average bits per parameter from 16 to 9, cutting the memory consumption of the weights by roughly 44%.
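As a quick back-of-the-envelope check on the weight footprint (using an approximate parameter count of 70 billion and decimal gigabytes):

```python
# Approximate weight memory for Llama3-70B (~70e9 parameters, decimal GB).
params = 70e9
bf16_gb = params * 16 / 8 / 1e9    # ~140 GB at 16 bits per parameter
bfp16_gb = params * 9 / 8 / 1e9    # ~79 GB at an effective 9 bits per parameter
print(f"{bf16_gb:.0f} GB -> {bfp16_gb:.0f} GB "
      f"({1 - bfp16_gb / bf16_gb:.1%} reduction)")   # ~43.8% reduction
```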
2.2 BFP16 Weights & Activations
Figure 2 illustrates our precision mapping. We quantize all operations to BFP16, except for softmax, SiLU, and RMSNorm, which remain in bfloat16 to preserve accuracy. These operations account for only a small fraction of the compute compared to the matrix multiplications that dominate a transformer block.
Figure 2: Precision configuration for the activations in a Llama transformer block.
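For reference, the mapping in Figure 2 can be summarised as a simple configuration. The operation labels below are descriptive rather than the actual module names used in any particular implementation:

```python
# Illustrative precision map for one Llama transformer block.
PRECISION_MAP = {
    # Linear projections: weights and activations in BFP16.
    "attention.q_proj": "bfp16",
    "attention.k_proj": "bfp16",
    "attention.v_proj": "bfp16",
    "attention.o_proj": "bfp16",
    "mlp.gate_proj": "bfp16",
    "mlp.up_proj": "bfp16",
    "mlp.down_proj": "bfp16",
    # Attention score matmuls (QK^T and scores x V) in BFP16.
    "attention.scores_matmul": "bfp16",
    "attention.context_matmul": "bfp16",
    # Elementwise SwiGLU gating product in BFP16.
    "mlp.gate_mul": "bfp16",
    # Kept in bfloat16 to preserve accuracy.
    "attention.softmax": "bfloat16",
    "mlp.silu": "bfloat16",
    "input_rmsnorm": "bfloat16",
    "post_attention_rmsnorm": "bfloat16",
}
```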
Most existing schemes (such as GPTQ, QuIP#, and vLLM) avoid quantizing the weights of the final linear projection to the vocabulary because they found severe accuracy degradation when doing so. However, we found that BFP16 effectively quantizes both the activations and weights of the final linear projection, without the degradation experienced by other methods.
2.2.1 BFP16 Preserves Accuracy Compared to bfloat16
Table 2 shows the results for quantizing both the weights and activations in Llama3 to BFP16.
| Weights | Activations | WxAx | Average Score (8B) | Average Score (70B) |
| --- | --- | --- | --- | --- |
| bfloat16 | bfloat16 | W16A16 | 68.11 | 79.16 |
| FP8 (vLLM) | FP8 (vLLM) | W8A8 | 68.22 | 79.16 |
| INT8 (LLM.int8()) | INT8 (LLM.int8()) | W8A8 | 68.08 | 35.68* |
| BFP16 | bfloat16 | W9A16 | 67.84 | 78.94 |
| BFP16 | BFP16 | W9A9 | 68.15 | 79.09 |
Table 2: Average score on HuggingFace Open LLM Leaderboard v1 with BFP16 weight and activation quantization for Llama3-Instruct, compared to other schemes.
*The 35.68 score for LLM.int8() on the 70B model is unexpectedly low and may be due to a bug or configuration issue.
The results show that quantizing both weights and activations to BFP16 results in no degradation for Llama3-8B compared to bfloat16, and just 0.07% absolute degradation for Llama3-70B. This quantization scheme essentially provides 9-bit weights and 9-bit activations (W9A9), which yields up to an 8x improvement in compute performance on FPGAs with native BFP16 support, compared to 16-bit floating point computation. Furthermore, on a typical FPGA, BFP16 reduces the logic resources required for Multiply-Accumulate (MAC) units by approximately 8x compared with FP16/bfloat16, thanks to the efficiency of integer operations relative to floating point and the reduced data bus width.
These results show that the BFP16 format mitigates outliers and improves dynamic range over fixed-point arithmetic. Unlike INT8, which fails to manage outliers adequately [13] and necessitates methods like LLM.int8() to use mixed precision, BFP16 handles these outliers effectively, resulting in minimal quality loss.
Unexpectedly, accuracy increases slightly when moving from W9A16 to W9A9, i.e. with more aggressive quantization. We have observed ±0.5% accuracy fluctuations due to noise and factors such as upgrading the EleutherAI evaluation harness, indicating that these differences are within the margin of error and that BFP16 quantization results in minimal to zero degradation.
2.2.2 Comparison to FP8 and LLM.int8()
BFP16 maintains higher precision and dynamic range than standard FP8, and achieves accuracy comparable to vLLM FP8. However, vLLM’s FP8 quantization scheme relies on FP32 weight and activation scale factors and does not quantize the linear projection to the vocabulary. In contrast, our BFP16 quantization scheme achieves effectively zero quality degradation through a simple PTQ flow with RTN quantization, without the need for complex runtime adjustments.
Given the ±0.5% fluctuations observed in our measurements, BFP16 and FP8 achieve very similar accuracy levels, making them essentially equivalent in quality.
For Llama3-8B, BFP16 and LLM.int8() achieve similar scores. However, BFP16 greatly outperforms LLM.int8() for Llama3-70B.
2.2.3 Further Compression to BFP12
| Weights | Activations | WxAx | Average Score (Llama3-8B) |
| --- | --- | --- | --- |
| bfloat16 | bfloat16 | W16A16 | 68.11 |
| BFP12 | bfloat16 | W4½A16 | 29.87 |
| BFP12 | BFP12 | W4½A4½ | 29.59 |
Table 3: Average score on HuggingFace Open LLM Leaderboard v1 with BFP12 weight and activation quantization for Llama3-Instruct 8B.
We repeated our experiments using the BFP12 format, reducing the per-value representation to 4 bits (1 sign bit and 3 mantissa bits) and doubling the block size sharing the 8-bit exponent to 16. Table 3 shows a notable drop in accuracy with BFP12 RTN quantization: for example, BFP12 weights and activations result in a 57% relative decrease in quality for the 8B variant compared to the bfloat16 baseline. With so few mantissa bits, block floating point 12 cannot sufficiently represent the wide range of activation values, especially the outliers.
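The failure mode is easy to reproduce with the `bfp_quantize` sketch from Section 1.2: once an outlier sets the block's shared exponent, a 3-bit mantissa leaves almost no resolution for the remaining values. The numbers below are purely illustrative.

```python
import torch

# One activation block of 16 values containing a single large outlier.
block = torch.tensor([10.0] + [0.8] * 15)

bfp16 = bfp_quantize(block, mantissa_bits=7, block_size=8)   # BFP16 blocking
bfp12 = bfp_quantize(block, mantissa_bits=3, block_size=16)  # BFP12 blocking

print(bfp16[1].item())  # 0.75: 7 mantissa bits keep the small values usable
print(bfp12[1].item())  # 0.0:  3 mantissa bits round them away entirely
```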
3. Conclusion
The successful implementation of BFP16 weight and activation quantization demonstrates the potential for substantial improvements in computational efficiency for LLMs without significant loss of accuracy. This advancement enables deploying powerful language models in resource-constrained environments.
Converting from W16A16 to W9A9 with BFP16 provides up to an 8x speedup for the compute-bound prompt-processing phase on AI-optimized FPGAs and roughly 44% memory savings on the model’s weights. BFP16 also provides an 8-fold reduction in MAC area compared to FP16. This makes real-time LLM applications more feasible and cost-effective.
BFP16 emerges as a compelling alternative to FP8 for silicon devices with native support. With an effective bit rate of 9 bits, BFP16 achieves nearly the same compression ratio as FP8. In addition, unlike FP8, BFP16 doesn’t require outlier scaling factors in order to preserve accuracy due to its enhanced mantissa and shared exponent configuration.
This research provides immediate performance gains and also opens new avenues for further research in model compression and optimization, by showcasing efficient activation quantization.
4. References
[1] T. B. Brown et al., “Language Models are Few-Shot Learners”, arXiv [cs.CL]. 2020.
[2] H. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models”, arXiv [cs.CL]. 2023.
[3] A. Dubey et al., “The Llama 3 Herd of Models”, arXiv [cs.AI]. 2024.
[4] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”, arXiv [cs.LG]. 2023.
[5] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. D. Sa, “QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks”, arXiv [cs.LG]. 2024.
[6] V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh, “Extreme Compression of Large Language Models via Additive Quantization”, arXiv [cs.LG]. 2024.
[7] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models”, arXiv [cs.CL]. 2024.
[8] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”, arXiv [cs.LG]. 2022.
[9] P. Micikevicius et al., “FP8 Formats for Deep Learning”, arXiv [cs.LG]. 2022.
[10] “FP8 Quantization”, vLLM Documentation, [Online]. Available: https://docs.vllm.ai/en/stable/quantization/fp8.html. [Accessed: 30-Jul-2024].
[11] Z. Song, Z. Liu, and D. Wang, “Computation Error Analysis of Block Floating Point Arithmetic Oriented Convolution Neural Network Accelerator Design”, arXiv [cs.LG]. 2017.
[12] B. Darvish Rouhani et al., “Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point”, in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 10271–10281.
[13] A. Kuzmin, M. Van Baalen, Y. Ren, M. Nagel, J. Peters, and T. Blankevoort, “FP8 Quantization: The Power of the Exponent”, arXiv [cs.LG]. 2024.