🤖 AI Summary
This work addresses the significant accuracy degradation observed when applying block floating-point (BFP) quantization to attention layers in large language models, which limits the achievable memory and computational efficiency gains. To overcome this challenge, the authors propose a co-designed algorithm-hardware framework that enables all-layer BFP activations for the first time. Accuracy is preserved through asymmetric bit allocation and a hybrid offline-online outlier smoothing technique. Complementing this, a reconfigurable hardware architecture is introduced, supporting mixed data formats and integrating real-time FP16-to-BFP conversion with tiled dataflow optimization. Experimental results demonstrate that, compared to prior approaches, the proposed method compresses the KV cache to 4-bit-mantissa BFP with only 0.3% accuracy loss while achieving 3.84× higher area efficiency, 2.03× better energy efficiency, and 3.08× faster inference speed.
📝 Abstract
Large Language Models (LLMs) are powerful but incur high memory and computation costs. Quantization is an effective solution, with INT weights and FP activations being widely adopted to preserve accuracy. Prior works further reduce FP overhead by using block floating point (BFP) activations in linear layers, but fail to extend BFP to attention layers due to severe accuracy degradation, limiting overall efficiency. To address this challenge, we propose Harmonia, an algorithm-hardware co-design framework that enables all-layer BFP activations with a configurable hardware architecture. First, we systematically explore BFP configurations to achieve a better trade-off between accuracy and activation compression across all layers. Second, to reduce KV-cache storage and computation in attention layers, we introduce an asymmetric bit-allocation strategy combined with a hybrid offline-online outlier smoothing technique. This allows aggressive KV-cache compression from FP16 to 4-bit-mantissa BFP with only 0.3% average accuracy loss. Third, to fully exploit all-layer BFP activations, we design dedicated hardware components, including a reconfigurable PE supporting mixed data formats (BFP-INT and BFP-BFP), a real-time FP16-to-BFP converter, and a tiling-aware dataflow to reduce memory traffic. We evaluate Harmonia on GEMM operations in both linear and attention layers across eight widely used LLMs. Compared with prior works, Harmonia achieves 3.84× (up to 5.05×) higher area efficiency, 2.03× (up to 3.90×) better energy efficiency, and 3.08× (up to 4.62×) speedup on average.
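To make the BFP format concrete, below is a minimal illustrative sketch (not Harmonia's implementation) of converting a block of FP16 activations to BFP with one shared exponent and k-bit signed mantissas, the kind of conversion the real-time FP16-to-BFP converter performs in hardware. The function names, block size, and rounding scheme here are assumptions made for illustration only.

```python
# Illustrative sketch only: generic FP16 -> BFP block quantization,
# not the paper's hardware converter. Block size, function names, and
# rounding are assumptions.
import numpy as np

def fp16_block_to_bfp(block: np.ndarray, mantissa_bits: int = 4):
    """Quantize one block of FP16 values to BFP: a single shared exponent
    plus per-element signed integer mantissas of `mantissa_bits` bits."""
    block = block.astype(np.float32)               # avoid FP16 overflow during scaling
    max_abs = float(np.max(np.abs(block)))
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int8), 0
    shared_exp = int(np.floor(np.log2(max_abs)))   # exponent of the largest magnitude
    # Scale so the largest value maps near the top of the signed mantissa range.
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    qmin, qmax = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), qmin, qmax).astype(np.int8)
    return mantissas, shared_exp

def bfp_block_to_fp(mantissas: np.ndarray, shared_exp: int, mantissa_bits: int = 4):
    """Dequantize a BFP block back to FP32 for error inspection."""
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    return mantissas.astype(np.float32) * scale

# Example: quantize a 16-element block and check the reconstruction error.
x = np.random.randn(16).astype(np.float16)
mant, exp = fp16_block_to_bfp(x, mantissa_bits=4)
x_hat = bfp_block_to_fp(mant, exp, mantissa_bits=4)
print("shared exponent:", exp)
print("max abs error:", np.max(np.abs(x.astype(np.float32) - x_hat)))
```

Under the assumed block size of 16 and an 8-bit shared exponent, storage per block drops from 16 × 16 = 256 bits (FP16) to 16 × 4 + 8 = 72 bits, which illustrates the kind of KV-cache compression the abstract refers to.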