HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of co-optimization between hardware timing and energy efficiency in large language model (LLM) inference, this paper proposes HALO, a hardware-aware post-training quantization (PTQ) framework. The method incorporates circuit-level critical-path-delay modeling into quantization design, jointly optimizing the timing and energy characteristics of multiply-accumulate (MAC) units while supporting dynamic frequency scaling. It further introduces latency-sensitive weight reordering for low-latency weight deployment, overcoming the limitations of conventional fixed-bitwidth, hardware-agnostic quantization. Evaluations on TPU and GPU platforms show an average 270% inference speedup, a 51% energy reduction, and negligible accuracy degradation (<0.3%), substantially improving LLM inference efficiency on heterogeneous accelerators.
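The paper's actual algorithm is not given in this summary, but the core idea of critical-path-aware quantization can be sketched. Everything below is an illustrative assumption, not HALO's method: a toy delay model (`modeled_delay`) in which a 4-bit weight code's MAC delay grows with its number of set bits (a stand-in for longer carry chains), and a quantizer that snaps each weight to the nearest code whose modeled delay fits a budget.

```python
# Toy sketch only -- modeled_delay, the 4-bit code space, and all constants are
# illustrative assumptions, not HALO's actual delay model or algorithm.

def modeled_delay(code: int) -> float:
    """Assumed MAC critical-path delay: grows with set bits (longer carry chains)."""
    return 1.0 + 0.2 * bin(code & 0xF).count("1")  # arbitrary time units

def quantize_latency_aware(w: float, scale: float, delay_budget: float) -> int:
    """Snap w to the nearest admissible 4-bit code; codes map to levels -8..7."""
    admissible = [c for c in range(16) if modeled_delay(c) <= delay_budget]
    return min(admissible, key=lambda c: abs(w - (c - 8) * scale))

weights = [0.31, -0.07, 0.52, -0.42]
codes = [quantize_latency_aware(w, scale=0.1, delay_budget=1.6) for w in weights]
# The clock period the hardware must tolerate is set by the worst surviving delay.
clock_period = max(modeled_delay(c) for c in codes)
```

Restricting the code space trades a small amount of quantization error for a shorter worst-case critical path, which is what makes the dynamic frequency scaling described above possible.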

📝 Abstract
Quantization is critical for realizing efficient inference of LLMs. Traditional quantization methods are hardware-agnostic, limited to bit-width constraints, and lacking circuit-level insights, such as timing and energy characteristics of Multiply-Accumulate (MAC) units. We introduce HALO, a versatile framework that adapts to various hardware through a Hardware-Aware Post-Training Quantization (PTQ) approach. By leveraging MAC unit properties, HALO minimizes critical-path delays and enables dynamic frequency scaling. Deployed on LLM accelerators like TPUs and GPUs, HALO achieves on average 270% performance gains and 51% energy savings, all with minimal accuracy drop.
Problem

Research questions and friction points this paper is trying to address.

Hardware-aware quantization for LLM acceleration
Minimizing critical-path delays in MAC units
Dynamic frequency scaling for energy efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-aware Post-Training Quantization
Leverages MAC unit properties
Dynamic frequency scaling enabled
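The dynamic-frequency-scaling contribution reduces to simple arithmetic: if restricting weights to low-delay codes shortens the worst MAC critical path, the clock rate can rise in proportion. The two delay figures below are invented purely for illustration and do not come from the paper:

```python
# Illustrative arithmetic only -- both delay figures are made-up assumptions.
baseline_delay_ns = 1.20   # assumed worst-case MAC delay with unconstrained weights
optimized_delay_ns = 0.44  # assumed delay after restricting to low-delay weight codes
speedup = baseline_delay_ns / optimized_delay_ns  # clock can scale by this factor
```

A clock-rate gain of roughly this magnitude (~2.7x) would be consistent with the ~270% average speedup the paper reports.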