🤖 AI Summary
Existing low-bit quantization of large language models (LLMs) faces a fundamental trade-off: post-training quantization (PTQ) is computationally efficient but suffers significant accuracy degradation, whereas quantization-aware training (QAT) achieves high accuracy at the cost of prohibitive memory and computational overhead from backpropagation, which hinders practical deployment. This paper proposes the first backpropagation-free QAT framework, using zeroth-order optimization with forward gradient estimation to jointly optimize quantized weights, weight clipping thresholds, and equivalent affine transformations. This co-optimization simultaneously reduces quantization error and suppresses the impact of activation outliers. Experiments demonstrate that the method attains QAT-level accuracy at 4–6 bits while reducing memory footprint and computational cost to levels comparable with PTQ, substantially improving the practicality and deployability of low-bit LLM quantization.
📝 Abstract
Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. While quantization-aware training (QAT) provides a principled solution, its reliance on backpropagation incurs prohibitive data, time, and memory costs, limiting its practicality. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework. ZeroQAT leverages forward-only gradient estimation to eliminate the need for backpropagation, significantly reducing computational and memory overhead while retaining the benefits of end-to-end optimization. Moreover, ZeroQAT jointly learns quantized weights, weight clipping thresholds, and equivalent transformations to mitigate quantization error and handle activation outliers. Experiments demonstrate that ZeroQAT achieves the efficiency of PTQ while retaining the accuracy of QAT, offering a practical solution for high-quality low-bit quantization of LLMs.
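To make the forward-only idea concrete, here is a minimal sketch of SPSA-style zeroth-order gradient estimation, the general technique the abstract describes: the gradient is approximated from two forward evaluations of the loss along a random perturbation, so no backpropagation (and no activation storage) is needed. The toy quadratic loss and all names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss standing in for a quantized model's training loss.
    return float(np.sum((w - 1.0) ** 2))

def spsa_grad(loss_fn, w, eps=1e-3, rng=None):
    """Estimate the gradient from two forward passes along a random
    Rademacher perturbation z (no backward pass required)."""
    rng = rng or np.random.default_rng(0)
    z = rng.choice([-1.0, 1.0], size=w.shape)            # perturbation direction
    g = (loss_fn(w + eps * z) - loss_fn(w - eps * z)) / (2 * eps)
    return g * z  # directional derivative projected back onto z

# Forward-only "training" loop: plain SGD with the estimated gradient.
w = np.zeros(4)
for step in range(500):
    w -= 0.05 * spsa_grad(loss, w, rng=np.random.default_rng(step))
print(loss(w))  # converges toward 0 as w approaches the optimum at 1
```

Because only function values are needed, the same estimator can optimize non-differentiable quantities such as hard-rounded weights and clipping thresholds, which is what makes this family of methods attractive for QAT.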