ZeroQAT: Your Quantization-aware Training but Efficient

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing low-bit quantization of large language models (LLMs) faces a fundamental trade-off: post-training quantization (PTQ) is computationally efficient but suffers significant accuracy degradation, whereas quantization-aware training (QAT) achieves high accuracy at the cost of prohibitive memory and compute overhead from backpropagation, hindering practical deployment. This paper proposes ZeroQAT, a backpropagation-free QAT framework that uses zeroth-order optimization and forward-only gradient estimation to jointly optimize quantized weights, clipping thresholds, and equivalent affine transformations. This co-optimization simultaneously mitigates quantization error and suppresses the impact of activation outliers. Experiments demonstrate that ZeroQAT attains QAT-level accuracy at 4–6 bits while reducing memory footprint and computational cost to levels comparable with PTQ, substantially improving the practicality and deployability of low-bit LLM quantization.

📝 Abstract
Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. While quantization-aware training (QAT) provides a principled solution, its reliance on backpropagation incurs prohibitive data, time, and memory costs, limiting its practicality. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework. ZeroQAT leverages forward-only gradient estimation to eliminate the need for backpropagation, significantly reducing computational and memory overhead while retaining the benefits of end-to-end optimization. Moreover, ZeroQAT jointly learns quantized weights, weight clipping thresholds, and equivalent transformations to mitigate quantization error and handle activation outliers. Experiments demonstrate that ZeroQAT achieves the efficiency of PTQ while retaining the accuracy of QAT, offering a practical solution for high-quality low-bit quantization of LLMs.
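The abstract's notion of quantizing weights under a learnable clipping threshold can be illustrated with simulated ("fake") uniform quantization: clip weights to a learned range, snap them to a low-bit grid, and map them back to floats so the quantization error is visible during optimization. The following is a minimal NumPy sketch under that assumption; the function name and parameters are illustrative, not ZeroQAT's actual API:

```python
import numpy as np

def fake_quantize(w, n_bits=4, clip=1.0):
    """Simulated (fake) uniform quantization: clip weights to [-clip, clip],
    round them onto a (2**n_bits - 1)-step uniform grid, then map the integer
    codes back to floats so downstream layers see the quantization error."""
    levels = 2 ** n_bits - 1               # number of grid steps
    w_c = np.clip(w, -clip, clip)          # learnable clipping threshold
    scale = 2 * clip / levels              # step size of the uniform grid
    q = np.round((w_c + clip) / scale)     # integer code in [0, levels]
    return q * scale - clip                # dequantized weights

w = np.array([-1.7, -0.3, 0.05, 0.9, 2.4])
print(fake_quantize(w, n_bits=4, clip=1.0))
```

Making `clip` a trained parameter, rather than a fixed statistic as in many PTQ pipelines, is what lets the clipping threshold trade outlier range against grid resolution during optimization.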
Problem

Research questions and friction points this paper is trying to address.

Reduces memory costs for quantization-aware training of large language models
Enables low-bit quantization fine-tuning on resource-constrained edge devices
Eliminates backpropagation dependency while maintaining end-to-end optimization benefits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth-order optimization eliminates backpropagation for QAT
Forward-only gradient estimation reduces memory overhead
Lightweight variant freezes most parameters for edge deployment
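The forward-only gradient estimation listed above is typically realized with an SPSA-style zeroth-order estimator: perturb the parameters along a random direction, run two forward passes, and scale the direction by the finite-difference of the losses, so no backpropagation graph is ever stored. A minimal sketch under that assumption (the names and the toy loss are illustrative, not ZeroQAT's actual implementation):

```python
import numpy as np

def zo_gradient(loss_fn, theta, eps=1e-3, seed=0):
    """SPSA-style zeroth-order gradient estimate:
    g ≈ [L(theta + eps*z) - L(theta - eps*z)] / (2*eps) * z,
    computed from two forward passes only (no backpropagation)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)   # random perturbation direction
    loss_plus = loss_fn(theta + eps * z)
    loss_minus = loss_fn(theta - eps * z)
    return (loss_plus - loss_minus) / (2 * eps) * z

# Toy example: L(theta) = ||theta||^2, whose true gradient is 2*theta.
loss = lambda t: float(np.sum(t ** 2))
theta = np.array([1.0, -2.0, 0.5])
g = zo_gradient(loss, theta)
theta_new = theta - 0.1 * g                # one SGD step with the estimate
```

Because only forward evaluations are needed, memory stays close to inference-time levels, which is the efficiency argument behind replacing backpropagation in QAT.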