🤖 AI Summary
Existing low-bit quantization of large language models (LLMs) faces a fundamental trade-off: post-training quantization (PTQ) is computationally efficient but suffers significant accuracy degradation, whereas quantization-aware training (QAT) achieves high accuracy at the cost of prohibitive memory and computational overhead from backpropagation, which hinders practical deployment. This paper proposes the first backpropagation-free QAT framework, using zeroth-order optimization with forward gradient estimation to jointly optimize quantized weights, weight clipping thresholds, and equivalent affine transformations. This co-optimization simultaneously reduces quantization error and suppresses the impact of activation outliers. Experiments demonstrate that the method attains QAT-level accuracy at 4–6 bits while reducing memory footprint and computational cost to levels comparable with PTQ, substantially improving the practicality and deployability of low-bit LLM quantization.
📝 Abstract
Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. While quantization-aware training (QAT) provides a principled solution, its reliance on backpropagation incurs prohibitive data, time, and memory costs, limiting its practicality. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework. ZeroQAT leverages forward-only gradient estimation to eliminate the need for backpropagation, significantly reducing computational and memory overhead while retaining the benefits of end-to-end optimization. Moreover, ZeroQAT jointly learns quantized weights, weight clipping thresholds, and equivalent transformations to mitigate quantization error and handle activation outliers. Experiments demonstrate that ZeroQAT achieves the efficiency of PTQ while retaining the accuracy of QAT, offering a practical solution for high-quality low-bit quantization of LLMs.
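To make the forward-only idea concrete, here is a minimal sketch of SPSA-style zeroth-order gradient estimation, the general technique the abstract describes: the gradient is approximated from two forward evaluations of the loss along a random perturbation, so no backpropagation (and no activation storage) is needed. The toy quadratic loss and all names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss standing in for a quantized model's training loss.
    return float(np.sum((w - 1.0) ** 2))

def spsa_grad(loss_fn, w, eps=1e-3, rng=None):
    """Estimate the gradient from two forward passes along a random
    Rademacher perturbation z (no backward pass required)."""
    rng = rng or np.random.default_rng(0)
    z = rng.choice([-1.0, 1.0], size=w.shape)            # perturbation direction
    g = (loss_fn(w + eps * z) - loss_fn(w - eps * z)) / (2 * eps)
    return g * z  # directional derivative projected back onto z

# Forward-only "training" loop: plain SGD with the estimated gradient.
w = np.zeros(4)
for step in range(500):
    w -= 0.05 * spsa_grad(loss, w, rng=np.random.default_rng(step))
print(loss(w))  # converges toward 0 as w approaches the optimum at 1
```

Because only function values are needed, the same estimator can optimize non-differentiable quantities such as hard-rounded weights and clipping thresholds, which is what makes this family of methods attractive for QAT.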