What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study

📅 2026-01-21
🤖 AI Summary
This work addresses the significant performance degradation of large language models on complex reasoning tasks under low-bit quantization. The authors propose Reasoning-QAT, a quantization-aware training framework tailored for reasoning optimization, which integrates knowledge distillation, post-training quantization (PTQ) initialization, reinforcement learning fine-tuning, and domain-aligned calibration. Notably, this approach achieves the first stable and effective reinforcement learning fine-tuning at 2-bit precision, demonstrating the general efficacy of knowledge distillation in QAT and the advantage of PTQ as a strong initialization strategy. Experimental results show that Reasoning-QAT substantially outperforms existing PTQ methods across multiple models and reasoning benchmarks, achieving a 44.53% improvement over GPTQ on the MATH-500 dataset with Qwen3-0.6B.

📝 Abstract
Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. While post-training quantization (PTQ) can improve inference efficiency, it usually comes at the cost of large accuracy drops, especially on reasoning tasks under low-bit settings. In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models. Our key findings include: (1) Knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) Reinforcement learning remains feasible for quantized models given a viable cold start and yields additional gains; and (4) Aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves the final accuracy. Finally, we consolidate these findings into an optimized workflow (Reasoning-QAT), and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on Qwen3-0.6B, it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.
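The two core mechanics the abstract leans on, a fake low-bit quantizer in the QAT forward pass and a knowledge-distillation objective, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the symmetric per-tensor quantizer, the absence of a learned clipping range, and the temperature value are all choices made here for brevity.

```python
import math

def fake_quantize(weights, bits):
    """Symmetric per-tensor fake quantization: round weights onto a
    low-bit grid, then map back to float, as in a QAT forward pass.
    (Assumed scheme; the paper may use a different quantizer.)"""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 3 at 3-bit precision
    scale = max(abs(w) for w in weights) / qmax
    # max|w| / scale == qmax exactly, so no extra clamping is needed here
    return [round(w / scale) * scale for w in weights]

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions:
    one common form of the distillation objective that finding (1)
    reports is robust for quantized reasoning models."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In a full QAT loop, `fake_quantize` would be applied to each weight tensor in the forward pass (with a straight-through estimator for gradients), and `kd_loss` would be computed between the full-precision teacher's logits and the quantized student's logits at every training step.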
Problem

Research questions and friction points this paper is trying to address.

low-bit quantization
reasoning LLMs
accuracy drop
inference efficiency
quantization-aware training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization-Aware Training
Reasoning LLMs
Knowledge Distillation
Post-Training Quantization
Reinforcement Learning
Keyu Lv
Shenzhen International Graduate School, Tsinghua University
Manyi Zhang
Huawei Technologies
Xiaobo Xia
Postdoc, National University of Singapore
Data-Centric AI, Trustworthy AI, Machine Learning, Multimodal Learning, AI4Science
Jingchen Ni
Shenzhen International Graduate School, Tsinghua University
Shannan Yan
Shenzhen International Graduate School, Tsinghua University
Xianzhi Yu
Unknown affiliation
AI, HPC
Lu Hou
Huawei Technologies
Chun Yuan
Graduate School at Shenzhen, Tsinghua University
Computer vision, multimedia access control
Haoli Bai
Huawei Technologies
Natural language processing, model compression