🤖 AI Summary
This work addresses the severe accuracy degradation that large language models suffer on complex reasoning tasks under low-bit quantization. The authors propose Reasoning-QAT, a quantization-aware training framework tailored to reasoning, which integrates knowledge distillation, post-training quantization (PTQ) initialization, reinforcement-learning fine-tuning, and domain-aligned calibration. Notably, the approach demonstrates the first stable and effective reinforcement-learning fine-tuning at 2-bit precision, along with the general efficacy of knowledge distillation in QAT and the advantage of PTQ as a strong initialization strategy. Experimental results show that Reasoning-QAT substantially outperforms existing PTQ methods across multiple models and reasoning benchmarks, e.g. a 44.53% improvement over GPTQ on MATH-500 with Qwen3-0.6B.
📝 Abstract
Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. Post-training quantization (PTQ) can improve inference efficiency, but usually at the cost of large accuracy drops, especially on reasoning tasks in low-bit settings. In this work, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models. Our key findings are: (1) knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) reinforcement learning remains feasible for quantized models given a viable cold start and yields additional gains; and (4) aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves final accuracy. Finally, we consolidate these findings into an optimized workflow, Reasoning-QAT, and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on Qwen3-0.6B it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.
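The two core ingredients of such a QAT recipe are (a) simulating low-bit weights during training via "fake quantization" (round to a low-bit grid, dequantize, and bypass the rounding on the backward pass with a straight-through estimator) and (b) a knowledge-distillation objective that matches the quantized student's output distribution to the full-precision teacher's. A minimal NumPy sketch of both pieces, for illustration only: the function names, the symmetric per-tensor quantizer, and the temperature-scaled KL loss are generic textbook choices, not the paper's exact implementation.

```python
import numpy as np

def fake_quantize(w, bits=2):
    # Symmetric uniform quantizer: map weights onto a signed `bits`-bit grid,
    # then dequantize back to floats so training stays in floating point.
    # (In real QAT, the non-differentiable round() is skipped on the backward
    # pass -- the straight-through estimator.)
    qmax = 2 ** (bits - 1) - 1            # e.g. 1 for signed 2-bit
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Forward KL divergence between temperature-softened teacher and student
    # distributions, averaged over the batch and rescaled by T^2 (standard
    # distillation convention so gradients are comparable across temperatures).
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
    return float(kl / p.shape[0]) * T * T

# Illustrative step: quantize a weight matrix and score the quantized
# student's logits against the full-precision teacher's.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))
x = rng.normal(size=(2, 4))
w_q = fake_quantize(w, bits=2)
loss = kd_loss(x @ w_q, x @ w)  # distillation loss to minimize during QAT
```

At 2 bits the grid has only four levels, which is why a strong initialization (hence the PTQ-initialized start) and a soft distillation target matter so much more than at 4 or 8 bits.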