🤖 AI Summary
This work addresses the severe accuracy degradation that large language models suffer on complex reasoning tasks under low-bit quantization. The authors propose Reasoning-QAT, a quantization-aware training framework tailored to reasoning, which integrates knowledge distillation, post-training quantization (PTQ) initialization, reinforcement-learning fine-tuning, and domain-aligned calibration. Notably, the approach demonstrates the first stable and effective reinforcement-learning fine-tuning at 2-bit precision, along with the general efficacy of knowledge distillation in QAT and the advantage of PTQ as a strong initialization strategy. Experimental results show that Reasoning-QAT substantially outperforms existing PTQ methods across multiple models and reasoning benchmarks, e.g. a 44.53% improvement over GPTQ on MATH-500 with Qwen3-0.6B.
📝 Abstract
Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. Post-training quantization (PTQ) can improve inference efficiency, but usually at the cost of large accuracy drops, especially on reasoning tasks in low-bit settings. In this work, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models. Our key findings are: (1) knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) reinforcement learning remains feasible for quantized models given a viable cold start and yields additional gains; and (4) aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves final accuracy. Finally, we consolidate these findings into an optimized workflow, Reasoning-QAT, and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on Qwen3-0.6B it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.
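The two core ingredients of such a QAT recipe are (a) simulating low-bit weights during training via "fake quantization" (round to a low-bit grid, dequantize, and bypass the rounding on the backward pass with a straight-through estimator) and (b) a knowledge-distillation objective that matches the quantized student's output distribution to the full-precision teacher's. A minimal NumPy sketch of both pieces, for illustration only: the function names, the symmetric per-tensor quantizer, and the temperature-scaled KL loss are generic textbook choices, not the paper's exact implementation.

```python
import numpy as np

def fake_quantize(w, bits=2):
    # Symmetric uniform quantizer: map weights onto a signed `bits`-bit grid,
    # then dequantize back to floats so training stays in floating point.
    # (In real QAT, the non-differentiable round() is skipped on the backward
    # pass -- the straight-through estimator.)
    qmax = 2 ** (bits - 1) - 1            # e.g. 1 for signed 2-bit
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Forward KL divergence between temperature-softened teacher and student
    # distributions, averaged over the batch and rescaled by T^2 (standard
    # distillation convention so gradients are comparable across temperatures).
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
    return float(kl / p.shape[0]) * T * T

# Illustrative step: quantize a weight matrix and score the quantized
# student's logits against the full-precision teacher's.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))
x = rng.normal(size=(2, 4))
w_q = fake_quantize(w, bits=2)
loss = kd_loss(x @ w_q, x @ w)  # distillation loss to minimize during QAT
```

At 2 bits the grid has only four levels, which is why a strong initialization (hence the PTQ-initialized start) and a soft distillation target matter so much more than at 4 or 8 bits.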