The Impact of Quantization on Large Reasoning Model Reinforcement Learning

📅 2025-11-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically investigates the impact of quantization on the reinforcement learning (RL) performance of large reasoning models (LRMs), focusing on three post-training strategies: post-training quantization (PTQ), quantization-aware training (QAT), and QLoRA—specifically within unsupervised fine-tuning settings. Experimental results reveal that QAT severely degrades mathematical reasoning capabilities in RL, whereas PTQ and QLoRA not only preserve model accuracy but also yield an average +3.2% performance gain on benchmarks including GSM8K and MATH. This study presents the first empirical evidence that QLoRA exhibits superior efficiency and robustness in RL contexts compared to conventional quantization methods. Furthermore, it introduces a lightweight joint quantization–reinforcement optimization paradigm tailored for LRMs, enabling effective deployment under resource-constrained conditions. The findings provide actionable insights for balancing computational efficiency and reasoning fidelity in quantized RL-based inference systems.

Technology Category

Application Category

📝 Abstract
Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aware training (QAT) are well studied in the context of fine-tuning, how quantization impacts RL in large reasoning models (LRMs) remains an open question. To answer this question, we conducted systematic experiments and discovered a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL optimized counterparts. Our findings suggest that quantization-aware RL training negatively impacted the learning process, whereas PTQ and QLoRA led to greater performance.
Problem

Research questions and friction points this paper is trying to address.

Quantization impacts reinforcement learning in large reasoning models
Performance gap exists between post-RL quantization and quantization-aware training
Quantization-aware RL negatively affects learning while PTQ and QLoRA improve performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization-aware RL training negatively impacts learning
Post-training quantization improves reasoning model performance
QLoRA method enhances quantized model reasoning capabilities
🔎 Similar Papers
No similar papers found.