🤖 AI Summary
To address high GPU memory consumption, slow inference, and insufficient exploration in reinforcement learning (RL) training of large language models (LLMs), this paper proposes QeRL, a quantization-enhanced RL framework. Methodologically, QeRL leverages NVFP4 quantization not merely for efficiency but as a source of beneficial stochasticity: it is the first work to observe that NVFP4-induced noise increases policy entropy, thereby improving exploration. Building on this insight, QeRL introduces Adaptive Quantization Noise (AQN), a mechanism that dynamically modulates noise intensity during training, and integrates NVFP4 quantization with LoRA for memory-efficient, accelerated rollouts. On a single H100 80GB GPU, QeRL successfully trains a 32B-parameter LLM, achieving more than 1.5× faster rollouts and a substantially reduced memory footprint. With a 7B model it attains 90.8% accuracy on GSM8K and 77.4% on MATH 500, performance competitive with full-parameter fine-tuning.
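The adaptive noise mechanism can be pictured as an annealed, channel-wise perturbation of the dequantized weights. Below is a minimal PyTorch sketch assuming AQN follows an exponentially decaying schedule and injects Gaussian noise per output channel; the schedule shape, the `sigma` range, and the injection point are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def aqn_scale(step: int, total_steps: int,
              sigma_start: float = 1e-2, sigma_end: float = 1e-4) -> float:
    # Exponential decay from sigma_start to sigma_end (assumed schedule).
    t = min(step / max(total_steps, 1), 1.0)
    return sigma_start * (sigma_end / sigma_start) ** t

def inject_channel_noise(weight: torch.Tensor, sigma: float) -> torch.Tensor:
    # One Gaussian sample per output channel, broadcast across the row,
    # mimicking the structured error that block-wise quantization introduces.
    noise = torch.randn(weight.shape[0], 1, device=weight.device) * sigma
    return weight * (1.0 + noise)

# At each rollout step: perturb the (dequantized) weight before sampling,
# so exploration is strong early in training and tapers off later.
w = torch.randn(4096, 4096)
w_noisy = inject_channel_noise(w, aqn_scale(step=100, total_steps=1000))
```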
📝 Abstract
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers a more than 1.5× speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) with a 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
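To make the memory story concrete, the following sketch pairs a frozen fake-quantized base weight with trainable LoRA factors, so only the low-rank matrices receive gradients. The uniform 4-bit rounding stands in for NVFP4 (whose real kernels are hardware-specific), and `QuantizedLoRALinear` with its rank and scaling defaults is a hypothetical construction for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class QuantizedLoRALinear(nn.Module):
    """Frozen fake-quantized base weight plus a trainable LoRA adapter."""

    def __init__(self, in_features: int, out_features: int,
                 rank: int = 16, alpha: float = 32.0, n_bits: int = 4):
        super().__init__()
        w = torch.randn(out_features, in_features) * 0.02
        # Per-tensor symmetric rounding as a stand-in for NVFP4 quantization.
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        w_q = torch.clamp((w / scale).round(), -qmax, qmax) * scale
        self.register_buffer("weight_q", w_q)  # frozen: no grads, no optimizer state
        # Only these low-rank factors are trained during RL.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight_q.t()                      # quantized base path
        delta = (x @ self.lora_a.t()) @ self.lora_b.t()   # low-rank correction
        return base + self.scaling * delta

layer = QuantizedLoRALinear(4096, 4096)
y = layer(torch.randn(2, 4096))  # only lora_a / lora_b receive gradients
```

Because the base weight is a buffer rather than a parameter, gradients and optimizer state exist only for the rank-`r` factors; combined with a 4-bit weight format for the rollout policy, this is the kind of footprint reduction that makes fitting a 32B model on one 80GB GPU plausible.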