Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement

📅 2026-01-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Reinforcement learning for training reasoning-capable large language models suffers from high sample and computational costs, particularly in settings that rely on verifiable rewards. To address this challenge, this work proposes Dynamic One-Shot Policy Refinement (DoPR), which dynamically selects the single most informative sample per batch for policy updates through an uncertainty-aware mechanism. DoPR establishes, for the first time, a theoretical lower bound on the sample complexity required to elicit reasoning capabilities, and integrates reward-variance analysis with an exploration-driven sampling strategy to substantially improve training efficiency. Experimental results demonstrate that DoPR achieves competitive reasoning accuracy while reducing rollout costs by nearly an order of magnitude, enabling efficient and scalable post-training of large language models.

📝 Abstract
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training. In this work, we revisit the fundamental question of data and compute efficiency in RLVR. We first establish a theoretical lower bound on the sample complexity required to unlock reasoning capabilities, and empirically validate that strong performance can be achieved with a surprisingly small number of training instances. To tackle the computational burden, we propose Dynamic One-Shot Policy Refinement (DoPR), an uncertainty-aware RL strategy that dynamically selects a single informative training sample per batch for policy updates, guided by reward volatility and exploration-driven acquisition. DoPR reduces rollout overhead by nearly an order of magnitude while preserving competitive reasoning accuracy, offering a scalable and resource-efficient solution for LLM post-training. This approach offers a practical path toward more efficient and accessible RL-based training for reasoning-intensive LLM applications.
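The selection step the abstract describes — scoring each prompt in a batch by the volatility of its rollout rewards plus an exploration bonus, then updating the policy on only the top-scoring sample — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scoring rule (reward variance plus a UCB-style bonus over visit counts) and all function and parameter names are assumptions standing in for the acquisition function described in prose.

```python
import math
import statistics

def select_most_informative(batch_rewards, visit_counts, beta=0.5):
    """Return the index of the single sample to use for the policy update.

    batch_rewards[i] -- verifiable rewards of the K rollouts for prompt i
    visit_counts[i]  -- how often prompt i has been selected before
    beta             -- exploration weight (hypothetical hyperparameter)

    Score = reward variance (uncertainty about the prompt's difficulty)
          + beta / sqrt(1 + visits) (exploration-driven acquisition bonus).
    """
    scores = []
    for rewards, visits in zip(batch_rewards, visit_counts):
        volatility = statistics.pvariance(rewards)   # reward-variance term
        bonus = beta / math.sqrt(1 + visits)         # exploration term
        scores.append(volatility + bonus)
    return max(range(len(scores)), key=scores.__getitem__)

# A prompt the model sometimes solves and sometimes fails (mixed 0/1
# rewards) scores highest: it is the most informative for the update.
idx = select_most_informative(
    batch_rewards=[[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]],
    visit_counts=[0, 0, 0],
)
print(idx)  # → 1
```

Updating on one sample per batch is what yields the near order-of-magnitude rollout savings: rollouts are still generated to score the batch, but the expensive policy-gradient step touches only the selected prompt.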
Problem

Research questions and friction points this paper is trying to address.

resource-efficient reinforcement learning
reasoning large language models
RLVR
sample complexity
rollout cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic One-Shot Policy Refinement
Resource-Efficient Reinforcement Learning
Sample Efficiency
Uncertainty-Aware RL
LLM Reasoning