Towards Better RL Training Data Utilization via Second-Order Rollout

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses an inefficiency in conventional reinforcement learning for large language model training, which relies solely on first-order rollouts (generating multiple responses per prompt) and neglects the joint optimization of critique capabilities. To overcome this limitation, the authors propose a second-order rollout mechanism that generates multiple critiques for each individual response, enabling a unified framework that refines generation and critique abilities simultaneously. The approach integrates dynamic data augmentation, multi-critique sampling, and a sampling-based reward-denoising technique, and it provides the first systematic analysis of key challenges in critique training, including label imbalance and reward noise. Experimental results demonstrate that, under identical training data, the proposed method significantly outperforms conventional reinforcement learning across multiple models and datasets.

📝 Abstract
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL focuses mainly on improving generation capability by training with only first-order rollout (generating multiple responses for a question). We argue that this approach fails to fully exploit the potential of the training data because it neglects critique capability training. To tackle this problem, we introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach utilizes training data more effectively than vanilla RL and achieves better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.
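The abstract's second-order rollout and sampling-based reward denoising can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the binary "correct"/"incorrect" verdicts, and the majority-vote denoising rule are assumptions made for the example.

```python
import random
from collections import Counter

def second_order_rollout(question, response, critic, n_critiques=5):
    # Second-order rollout (as described in the abstract): sample several
    # critiques of a single response, each reduced to an outcome verdict.
    return [critic(question, response) for _ in range(n_critiques)]

def denoised_reward(verdicts, gold_verdict):
    # Assumed sampling-based denoising: a majority vote over sampled
    # critique verdicts damps the noise of any single outcome judgment.
    majority, _ = Counter(verdicts).most_common(1)[0]
    return 1.0 if majority == gold_verdict else 0.0

# Toy demo with a noisy stub critic that is right 80% of the time.
def noisy_critic(question, response, true_verdict="correct", p_err=0.2):
    return true_verdict if random.random() > p_err else "incorrect"

random.seed(0)
verdicts = second_order_rollout("Q", "R", noisy_critic, n_critiques=7)
print(denoised_reward(verdicts, "correct"))
```

Under these assumptions, a single noisy critique would mislabel a response with probability `p_err`, while the majority over seven samples errs far less often, which is the intuition behind mitigating outcome-reward noise through sampling.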
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Large Language Models
critique capability
training data utilization
second-order rollout
Innovation

Methods, ideas, or system contributions that make the work stand out.

second-order rollout
critique training
joint generation-critique learning
reinforcement learning
data utilization
Zhe Yang
Peking University
Natural Language Processing, Machine Learning

Yudong Wang
Peking University
NLP, LLM, Deep Learning, Machine Learning

Rang Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

Zhifang Sui
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University