RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
In RLVR frameworks, large language models produce excessively long reasoning sequences and inefficient exploration trajectories, stemming from outcome-only rewards that provide no incentive for efficiency and from high variance in rollout response lengths that introduces noisy optimization signals. Method: We propose Rollout Response Recomposition (RoRecomp), a plug-and-play training paradigm that recomposes rollouts into priority batches (short-correct plus long-incorrect responses, reinforcing the gradient signal for conciseness) and compensation batches (remaining responses drawn from a replay buffer to preserve training stability), integrated with verifiable rewards and online rollout batching. Contribution/Results: RoRecomp reduces average reasoning length by 27.7% in zero RL training; in agentic RL, it decreases tool calls by 46.8% while improving accuracy; and it achieves up to 52.5% length reduction in thinking compression, all with minimal performance impact.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has proven effective in eliciting complex reasoning in large language models (LLMs). However, standard RLVR training often leads to excessively verbose processes (in reasoning tasks) and inefficient exploration trajectories (in agentic settings), as outcome-only rewards provide no incentive for efficiency and the high variance in response length within relatively small rollout groups results in noisy optimization signals. To address this, we propose Rollout Response Recomposition (RoRecomp), a plug-and-play method that guides models toward concise reasoning by strategically recomposing the training data. RoRecomp separates responses into two distinct batch types: 1) priority batches, which combine short-correct and long-incorrect responses selected from online batches to provide a clear gradient signal for brevity, and 2) compensation batches, which utilize remaining responses from a replay buffer to maintain stability and prevent model collapse. To comprehensively evaluate effectiveness, we test RoRecomp across three settings where results demonstrate substantial efficiency gains: reducing reasoning length by 27.7% in zero RL training, reducing unnecessary tool calls by 46.8% while improving accuracy in agentic RL, and achieving up to 52.5% length reduction in thinking compression, all with minimal performance impact.
Problem

Research questions and friction points this paper is trying to address.

Reduces verbose reasoning processes in language models
Improves inefficient exploration in reinforcement learning
Addresses noisy optimization from variable response lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rollout Response Recomposition for concise reasoning guidance
Priority batches combine short-correct and long-incorrect responses
Compensation batches maintain stability using replay buffer responses
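The recomposition step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `recompose_rollouts`, the rollout dictionary keys, and the `priority_fraction` parameter are all hypothetical, and the split sizes are a simplifying assumption.

```python
import random

def recompose_rollouts(rollouts, priority_fraction=0.5, replay_buffer=None):
    """Sketch of RoRecomp-style batch recomposition (hypothetical API).

    Each rollout is a dict with keys 'response', 'length', and 'correct'
    (the verifiable-reward outcome).
    """
    if replay_buffer is None:
        replay_buffer = []
    # Rank correct responses shortest-first and incorrect responses longest-first.
    correct = sorted((r for r in rollouts if r["correct"]),
                     key=lambda r: r["length"])
    incorrect = sorted((r for r in rollouts if not r["correct"]),
                       key=lambda r: r["length"], reverse=True)
    k = max(1, int(len(rollouts) * priority_fraction) // 2)
    # Priority batch: shortest correct + longest incorrect responses,
    # giving a clear gradient signal toward brevity.
    priority_batch = correct[:k] + incorrect[:k]
    # Remaining responses go to the replay buffer; a compensation batch
    # sampled from it maintains stability and prevents model collapse.
    replay_buffer.extend(correct[k:] + incorrect[k:])
    comp_size = min(len(priority_batch), len(replay_buffer))
    compensation_batch = random.sample(replay_buffer, comp_size)
    return priority_batch, compensation_batch, replay_buffer
```

In practice the policy update would weight the priority batch to push toward conciseness while the compensation batch anchors the policy near its current distribution; those loss details are omitted here.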