🤖 AI Summary
To address the severe instability and frequent training collapse of single-rollout reinforcement learning (RL) for multimodal large language model (MLLM) reasoning under the RLVR paradigm, this paper proposes MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework. Its core component is an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes to prevent collapse; while similar mechanisms have been used in group-based RLVR, the paper shows that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. MSSR substantially improves training compute efficiency, matching the group-based baseline's validation accuracy in half the training steps; at equal step counts, it surpasses group-based baselines (e.g., GRPO) and delivers consistent generalization gains across five reasoning-intensive benchmarks.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this trade-off between training efficiency and stability, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
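To make the abstract's central idea concrete, below is a minimal, hypothetical sketch of entropy-based advantage shaping in a single-rollout setting. The paper's actual formulation is not given here, so every name, signature, and constant (`entropy_shaped_advantage`, `target_entropy`, the moving-baseline advantage, the linear scaling rule) is an illustrative assumption: the only point carried over from the abstract is that the advantage magnitude is adaptively damped as policy entropy falls, which is what prevents the collapse seen in unregularized single-rollout training.

```python
import numpy as np

def entropy_shaped_advantage(reward, baseline, token_probs, target_entropy=0.5):
    """Hypothetical sketch of entropy-based advantage shaping.

    reward:      scalar verifiable reward for the single rollout
    baseline:    scalar baseline (e.g., a moving average; no group mean
                 is available in the single-rollout setting)
    token_probs: (num_tokens, vocab_size) next-token distributions of the
                 policy over the sampled response

    All names and the linear scaling rule are illustrative assumptions,
    not the paper's actual mechanism.
    """
    eps = 1e-12
    # Mean per-token Shannon entropy of the policy over the response.
    entropy = -np.sum(token_probs * np.log(token_probs + eps), axis=-1).mean()
    # Raw single-rollout advantage against the baseline.
    advantage = reward - baseline
    # Adaptive shaping: shrink the advantage toward zero as entropy drops
    # below the target, damping the updates that drive entropy collapse.
    scale = min(1.0, entropy / target_entropy)
    return scale * advantage
```

Under this sketch, a confident (low-entropy) policy receives a down-scaled update while an exploratory policy receives the full advantage, which is one simple way to regularize advantage magnitudes as the abstract describes.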