Stable and Efficient Single-Rollout RL for Multimodal Reasoning

📅 2025-12-20
🤖 AI Summary
Single-rollout reinforcement learning (RL) is unstable and often collapses when used to improve reasoning in multimodal large language models (MLLMs) under the RLVR framework. To address this, the paper proposes MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework. Its core component is an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes. Such mechanisms have been used in group-based RLVR before; the paper's finding is that in the multimodal single-rollout setting they are essential for stability rather than merely beneficial. In in-distribution evaluations, MSSR reaches validation accuracy similar to the group-based baseline (e.g., GRPO) in half the training steps. At an equal number of training steps, it outperforms the group-based baseline and shows consistent generalization improvements across five reasoning-intensive benchmarks.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
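
The contrast between group-based and single-rollout advantage estimation can be sketched roughly as follows. The group-relative normalization follows the commonly described GRPO form (reward minus group mean, divided by group standard deviation). The `RunningBaseline` class, its exponential-moving-average update, and the `momentum` parameter are assumptions made for illustration only; the abstract does not say how MSSR computes its baseline.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own group.

    `rewards` holds the verifiable rewards of several rollouts for ONE prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


class RunningBaseline:
    """Hypothetical group-free baseline: an exponential moving average of rewards.

    This is an illustrative stand-in, not the baseline used by MSSR.
    """

    def __init__(self, momentum=0.9, init=0.0):
        self.momentum = momentum
        self.value = init

    def advantage(self, reward):
        # Advantage of a single rollout relative to the current baseline.
        adv = reward - self.value
        # Update the baseline after the advantage has been computed.
        self.value = self.momentum * self.value + (1.0 - self.momentum) * reward
        return adv


if __name__ == "__main__":
    # Four rollouts for one prompt (group-based).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # One rollout per prompt (group-free).
    baseline = RunningBaseline(momentum=0.5)
    print(baseline.advantage(1.0), baseline.advantage(1.0))
```

In this sketch the group-based path needs several rollouts per prompt to form a mean and standard deviation, while the group-free path needs one rollout but depends on a baseline estimated elsewhere. That dependence is one plausible source of the noisier advantages the abstract associates with single-rollout training.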
Problem

Research questions and friction points this paper is trying to address.

Stable single-rollout RL for multimodal reasoning
Addresses training collapse in multimodal RLVR
Improves efficiency and generalization in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Stabilized Single-Rollout RL framework
Entropy-based advantage-shaping mechanism for stability
Group-free RLVR with adaptive regularization
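
As a rough illustration of what entropy-based advantage shaping can look like, the sketch below adds a capped entropy term to a per-token advantage. The rule itself, the function names, and the `alpha` and `kappa` parameters are hypothetical; this page does not give MSSR's actual formula, so this should be read as one possible form rather than the paper's method.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shape_advantage(advantage, entropy, alpha=0.4, kappa=2.0):
    """Hypothetical shaping rule (an assumption, not MSSR's formula).

    Adds an entropy-proportional term, capped at |advantage| / kappa so the
    shaped value stays bounded relative to the original advantage. With
    kappa > 1 the cap also prevents the sign of the advantage from flipping.
    """
    bonus = min(alpha * entropy, abs(advantage) / kappa)
    return advantage + bonus


if __name__ == "__main__":
    h = token_entropy([0.5, 0.5])        # entropy of a maximally uncertain binary choice
    print(shape_advantage(1.0, h))       # bonus limited by alpha * entropy
    print(shape_advantage(-1.0, 10.0))   # bonus limited by the |advantage| / kappa cap
```

The cap is the design choice to note here: the shaping term can only adjust an advantage by a bounded fraction of its own magnitude, which is one simple way to keep high-entropy tokens from dominating an update.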