🤖 AI Summary
To address the severe instability and frequent training collapse of single-rollout reinforcement learning (RL) for multimodal large language model (MLLM) reasoning under the RLVR paradigm, this paper proposes MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework. Its core component is an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes to prevent collapse; while similar mechanisms have been used in group-based RLVR, the paper shows that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. MSSR substantially improves training compute efficiency, matching the group-based baseline's validation accuracy in half the training steps; at equal step counts, it surpasses group-based baselines (e.g., GRPO) and delivers consistent generalization gains across five reasoning-intensive benchmarks.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this trade-off between training efficiency and stability, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
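To make the abstract's central idea concrete, below is a minimal, hypothetical sketch of entropy-based advantage shaping in a single-rollout setting. The paper's actual formulation is not given here, so every name, signature, and constant (`entropy_shaped_advantage`, `target_entropy`, the moving-baseline advantage, the linear scaling rule) is an illustrative assumption: the only point carried over from the abstract is that the advantage magnitude is adaptively damped as policy entropy falls, which is what prevents the collapse seen in unregularized single-rollout training.

```python
import numpy as np

def entropy_shaped_advantage(reward, baseline, token_probs, target_entropy=0.5):
    """Hypothetical sketch of entropy-based advantage shaping.

    reward:      scalar verifiable reward for the single rollout
    baseline:    scalar baseline (e.g., a moving average; no group mean
                 is available in the single-rollout setting)
    token_probs: (num_tokens, vocab_size) next-token distributions of the
                 policy over the sampled response

    All names and the linear scaling rule are illustrative assumptions,
    not the paper's actual mechanism.
    """
    eps = 1e-12
    # Mean per-token Shannon entropy of the policy over the response.
    entropy = -np.sum(token_probs * np.log(token_probs + eps), axis=-1).mean()
    # Raw single-rollout advantage against the baseline.
    advantage = reward - baseline
    # Adaptive shaping: shrink the advantage toward zero as entropy drops
    # below the target, damping the updates that drive entropy collapse.
    scale = min(1.0, entropy / target_entropy)
    return scale * advantage
```

Under this sketch, a confident (low-entropy) policy receives a down-scaled update while an exploratory policy receives the full advantage, which is one simple way to regularize advantage magnitudes as the abstract describes.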