🤖 AI Summary
Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) in real-world settings critically depends on high-quality annotations, which are scarce and often noisy; existing unsupervised approaches—e.g., global entropy minimization—tend to overfit to erroneous labels, degrading the reward-ranking signals essential for GRPO.
Method: We propose a two-stage token-level entropy optimization framework: an exploration stage that maximizes token entropy to improve the robustness of reward gradient estimation, followed by an exploitation stage that minimizes entropy to stabilize policy convergence. The framework unifies external rewards, internal consistency constraints, and entropy regularization.
Contribution/Results: Evaluated on Qwen2-VL and Qwen2.5-VL backbones across diverse noise regimes and tasks, our approach significantly improves noise robustness, consistently outperforming state-of-the-art methods with more stable training and stronger final performance.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and weaken the crucial reward-ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones (Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B), spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
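The two-stage idea above can be sketched as a phase-dependent entropy term added to the RL objective: early in training the term rewards high token-level entropy (exploration), and later it penalizes entropy (exploitation). The following is a minimal NumPy sketch; the function names, the hard phase switch at `switch_frac`, and the coefficient `coef` are illustrative assumptions, not the paper's exact schedule or hyperparameters.

```python
import numpy as np

def token_entropy(logits):
    """Per-token Shannon entropy (in nats) from logits of shape (seq_len, vocab)."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_term(logits, step, total_steps, switch_frac=0.5, coef=0.01):
    """Phase-dependent entropy bonus/penalty added to the policy objective.

    Exploration phase (step < switch_frac * total_steps): +coef * mean entropy,
    encouraging diverse rollouts and intra-group reward variation for GRPO.
    Exploitation phase: -coef * mean entropy, pushing toward confident outputs.
    switch_frac and coef are hypothetical hyperparameters for illustration.
    """
    h = token_entropy(logits).mean()
    sign = 1.0 if step < switch_frac * total_steps else -1.0
    return sign * coef * h
```

With uniform logits the bonus is positive early in training and negative late, illustrating the exploration-to-exploitation flip; in practice this scalar would be added to the GRPO surrogate loss computed on each sampled rollout.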