From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) depends on high-quality annotations, which in real-world settings are scarce and often noisy; existing unsupervised approaches (e.g., global entropy minimization) tend to overfit erroneous labels, degrading the reward-ranking signal essential for GRPO. Method: We propose a two-stage token-level entropy optimization framework: an exploration stage that maximizes token entropy to improve the robustness of reward gradient estimation, followed by an exploitation stage that minimizes entropy to stabilize policy convergence. The method holistically integrates external rewards, internal consistency constraints, and entropy regularization. Contribution/Results: Evaluated on Qwen2-VL and Qwen2.5-VL, our approach significantly improves noise robustness across diverse noise regimes and tasks, outperforming state-of-the-art methods with more stable and superior performance.
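To see why collapsed outputs hurt GRPO, the minimal sketch below (an illustration, not the paper's implementation; the function name, tensor shapes, and `eps` are assumptions) computes the standard group-relative advantage, which z-scores rewards within each group of sampled responses. When the policy collapses onto a single (possibly wrong, noisy-label-matching) answer, every response in the group earns the same reward and the advantages are all zero, so the ranking signal that drives GRPO updates vanishes.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage as used in GRPO-style methods:
    z-score each verifiable reward within its sampled group.

    rewards: (num_groups, group_size) rewards for the G responses
             sampled per prompt. Shapes and eps are illustrative.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# A collapsed group (all identical rewards) yields all-zero advantages,
# while a diverse group yields an informative ranking:
rewards = torch.tensor([[1.0, 1.0, 1.0, 1.0],   # collapsed -> no signal
                        [1.0, 0.0, 1.0, 0.0]])  # diverse   -> usable signal
print(group_relative_advantages(rewards))
```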

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones (Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B), spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
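As a rough sketch of the phased entropy objective the abstract describes (assumed PyTorch-style code; the hard switch step and coefficient `beta` are illustrative choices, and the paper may schedule the transition differently), the snippet below adds a token-level entropy term to a GRPO-style loss whose sign flips from exploration (reward entropy) to exploitation (penalize entropy).

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token policy entropy, H = -sum_v p_v log p_v.

    logits: (batch, seq_len, vocab). Shapes are illustrative.
    """
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

def entropy_coefficient(step: int, switch_step: int, beta: float = 0.01) -> float:
    """Two-stage schedule: positive coefficient early (exploration),
    negative later (exploitation). The hard switch and beta value
    are assumptions for this sketch."""
    return beta if step < switch_step else -beta

def total_loss(policy_loss: torch.Tensor, logits: torch.Tensor,
               step: int, switch_step: int = 500) -> torch.Tensor:
    # Minimizing (policy_loss - coef * H) maximizes entropy while
    # coef > 0 and minimizes it once coef flips sign.
    coef = entropy_coefficient(step, switch_step)
    return policy_loss - coef * token_entropy(logits)
```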
Problem

Research questions and friction points this paper is trying to address.

Addresses noise sensitivity in MLLM training with RLVR under noisy labels
Prevents overfitting to incorrect labels via dynamic entropy optimization
Enhances reward ranking reliability for GRPO through staged training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage token-level entropy optimization for RLVR
Dynamic transition from exploration to exploitation
Enhanced noise tolerance in MLLM training
👥 Authors
Donglai Xu
Independent Researcher
Hongzheng Yang
The Chinese University of Hong Kong
Yuzhi Zhao
Ph.D., City University of Hong Kong; B.Eng., Huazhong University of Science and Technology
Low-level Vision · Computational Photography · LLM · MLLM
Pingping Zhang
City University of Hong Kong
Jinpeng Chen
City University of Hong Kong
Continual Learning · Multimodal Large Language Model
Wenao Ma
The Chinese University of Hong Kong
Zhijian Hou
City University of Hong Kong
Mengyang Wu
The Chinese University of Hong Kong
MLLM · 3D Vision
Xiaolei Li
Hong Kong University of Science and Technology
Senkang Hu
City University of Hong Kong
Ziyi Guan
University of Hong Kong
Jason Chun Lok Li
The University of Hong Kong
Agent · Efficient Neural Networks · Compression · Implicit Neural Representation
Lai Man Po
City University of Hong Kong