VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning

📅 2025-10-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal large language models (MLLMs) treat visual inputs as deterministic conditions, overlooking their inherent ambiguity and uncertainty—leading to insufficient exploration and poor policy robustness in multimodal reasoning. This work pioneers a paradigm shift: relocating the focus of exploration from textual output space to visual input space, modeling images as stochastic contexts. We quantify policy sensitivity to visual perturbations via symmetric KL divergence, thereby establishing an uncertainty-aware exploration mechanism. Our method integrates uncertainty-proportional reward, token entropy reward, and annealed sampling within the GRPO reinforcement learning framework. Evaluated on multiple visual mathematical and general multimodal reasoning benchmarks, it achieves average pass@1 improvements of 2.6–3.7%, significantly boosts pass@4 performance, and effectively mitigates exploration decay during reinforcement fine-tuning.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration, an issue that still persists for multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition, overlooking a critical source of ambiguity and struggling to build policies robust to plausible visual variations. We introduce VOGUE (Visual Uncertainty Guided Exploration), a novel method that shifts exploration from the output (text) to the input (visual) space. By treating the image as a stochastic context, VOGUE quantifies the policy's sensitivity to visual perturbations using the symmetric KL divergence between a "raw" and "noisy" branch, creating a direct signal for uncertainty-aware exploration. This signal shapes the learning objective via an uncertainty-proportional bonus, which, combined with a token-entropy bonus and an annealed sampling schedule, effectively balances exploration and exploitation. Implemented within GRPO on two model scales (Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three visual math benchmarks and 3.7% on three general-domain reasoning benchmarks, while simultaneously increasing pass@4 performance and mitigating the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
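The uncertainty signal described in the abstract can be sketched as a symmetric KL divergence between the policy's next-token distributions under the raw image and a perturbed copy. This is an illustrative reconstruction, not the paper's implementation: the perturbation scheme, batching, and where in the sequence the divergence is measured are all omitted, and the function names are hypothetical.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence D_KL(p||q) + D_KL(q||p) between two
    discrete distributions (here: next-token distributions)."""
    skl = 0.0
    for pi, qi in zip(p, q):
        pi, qi = max(pi, eps), max(qi, eps)  # guard against log(0)
        skl += pi * math.log(pi / qi) + qi * math.log(qi / pi)
    return skl

# Toy next-token distributions from the "raw" and "noisy" image branches.
p_raw = [0.70, 0.20, 0.10]
p_noisy = [0.40, 0.40, 0.20]

# A larger divergence means the policy is more sensitive to visual
# perturbation, i.e., higher visual uncertainty at this step.
uncertainty = symmetric_kl(p_raw, p_noisy)
```

The symmetric form is preferable here to plain KL because neither branch is privileged as the "true" distribution, and it keeps the signal nonnegative and zero only when the two branches agree.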
Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs struggle with exploration during reinforcement learning fine-tuning
Current methods treat visual inputs as fixed, overlooking visual ambiguity and variations
Existing approaches fail to build policies robust to plausible visual perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shifts exploration from text to visual input space
Quantifies policy sensitivity to visual perturbations
Uses uncertainty-proportional bonus to shape learning objective
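The three bullets above can be summarized in one hedged sketch: the verifiable task reward is augmented with an uncertainty-proportional bonus and a token-entropy bonus, while the strength of the visual perturbation is annealed over training. The coefficients, the linear schedule, and all function names below are assumptions for illustration; the paper's actual objective inside GRPO may differ.

```python
def shaped_reward(task_reward, uncertainty, token_entropy,
                  beta_u=0.1, beta_h=0.01):
    """Hypothetical shaped objective: verifiable task reward plus
    bonuses proportional to visual uncertainty and token entropy."""
    return task_reward + beta_u * uncertainty + beta_h * token_entropy

def annealed_noise_scale(step, total_steps, sigma0=0.5):
    """Hypothetical annealed sampling schedule: the visual perturbation
    strength decays linearly, shifting from exploration to exploitation."""
    return sigma0 * max(0.0, 1.0 - step / total_steps)
```

With zero bonuses the shaped reward reduces to the plain verifiable reward, so the shaping only adds exploration pressure where the policy is visually uncertain.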
Rui Liu
Tencent AI Lab, Bellevue, WA; University of Maryland, College Park
Dian Yu
Tencent AI Lab, Bellevue, WA
Tong Zheng
University of Maryland, College Park; Northeastern University
Machine Translation · Language Modeling · Reasoning · Inference
Runpeng Dai
Tencent AI Lab, Bellevue, WA; University of North Carolina, Chapel Hill
Zongxia Li
University of Maryland, College Park
Natural Language Processing · Multimodal Models
Wenhao Yu
Tencent AI Lab, Bellevue, WA
Zhenwen Liang
Tencent AI Lab@Seattle, USA
Natural Language Processing · Math Reasoning · Large Language Models
Linfeng Song
Tencent AI Lab, Bellevue, WA
Haitao Mi
Principal Researcher, Tencent US
Large Language Models
Pratap Tokekar
Associate Professor, University of Maryland
Robotics
Dong Yu
Tencent AI Lab, Bellevue, WA