Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of error propagation in multimodal large language models (MLLMs) during complex reasoning, which often stems from unstable visual attention and uncorrectable early visual misalignment. To mitigate this, the authors propose SAYO, a novel model that introduces, for the first time in MLLMs, a reinforcement learning reward mechanism grounded in region-level visual attention. This approach explicitly aligns optimization signals with visually grounded reasoning steps, effectively alleviating the credit assignment problem in attention mechanisms. Integrated with chain-of-thought reasoning, SAYO significantly enhances performance across multiple multimodal benchmarks, demonstrating robust improvements in both perception-intensive and diverse reasoning tasks.

📝 Abstract
While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.
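The paper does not spell out the reward formula here, but the core idea of a region-level visual-attention reward can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: it assumes the reward is the fraction of a reasoning step's attention mass that falls inside an annotated image region, with the function name, region encoding, and formula all invented for exposition.

```python
import numpy as np

def region_attention_reward(attn_map, region, eps=1e-8):
    """Illustrative (not the paper's) region-level attention reward.

    attn_map: 2D array of non-negative attention weights over image patches.
    region:   (row0, row1, col0, col1) half-open patch-index bounds of the
              target region for the current reasoning step.
    Returns the fraction of total attention mass inside the region, in [0, 1].
    """
    r0, r1, c0, c1 = region
    inside = attn_map[r0:r1, c0:c1].sum()
    total = attn_map.sum()
    return float(inside / (total + eps))

# Toy example: a 4x4 attention map fully concentrated in the top-left quadrant
# yields a reward near 1.0 when that quadrant is the target region.
attn = np.zeros((4, 4))
attn[0:2, 0:2] = 1.0
reward = region_attention_reward(attn, (0, 2, 0, 2))
```

A dense reward of this kind could then be mixed into the RL objective alongside the usual answer-correctness signal, which is one plausible way to give each reasoning step visually grounded credit.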
Problem

Research questions and friction points this paper is trying to address: multimodal LLMs, visual attention, credit assignment, reasoning, error propagation.
Innovation

Methods, ideas, or system contributions that make the work stand out: visual attention, reinforcement learning, multimodal LLMs, region-level reward, credit assignment.
Siqu Ou
TeleAI; Shanghai Jiao Tong University
Tianrui Wan
TeleAI; Northwestern Polytechnical University
Zhiyuan Zhao
TeleAI
Junyu Gao
NWPU
Computer Vision; Machine Learning; Crowd Analysis
Xuelong Li
TeleAI