🤖 AI Summary
This work addresses error propagation in multimodal large language models (MLLMs) during complex reasoning, which often stems from unstable visual attention and early visual misalignment that is never corrected downstream. To mitigate this, the authors propose SAYO, a model that introduces, for the first time in MLLMs, a reinforcement learning reward grounded in region-level visual attention. This reward explicitly aligns optimization signals with visually grounded reasoning steps, easing the credit assignment problem for visual attention. Combined with chain-of-thought reasoning, SAYO yields consistent gains across multiple multimodal benchmarks, covering both perception-intensive and diverse reasoning tasks.
📝 Abstract
While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.
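To make the idea of a region-level attention reward concrete, here is a minimal sketch of one plausible formulation: score a reasoning step by how much of the model's attention mass over image patches falls inside an annotated ground-truth region, then mix that score with the task reward. The function names, the linear mixing weight `alpha`, and the exact reward formula are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def region_attention_reward(attn, region_mask):
    """Fraction of attention mass inside the annotated region.

    attn: (num_patches,) non-negative attention weights over image patches.
    region_mask: (num_patches,) binary mask for the ground-truth region.
    (Hypothetical formulation; the paper's reward may differ.)
    """
    attn = attn / attn.sum()  # normalize to a distribution
    return float((attn * region_mask).sum())

def total_reward(answer_correct, attn, region_mask, alpha=0.5):
    # Blend the task-level reward with the attention-grounding reward.
    # alpha is an assumed mixing coefficient, not from the paper.
    r_task = 1.0 if answer_correct else 0.0
    r_attn = region_attention_reward(attn, region_mask)
    return (1 - alpha) * r_task + alpha * r_attn
```

For example, with uniform attention over 4 patches and a mask covering 2 of them, the attention reward is 0.5; a correct answer with `alpha=0.5` then yields a total reward of 0.75. In an RL loop, this scalar would replace (or augment) the usual answer-only reward when computing advantages.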