🤖 AI Summary
This work addresses error propagation in multimodal large language models (MLLMs) during complex reasoning, which often stems from unstable visual attention and early visual misalignment that is never corrected downstream. To mitigate this, the authors propose SAYO, a model that introduces, for the first time in MLLMs, a reinforcement learning reward grounded in region-level visual attention. This reward explicitly aligns optimization signals with visually grounded reasoning steps, easing the credit assignment problem for visual attention. Combined with chain-of-thought reasoning, SAYO yields consistent gains across multiple multimodal benchmarks, covering both perception-intensive and diverse reasoning tasks.
📝 Abstract
While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.
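To make the idea of a region-level attention reward concrete, here is a minimal sketch of one plausible formulation: score a reasoning step by how much of the model's attention mass over image patches falls inside an annotated ground-truth region, then mix that score with the task reward. The function names, the linear mixing weight `alpha`, and the exact reward formula are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def region_attention_reward(attn, region_mask):
    """Fraction of attention mass inside the annotated region.

    attn: (num_patches,) non-negative attention weights over image patches.
    region_mask: (num_patches,) binary mask for the ground-truth region.
    (Hypothetical formulation; the paper's reward may differ.)
    """
    attn = attn / attn.sum()  # normalize to a distribution
    return float((attn * region_mask).sum())

def total_reward(answer_correct, attn, region_mask, alpha=0.5):
    # Blend the task-level reward with the attention-grounding reward.
    # alpha is an assumed mixing coefficient, not from the paper.
    r_task = 1.0 if answer_correct else 0.0
    r_attn = region_attention_reward(attn, region_mask)
    return (1 - alpha) * r_task + alpha * r_attn
```

For example, with uniform attention over 4 patches and a mask covering 2 of them, the attention reward is 0.5; a correct answer with `alpha=0.5` then yields a total reward of 0.75. In an RL loop, this scalar would replace (or augment) the usual answer-only reward when computing advantages.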