From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

📅 2026-03-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the performance limitations of multimodal large reasoning models during the cold-start phase, which the authors trace to dispersed visual attention. They identify and name this issue “lazy attention localization,” validate its causal role through training-free attention-modulation interventions at inference time, and propose AVAR, a cold-start framework that integrates visually anchored synthetic data, an attention-guided optimization objective, and a visually anchored reward mechanism. To quantify visual focus, they introduce the Visual Attention Score (VAS) and demonstrate its strong correlation with reasoning performance. Applied to Qwen2.5-VL-7B, the approach achieves an average improvement of 7.0% across seven benchmarks. Ablation studies confirm the effectiveness of each component, with the training-free interventions alone yielding consistent gains of 1–2%.

📝 Abstract
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, yielding performance gains of 1–2% without any retraining. Building on these insights, we further propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
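The abstract describes VAS as the fraction of attention mass a model places on visual tokens, but this page does not give its exact formula. A minimal sketch of one plausible reading, assuming row-normalized attention weights and averaging over layers, heads, and text query positions (the function name, tensor layout, and averaging scheme are assumptions, not the paper's definition):

```python
import numpy as np

def visual_attention_score(attn, visual_mask):
    """Hypothetical VAS-style metric: average attention mass that text
    (non-visual) query positions assign to visual tokens.

    attn:        array of shape (layers, heads, seq, seq); each row
                 (last axis) is an attention distribution summing to 1.
    visual_mask: boolean array of shape (seq,), True at visual tokens.
    """
    text_queries = ~visual_mask
    # Mass each query position places on visual tokens: (layers, heads, seq)
    mass_on_visual = attn[..., visual_mask].sum(axis=-1)
    # Average that mass over text query positions, layers, and heads.
    return float(mass_on_visual[..., text_queries].mean())
```

Under this reading, a model with uniform attention over a sequence that is half visual tokens would score 0.5, and a score near 1.0 would indicate the strongly localized visual attention the paper associates with better multimodal reasoning.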
Problem

Research questions and friction points this paper is trying to address.

cold-start
multimodal reasoning
visual attention
attention localization
Multimodal Large Reasoning Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Attention Score
Cold-Start Initialization
Lazy Attention Localization
Attention-Guided Framework
Multimodal Reasoning