🤖 AI Summary
Large Vision-Language Models (VLMs) exhibit significant deficiencies in fundamental spatial relation recognition (e.g., “under”, “behind”), primarily due to misalignment between cross-modal attention and the actual spatial positions of objects in images. Through attention trajectory and inter-layer distribution analysis, we identify attention–position misalignment as the key bottleneck. To address this, we propose ADAPTVIS—a training-free, inference-time attention adaptation method that dynamically sharpens or expands visual attention regions based on decoding confidence. ADAPTVIS is the first approach to enhance spatial reasoning robustness from a mechanistic interpretability perspective. Evaluated on benchmarks including WhatsUp and VSR, it achieves up to a 50-percentage-point absolute accuracy improvement, with negligible computational overhead. The code and data are publicly available.
📝 Abstract
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge from the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing attention distribution over the image throughout intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, with marked differences between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS, which uses inference-time confidence scores to sharpen the attention on highly relevant regions when the model is confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50-point absolute improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.
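The confidence-adaptive mechanism described above can be sketched as temperature-style re-scaling of the attention distribution over image tokens. This is a minimal illustration, not the authors' implementation: the parameter names (`alpha_sharp`, `alpha_smooth`, `threshold`) and the specific scaling rule are assumptions for demonstration only.

```python
import numpy as np

def adapt_attention(attn_weights, confidence, threshold=0.6,
                    alpha_sharp=2.0, alpha_smooth=0.5):
    """Re-scale image-token attention by an exponent chosen from decoding confidence.

    alpha > 1 sharpens the distribution (concentrates mass on top regions);
    alpha < 1 smooths it (spreads mass toward a wider context).
    All hyperparameter values here are illustrative, not from the paper.
    """
    alpha = alpha_sharp if confidence >= threshold else alpha_smooth
    scaled = np.asarray(attn_weights, dtype=float) ** alpha
    return scaled / scaled.sum()  # renormalize to a valid distribution

# Toy attention over three image regions
attn = np.array([0.5, 0.3, 0.2])
sharp = adapt_attention(attn, confidence=0.9)   # high confidence: sharpen
smooth = adapt_attention(attn, confidence=0.2)  # low confidence: broaden
```

High confidence concentrates attention on the already-dominant region, while low confidence flattens the distribution toward uniform, letting decoding consider a wider visual context.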