🤖 AI Summary
This work addresses the limited performance of large vision-language models (VLMs) on spatial reasoning tasks and the unclear role their internal attention mechanisms play in spatial understanding. The authors introduce CogVSR, a dataset that decomposes complex spatial reasoning into chained subproblems, and combine probing analyses with intervention experiments to systematically identify, for the first time, sparsely distributed yet critical spatially specialized attention heads within VLMs. They demonstrate that activating these heads significantly improves spatial reasoning accuracy, whereas ablating them causes marked performance degradation, revealing their essential role in spatial perception and relational reasoning. These findings offer a novel pathway toward enhancing the spatial understanding capabilities of multimodal models.
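To make the probing step concrete, here is a minimal sketch of what identifying a functional head could look like, assuming per-head activations have already been extracted while the VLM answers CogVSR subquestions. Everything below is illustrative: `head_feats` and `labels` are synthetic stand-ins, and the linear-probe setup is an assumption rather than the authors' implementation.

```python
# Minimal sketch of head-level probing (synthetic data, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_items, d_head = 512, 128                 # subquestions x head output dim
head_feats = rng.normal(size=(n_items, d_head))   # stand-in head activations
labels = rng.integers(0, 2, size=n_items)         # 1 = spatial item, 0 = other

X_tr, X_te, y_tr, y_te = train_test_split(head_feats, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# A head whose probe scores well above chance is a candidate
# "functional head" for the probed cognitive function.
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```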
📝 Abstract
Despite remarkable advances in large Vision-Language Models (VLMs), spatial reasoning remains a persistent challenge. In this work, we investigate how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through a mechanistic interpretability lens. We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions designed to simulate human-like reasoning via a chain-of-thought paradigm, with each subquestion linked to a specific cognitive function such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize attention heads specialized for these functions. Our analysis across diverse VLM families reveals that these functional heads are universally sparse and vary in number and distribution across functions. Notably, heads specialized for spatial functions are markedly fewer than those for other cognitive functions. We propose methods to activate these latent spatial heads, improving spatial understanding. Intervention experiments further demonstrate their critical role in spatial reasoning: removing functional heads degrades performance, while emphasizing them enhances accuracy. This study provides new interpretability-driven insights into how VLMs attend to space and paves the way for enhancing complex spatial reasoning in multimodal models.
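As a rough illustration of the intervention experiments, the toy attention layer below rescales selected heads' outputs before the output projection: a scale of 0 ablates a head, and a scale above 1 emphasizes it. The head indices, scale values, and module are hypothetical placeholders, not the paper's actual procedure or model.

```python
# Minimal sketch: ablate or emphasize chosen attention heads in a toy layer.
# In a real VLM the same per-head scaling would be applied inside each
# targeted attention layer; `[2, 5]` are hypothetical spatial-head indices.
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, head_scale: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, seq, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        heads = attn @ v                                  # per-head outputs
        # Intervention: rescale each head's contribution before mixing.
        heads = heads * head_scale.view(1, self.n_heads, 1, 1)
        return self.out(heads.transpose(1, 2).reshape(b, t, d))

n_heads, d_model = 8, 128
layer = ToyAttention(d_model, n_heads)
scale = torch.ones(n_heads)
scale[[2, 5]] = 2.0   # emphasize hypothetical spatial heads; 0.0 ablates them
y = layer(torch.randn(1, 10, d_model), scale)
print(y.shape)        # torch.Size([1, 10, 128])
```

Under this setup, comparing task accuracy with `scale = 1` everywhere against scaled runs is one simple way to quantify how much the selected heads matter.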