How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work presents the first investigation into whether large language models (LLMs) and vision-language models (VLMs) can comprehend viewpoint rotation from linguistic input alone, without any visual cues. Focusing on the Viewpoint Rotation Understanding (VRU) task, the study evaluates models' ability to reason about multi-step viewpoint transformations described in text and to infer the final perspective along with its corresponding observation. Through layer-wise probing and causal interventions at the attention-head level, the authors find that while models encode viewpoint information, they struggle to bind viewpoints to the corresponding observations. Building on this insight, they selectively fine-tune the key attention heads, which improves VRU performance while avoiding catastrophic forgetting of general capabilities. The results highlight both the limitations of current models and a promising pathway for enhancing spatial reasoning in language models.
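
To make the task concrete, here is a minimal sketch of what a VRU-style instance might look like and how its ground-truth answer could be computed. Everything in it is an assumption for illustration: the four compass directions, rotations restricted to multiples of 90 degrees, the `final_view` helper, and the toy scene are hypothetical and are not taken from the paper's dataset.

```python
from typing import Dict, List, Tuple

# Assumption: viewpoints are four compass directions listed in clockwise order.
DIRECTIONS = ["north", "east", "south", "west"]


def final_view(
    start: str,
    rotations: List[Tuple[str, int]],
    scene: Dict[str, str],
) -> Tuple[str, str]:
    """Apply a sequence of (side, degrees) turns and return (facing, observation)."""
    idx = DIRECTIONS.index(start)
    for side, degrees in rotations:
        if degrees % 90 != 0:
            raise ValueError("this sketch only handles multiples of 90 degrees")
        steps = degrees // 90
        if side == "right":      # clockwise turn
            idx = (idx + steps) % 4
        elif side == "left":     # counter-clockwise turn
            idx = (idx - steps) % 4
        else:
            raise ValueError(f"unknown turn side: {side!r}")
    facing = DIRECTIONS[idx]
    return facing, scene[facing]


# Hypothetical scene: what is visible when facing each direction.
scene = {"north": "door", "east": "window", "south": "bookshelf", "west": "lamp"}

# Start facing north, turn right 90, right 180, then left 90.
result = final_view("north", [("right", 90), ("right", 180), ("left", 90)], scene)
print(result)
```

A model solving the text-only version has to do both halves of this computation: track the accumulated rotation, then look up which observation was bound to the resulting viewpoint.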

📝 Abstract
Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence in the absence of visual information, and how models perform relevant tasks with text-only inputs, remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment, given textual descriptions of viewpoint rotations and observations over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while humans can easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position to the corresponding observation, resulting in hallucination in the final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities. Our dataset and code will be released at https://github.com/Young-Zhen/VRU_Interpret.
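
The abstract does not say how selective fine-tuning is implemented. One common way to update only chosen attention heads is to mask the gradient of a projection matrix so that rows belonging to other heads receive no update. The sketch below illustrates that idea on a toy matrix in plain Python; the row-per-head layout, the helper names `mask_head_gradients` and `sgd_step`, and all numbers are assumptions, not the authors' code.

```python
from typing import List, Set

Matrix = List[List[float]]


def mask_head_gradients(grad: Matrix, head_dim: int, selected_heads: Set[int]) -> Matrix:
    """Zero the gradient rows of every head that is not selected.

    Assumption: row r of the projection matrix belongs to head r // head_dim.
    """
    return [
        list(row) if (r // head_dim) in selected_heads else [0.0] * len(row)
        for r, row in enumerate(grad)
    ]


def sgd_step(weight: Matrix, grad: Matrix, lr: float) -> Matrix:
    """Plain gradient-descent update, returned as a new matrix."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weight, grad)
    ]


# Toy projection: 3 heads, head_dim = 2, model width = 2.
weight = [[1.0, 1.0] for _ in range(6)]
grad = [[0.5, 0.5] for _ in range(6)]

# Pretend causal intervention flagged head 1 as important (hypothetical choice).
masked = mask_head_gradients(grad, head_dim=2, selected_heads={1})
updated = sgd_step(weight, masked, lr=0.1)

for r, row in enumerate(updated):
    print(f"head {r // 2}, row {r}: {row}")
```

Because the unselected heads never change, whatever they contribute to general capabilities is left intact, which is the intuition behind avoiding catastrophic forgetting.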
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence
viewpoint rotation understanding
language models
vision-language models
text-only input
Innovation

Methods, ideas, or system contributions that make the work stand out.

viewpoint rotation understanding
causal intervention
selective fine-tuning
spatial intelligence
interpretability
Zhen Yang
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Ping Jian
Beijing Institute of Technology
Natural language processing · Machine learning
Zhongbin Guo
Beijing Institute of Technology
Multimodal LLM
Zuming Zhang
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Chengzhi Li
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Yonghong Deng
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Xinyue Zhang
Southwest University of Science and Technology
Machine Learning · Multi-view clustering
Wenpeng Lu
Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China