Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the difficulty that multimodal large language models (MLLMs) have with viewpoint-dependent spatial reasoning in 360-degree omnidirectional images, where accuracy degrades sharply once the viewpoint changes. It frames this capability as Perspective-Conditioned Spatial Reasoning (PCSR) and introduces PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs across eight tasks. An evaluation of 14 representative MLLMs reveals a substantial gap between perception and reasoning: accuracy reaches 57.59% on foundational relative direction but falls to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. In an RL-based diagnostic study, reward shaping raises a matched 7B baseline from 31.10% to 60.06% accuracy under a controlled setting, suggesting that the ability is partially learnable, although the gains are task-selective and sensitive to reward design.
📝 Abstract
Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partially plastic rather than fully immutable. Still, the gains are task-selective, sensitive to reward design (both weight allocation and reward formulation), and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.
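To make the egocentric-rotation task concrete, the following is a minimal sketch of how a relative-direction label might change when the observer turns. It is not taken from PCSR-Bench: the function names, the equirectangular convention (image center column is straight ahead, azimuth increases to the right), and the four 90-degree direction bins are all assumptions made for illustration.

```python
def column_to_azimuth(x: float, width: int) -> float:
    """Map a pixel column of an equirectangular panorama to an azimuth in degrees.

    Assumed convention (hypothetical, not from the paper): the center column is
    the observer's forward direction (0 degrees) and azimuth grows to the right,
    giving values in [-180, 180).
    """
    return (x / width) * 360.0 - 180.0


def wrap_degrees(angle: float) -> float:
    """Wrap any angle into the half-open interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_direction(azimuth_deg: float, ego_rotation_deg: float = 0.0) -> str:
    """Label an object's direction after the observer turns clockwise.

    The 90-degree-wide bins are an illustrative choice, not the benchmark's
    answer format.
    """
    rel = wrap_degrees(azimuth_deg - ego_rotation_deg)
    if -45.0 <= rel < 45.0:
        return "front"
    if 45.0 <= rel < 135.0:
        return "right"
    if -135.0 <= rel < -45.0:
        return "left"
    return "back"


# An object three quarters of the way across a 2048-pixel-wide panorama.
azimuth = column_to_azimuth(1536, 2048)
label_before = relative_direction(azimuth)        # observer in original pose
label_after = relative_direction(azimuth, 90.0)   # observer turned 90 degrees right
```

Under these assumptions, the same object is labeled differently before and after the turn, even though the image itself is unchanged. That is the kind of viewpoint-dependent inference the abstract distinguishes from foundational relative direction.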
Problem

Research questions and friction points this paper is trying to address.

Perspective-Conditioned Spatial Reasoning
Multimodal Large Language Models
Omnidirectional Images
Viewpoint-Dependent Inference
Spatial Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perspective-Conditioned Spatial Reasoning
Omnidirectional Images
PCSR-Bench
Multimodal Large Language Models
Reinforcement Learning Diagnosis