🤖 AI Summary
Current vision-language models (VLMs) exhibit limited capability in spatial visual question answering (Spatial VQA), particularly in reasoning about relative object positions, distances, and configurations. To address this, we propose SVQA-R1, the first R1-style reinforcement learning framework tailored for Spatial VQA, and introduce Spatial-GRPO, a novel group-wise reinforcement learning strategy. Spatial-GRPO constructs view-consistent rewards by perturbing the spatial relations between objects (e.g., mirror flipping), which yields interpretable spatial reasoning even without supervised fine-tuning (SFT) data. Crucially, it eliminates reliance on human annotations, instead leveraging rule-based reward signals to guide reasoning path generation across the original and perturbed views. Our method achieves state-of-the-art accuracy across multiple Spatial VQA benchmarks, with significant improvements over prior approaches. Moreover, the resulting reasoning process is highly interpretable and exhibits strong generalization to unseen spatial configurations and question types.
📝 Abstract
Spatial reasoning remains a critical yet underdeveloped capability in existing vision-language models (VLMs), especially for Spatial Visual Question Answering (Spatial VQA) tasks that require understanding relative positions, distances, and object configurations. Inspired by the R1 paradigm introduced in DeepSeek-R1, which enhances reasoning in language models through rule-based reinforcement learning (RL), we propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects, e.g., mirror flipping, thereby encouraging the model to develop a consistent and grounded understanding of space. Our model, SVQA-R1, not only achieves dramatically improved accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths even without using supervised fine-tuning (SFT) data. Extensive experiments and visualization demonstrate the effectiveness of SVQA-R1 across multiple spatial reasoning benchmarks.
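To make the idea of a view-consistent, rule-based reward more concrete, here is a minimal sketch in Python. It is not the authors' implementation. The function names, the left/right answer vocabulary, and the 0.5/0.5 reward split are all assumptions made for illustration; the abstract only states that rewards are built by perturbing spatial relations (e.g., mirror flipping) and that the RL strategy is group-wise. The advantage step uses the usual group-relative normalization (reward minus group mean, divided by group standard deviation), which is assumed here rather than confirmed by the text.

```python
from statistics import mean, pstdev

# Hypothetical answer vocabulary: how an answer should change when the
# image is mirrored horizontally. Answers not listed stay unchanged.
MIRROR_SWAP = {"left": "right", "right": "left"}


def mirror_answer(answer: str) -> str:
    """Return the answer expected after a horizontal mirror flip."""
    return MIRROR_SWAP.get(answer, answer)


def view_consistency_reward(pred_original: str, pred_mirrored: str, gold: str) -> float:
    """Illustrative rule-based reward (assumed form, not from the paper).

    0.5 for answering the original view correctly, plus 0.5 for answering
    the mirrored view with the correspondingly flipped gold answer.
    """
    r_original = 0.5 if pred_original == gold else 0.0
    r_mirrored = 0.5 if pred_mirrored == mirror_answer(gold) else 0.0
    return r_original + r_mirrored


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled responses for a question whose gold answer is "left".
# Each pair is (prediction on original image, prediction on mirrored image).
group = [("left", "right"), ("left", "left"), ("right", "left")]
rewards = [view_consistency_reward(p, m, "left") for p, m in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, a response that is correct on both views receives the highest advantage within its group, while one that is wrong on both receives the lowest, so the policy update would favor answers that stay consistent under the perturbation.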