SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) have limited capability in spatial visual question answering (Spatial VQA), particularly in reasoning about relative object positions, distances, and configurations. To address this, we propose SVQA-R1, the first R1-style reinforcement learning framework tailored to Spatial VQA, and introduce Spatial-GRPO, a novel group-wise reinforcement learning strategy. Spatial-GRPO constructs view-consistent rewards by perturbing the spatial relations between objects (e.g., mirror flipping), encouraging a consistent and grounded understanding of space. Crucially, it does not rely on supervised fine-tuning (SFT) data; rule-based reward signals guide the generation of reasoning paths across views. Our method substantially improves accuracy across multiple Spatial VQA benchmarks. Moreover, the resulting reasoning process is interpretable and is reported to generalize to unseen spatial configurations and question types.

📝 Abstract

Spatial reasoning remains a critical yet underdeveloped capability in existing vision-language models (VLMs), especially for Spatial Visual Question Answering (Spatial VQA) tasks that require understanding relative positions, distances, and object configurations. Inspired by the R1 paradigm introduced in DeepSeek-R1, which enhances reasoning in language models through rule-based reinforcement learning (RL), we propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects, e.g., by mirror flipping, thereby encouraging the model to develop a consistent and grounded understanding of space. Our model, SVQA-R1, not only achieves dramatically improved accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths, even without using supervised fine-tuning (SFT) data. Extensive experiments and visualizations demonstrate the effectiveness of SVQA-R1 across multiple spatial reasoning benchmarks.
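
The abstract does not give the reward formula, so the following is only a minimal sketch of what a view-consistent, rule-based reward under mirror flipping could look like. Everything in it is an assumption made for illustration: the function names (`mirror_answer`, `hflip`, `view_consistency_reward`), the left/right swap table, and the equal 0.5/0.5 weighting are hypothetical and are not taken from the paper.

```python
import re

# Assumption: under a horizontal mirror flip, only "left" and "right" swap.
_MIRROR_SWAP = {"left": "right", "right": "left"}


def mirror_answer(answer: str) -> str:
    """Return the answer that would be expected after a horizontal flip."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        swapped = _MIRROR_SWAP[word.lower()]
        # Preserve a leading capital letter if the original word had one.
        return swapped.capitalize() if word[0].isupper() else swapped

    return re.sub(r"\b(left|right)\b", swap, answer, flags=re.IGNORECASE)


def hflip(image: list) -> list:
    """Horizontally flip a row-major image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def _norm(text: str) -> str:
    return text.strip().lower()


def view_consistency_reward(pred_original: str, pred_flipped: str, gold: str) -> float:
    """Toy rule-based reward (hypothetical weighting).

    Half of the reward is for answering the original view correctly; the
    other half is for giving the mirrored answer on the flipped view.
    """
    correct = float(_norm(pred_original) == _norm(gold))
    consistent = float(_norm(pred_flipped) == _norm(mirror_answer(pred_original)))
    return 0.5 * correct + 0.5 * consistent
```

In this sketch, a model that answers "left" on the original image and "right" on the flipped image receives the full reward, while one that answers "left" on both views is rewarded only for correctness on the original view.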

Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial reasoning in vision-language models for Spatial VQA tasks

Developing view-consistent rewards via Spatial-GRPO for grounded spatial understanding

Improving accuracy and interpretability in spatial reasoning without supervised fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends R1-style training to spatial VQA

Introduces Spatial-GRPO for view-consistent rewards

Uses mirror flipping to perturb spatial relations
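
To make the "group-wise" part concrete, here is a hedged sketch of the group-normalized advantage used by standard GRPO, where each sampled response is scored relative to the other responses drawn for the same prompt. This page does not say whether Spatial-GRPO uses exactly this normalization, so treat it as an assumption; the function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize rewards within one group of sampled responses.

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: rewards for four responses sampled for the same question.
advantages = group_relative_advantages([1.0, 0.5, 0.5, 0.0])
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separately learned value model is needed to provide a baseline.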
