Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) underperform on fine-grained spatial reasoning tasks requiring multi-step logical deduction and precise spatial alignment. To address this, we propose SpatialReasoner-R1, a framework introducing three key innovations: (1) Multi-Model Monte Carlo Tree Search (M3CTS), an MCTS variant that leverages multiple specialized models to generate long chain-of-thought (LongCoT) reasoning trajectories; (2) fine-grained Direct Preference Optimization (fDPO), which applies segment-level preference granularity to descriptive grounding and logical reasoning; and (3) a learnable spatial reward mechanism that evaluates candidate responses for visual consistency, spatial grounding, and logical coherence. Experiments show that fDPO improves over standard DPO by an average of 4.1% on spatial quality tasks and 9.0% on spatial quantity tasks. SpatialReasoner-R1 sets a new state of the art on SPATIALRGPT-Bench, outperforming the strongest prior method by 9.8% in average accuracy while preserving general-purpose VLM capabilities.

📝 Abstract
Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.
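The preference objective described in the abstract can be roughly formalized as follows (notation ours, sketched from the standard DPO loss; the paper's exact segment weighting may differ):

```latex
% Standard DPO over a preferred response y_w and a rejected response y_l:
\mathcal{L}_{\mathrm{DPO}}
  = -\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)

% Segment-level variant in the spirit of fDPO: split each response into
% segments s (e.g. descriptive grounding vs. logical reasoning) and weight
% each segment's preference margin by a spatial reward w_s:
\mathcal{L}_{\mathrm{fDPO}}
  = -\sum_{s} w_s \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w^{s} \mid x)}{\pi_{\mathrm{ref}}(y_w^{s} \mid x)}
    - \beta \log \frac{\pi_\theta(y_l^{s} \mid x)}{\pi_{\mathrm{ref}}(y_l^{s} \mid x)}
    \right)
```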
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with fine-grained spatial reasoning
Multi-step logic and precise spatial alignment remain challenging
Responses often lack visual consistency and logical coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Model Monte Carlo Tree Search (M3CTS) for generating LongCoT reasoning trajectories
Fine-grained Direct Preference Optimization (fDPO) with segment-level preference granularity
Spatial reward mechanism evaluating visual consistency, spatial grounding, and logical coherence
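The segment-level preference idea behind fDPO can be sketched in a few lines (hypothetical function and variable names; a toy illustration under our own assumptions, not the authors' implementation):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_margin(logp_policy: float, logp_ref: float) -> float:
    """Log-ratio of policy to reference model for one response segment."""
    return logp_policy - logp_ref

def fdpo_loss(chosen_segments, rejected_segments, beta=0.1, weights=None) -> float:
    """Sketch of a segment-level DPO loss.

    chosen_segments / rejected_segments: lists of (logp_policy, logp_ref)
    tuples, one per reasoning segment (e.g. a descriptive-grounding
    paragraph vs. a logical-reasoning paragraph). `weights` optionally
    emphasizes some segment types, loosely mirroring a spatial reward.
    """
    n = len(chosen_segments)
    if weights is None:
        weights = [1.0] * n
    loss = 0.0
    for w, (cw, cr), (rw, rr) in zip(weights, chosen_segments, rejected_segments):
        # Preference margin: how much more the policy (vs. reference)
        # prefers the chosen segment over the rejected one.
        margin = dpo_margin(cw, cr) - dpo_margin(rw, rr)
        loss += -w * math.log(sigmoid(beta * margin))
    return loss / sum(weights)
```

When policy and reference assign equal log-probabilities, the margin is zero and the loss reduces to log 2, the standard DPO starting point; a larger margin on the chosen segments lowers the loss.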
Authors

Yifan Shen, University of Illinois Urbana-Champaign
Yuanzhe Liu, Rensselaer Polytechnic Institute
Jingyuan Zhu, University of Pennsylvania
Xu Cao, University of Illinois Urbana-Champaign
Xiaofeng Zhang, Shanghai Jiao Tong University
Yixiao He, University of Illinois Urbana-Champaign
Wenming Ye, Google
James Matthew Rehg, University of Illinois Urbana-Champaign
Ismini Lourentzou, University of Illinois Urbana-Champaign