AI Summary
Existing training paradigms for vision-language models (VLMs) often place no explicit constraints on the reasoning process in spatial reasoning tasks, so models rely too little on visual evidence and follow unstable reasoning trajectories. This work proposes ProSR, a process-shaping optimization framework that extends the optimization objective beyond answer correctness to shape the reasoning process itself. ProSR strengthens visual grounding and stabilizes reasoning paths through two penalty terms: a Counterfactual Invariance Penalty and a Tail Drift Penalty. Integrated with chain-of-thought fine-tuning and reinforcement learning, ProSR achieves substantial accuracy gains on multiple challenging and out-of-distribution spatial reasoning benchmarks while producing more reliable and interpretable reasoning traces.
Abstract
Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation. They lack explicit constraints on the reasoning process and therefore struggle to ensure genuine visual dependence and stable reasoning trajectories. In this paper, we construct a high-quality chain-of-thought (CoT) dataset covering diverse spatial phenomena and diagnose the model's reasoning process, revealing two typical types of process degradation during reinforcement learning optimization: Spurious Grounding, in which the reasoning bypasses visual evidence, and Tail Instability, in which uncertainty rises abnormally in the later stages of reasoning. To address these issues, we propose ProSR, a process-shaping optimization framework for spatial reasoning. Through a Counterfactual Invariance Penalty and a Tail Drift Penalty, ProSR extends the optimization objective from answer correctness alone to two process-level dimensions: visual dependence and trajectory stability. Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while generating reasoning trajectories that are more stable and more dependent on visual evidence.
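The abstract names the two penalties but does not give their formulas, so the following is only an illustrative sketch of how such a process-shaped reward might be assembled, not the ProSR objective itself. It assumes that uncertainty is measured as per-token Shannon entropy, that Tail Instability can be approximated by the mean entropy of the last steps exceeding the mean of the earlier steps, and that Spurious Grounding can be approximated by the answer distribution staying nearly unchanged when the image is replaced by a counterfactual one (measured here with total variation distance). All function names and the parameters `tail_frac`, `margin`, `lam_ci`, and `lam_td` are hypothetical.

```python
import math
from typing import Sequence


def entropy(dist: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def tail_drift_penalty(entropies: Sequence[float], tail_frac: float = 0.25) -> float:
    """Hypothetical tail penalty: how much the mean entropy of the final
    steps exceeds the mean entropy of the earlier steps (never negative)."""
    n = len(entropies)
    if n < 2:
        return 0.0
    # Size of the tail window: at least 1 step, at most n - 1 steps.
    k = min(n - 1, max(1, int(round(n * tail_frac))))
    head, tail = entropies[:-k], entropies[-k:]
    drift = sum(tail) / len(tail) - sum(head) / len(head)
    return max(0.0, drift)


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def counterfactual_invariance_penalty(
    p_orig: Sequence[float], p_cf: Sequence[float], margin: float = 0.5
) -> float:
    """Hypothetical hinge penalty: large when the answer distribution barely
    changes after the image is swapped for a counterfactual one, which
    suggests the answer does not depend on the visual evidence."""
    return max(0.0, margin - total_variation(p_orig, p_cf))


def shaped_reward(
    correct: bool,
    p_orig: Sequence[float],
    p_cf: Sequence[float],
    entropies: Sequence[float],
    lam_ci: float = 0.5,
    lam_td: float = 0.5,
) -> float:
    """Answer-correctness reward minus the two weighted process penalties."""
    return (
        float(correct)
        - lam_ci * counterfactual_invariance_penalty(p_orig, p_cf)
        - lam_td * tail_drift_penalty(entropies)
    )


if __name__ == "__main__":
    # Per-step next-token distributions for a toy 4-step reasoning trace;
    # the final step is much more uncertain than the earlier ones.
    trace = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
    step_entropies = [entropy(d) for d in trace]
    # The answer distribution is identical with and without the real image.
    print(shaped_reward(True, [0.9, 0.1], [0.9, 0.1], step_entropies))
```

In this toy setup a correct answer still receives a reduced reward when the answer ignores the image or when uncertainty climbs at the end of the trace, which is the general idea of moving the objective from answer correctness alone to visual dependence and trajectory stability.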