🤖 AI Summary
Vision-language models (VLMs) exhibit limited spatial reasoning capabilities, particularly for geometric relations, object pose, and relative positioning, which hinders their reliability in autonomous driving. Method: We introduce SURDS, a large-scale spatial understanding benchmark tailored to real-world driving environments, covering six fine-grained tasks: orientation, depth estimation, pixel-level localization, pairwise distance, lateral ordering, and front-behind relations. Built on nuScenes, it comprises 41,080 vision-question-answer training instances and 9,250 evaluation samples. We align a VLM with GRPO-based reinforcement learning guided by two spatially grounded signals, perception-level location accuracy and logical consistency, together with rewards for final-answer correctness and output-format compliance. Contribution/Results: We publicly release the SURDS benchmark, evaluation toolkit, and GRPO alignment code. On SURDS, the GRPO-aligned model scores 40.80 overall, outperforming GPT-4o (13.30) and Gemini-2.0-flash (35.71), which indicates that RL-based alignment can significantly and consistently enhance VLMs’ spatial reasoning in driving contexts.
📝 Abstract
Accurate spatial reasoning in outdoor environments, covering geometry, object pose, and inter-object relationships, is fundamental to downstream tasks such as mapping, motion forecasting, and high-level planning in autonomous driving. We introduce SURDS, a large-scale benchmark designed to systematically evaluate the spatial reasoning capabilities of vision-language models (VLMs). Built on the nuScenes dataset, SURDS comprises 41,080 vision-question-answer training instances and 9,250 evaluation samples, spanning six spatial categories: orientation, depth estimation, pixel-level localization, pairwise distance, lateral ordering, and front-behind relations. We benchmark leading general-purpose VLMs, including GPT, Gemini, and Qwen, revealing persistent limitations in fine-grained spatial understanding. To address these deficiencies, we go beyond static evaluation and explore whether alignment techniques can improve spatial reasoning performance. Specifically, we propose a reinforcement learning-based alignment scheme that leverages spatially grounded reward signals capturing both perception-level accuracy (location) and reasoning consistency (logic). We further incorporate final-answer correctness and output-format rewards to guide fine-grained policy adaptation. Our GRPO-aligned variant achieves an overall score of 40.80 on the SURDS benchmark, outperforming proprietary systems such as GPT-4o (13.30) and Gemini-2.0-flash (35.71). To the best of our knowledge, this is the first study to demonstrate that reinforcement learning-based alignment can significantly and consistently enhance the spatial reasoning capabilities of VLMs in real-world driving contexts. We release the SURDS benchmark, evaluation toolkit, and GRPO alignment code at https://github.com/XiandaGuo/Drive-MLLM.
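To make the reward design concrete, the following is a minimal Python sketch of how the four reward terms named in the abstract (location, logic, answer correctness, output format) might be combined into one scalar and then turned into group-relative advantages, as GRPO does across a group of responses sampled for the same prompt. The abstract does not say how the terms are weighted or scored, so the `RewardParts` type, the `WEIGHTS` values, the weighted-sum combination, and both function names are illustrative assumptions and are not taken from the released code.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RewardParts:
    """Per-response reward terms (hypothetical container, not the authors' API)."""
    location: float  # perception-level accuracy, assumed to lie in [0, 1]
    logic: float     # reasoning consistency, assumed to lie in [0, 1]
    answer: float    # final-answer correctness, assumed 0 or 1
    fmt: float       # output-format compliance, assumed 0 or 1


# Assumed weights: the abstract does not state how the terms are combined.
WEIGHTS = RewardParts(location=1.0, logic=1.0, answer=1.0, fmt=0.5)


def total_reward(parts: RewardParts, weights: RewardParts = WEIGHTS) -> float:
    """Weighted sum of the four reward terms (an assumed combination rule)."""
    return (
        weights.location * parts.location
        + weights.logic * parts.logic
        + weights.answer * parts.answer
        + weights.fmt * parts.fmt
    )


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its sampling group.

    Population standard deviation is used here as an assumption; eps guards
    against division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for one driving question.
group = [
    RewardParts(location=1.0, logic=1.0, answer=1.0, fmt=1.0),
    RewardParts(location=0.5, logic=0.0, answer=0.0, fmt=1.0),
    RewardParts(location=0.5, logic=1.0, answer=0.0, fmt=1.0),
    RewardParts(location=0.0, logic=0.0, answer=0.0, fmt=1.0),
]
rewards = [total_reward(p) for p in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that score above their group's mean receive a positive advantage and the rest a negative one, so no separate learned value model is needed to rank the samples.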