🤖 AI Summary
Existing vision-language model benchmarks are largely confined to single-hop spatial relations, making them inadequate for evaluating multi-hop compositional spatial reasoning and precise visual grounding. To address this gap, this work proposes the first comprehensive evaluation benchmark supporting complex spatial queries spanning one to three hops. It introduces a novel metric, Acc@50IoU, which jointly assesses answer-selection accuracy and bounding-box localization precision, and releases MultihopSpatial-Train, a large-scale training corpus. Evaluation of 37 state-of-the-art models on the benchmark reveals significant deficiencies in compositional spatial reasoning. Furthermore, reinforcement-learning-based post-training is shown to effectively enhance both spatial reasoning and downstream embodied task performance.
📝 Abstract
Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when they are deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, which offers three key contributions: (1) a comprehensive benchmark for multi-hop, compositional spatial reasoning, featuring complex 1- to 3-hop queries across diverse spatial perspectives; (2) Acc@50IoU, a complementary metric that evaluates reasoning and visual grounding jointly by requiring both correct answer selection and precise bounding box prediction, capabilities vital for robust VLA deployment; (3) MultihopSpatial-Train, a dedicated large-scale training corpus for fostering spatial intelligence. An extensive evaluation of 37 state-of-the-art VLMs yields eight key insights and reveals that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement-learning post-training on our corpus improves both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
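The abstract states that Acc@50IoU requires both answer selection and a precise bounding box, but does not spell out the computation. Below is a minimal sketch assuming the usual reading: an example counts as correct only when the chosen answer matches the ground truth *and* the predicted box reaches IoU ≥ 0.5 against the ground-truth box. The `Prediction`/`GroundTruth` containers and function names are illustrative, not from the paper.

```python
# Hypothetical sketch of an Acc@50IoU-style metric (not the paper's code).
# Assumption: an example scores only if the answer is correct AND the
# predicted bounding box overlaps the ground truth with IoU >= 0.5.

from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Prediction:
    answer: str  # selected answer option
    box: Box     # predicted grounding box


@dataclass
class GroundTruth:
    answer: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_50iou(preds: List[Prediction], gts: List[GroundTruth]) -> float:
    """Fraction of examples with a correct answer AND IoU >= 0.5 grounding."""
    hits = sum(
        1
        for p, g in zip(preds, gts)
        if p.answer == g.answer and iou(p.box, g.box) >= 0.5
    )
    return hits / len(gts) if gts else 0.0


# Example: one correct answer with tight grounding, one with a stray box.
preds = [Prediction("mug", (10, 10, 50, 50)), Prediction("mug", (0, 0, 5, 5))]
gts = [GroundTruth("mug", (12, 12, 48, 52)), GroundTruth("mug", (40, 40, 80, 80))]
print(acc_at_50iou(preds, gts))  # 0.5: second example fails the IoU gate
```

Coupling the two conditions means a model cannot score by guessing the right answer while grounding the wrong object, which is why the metric is stricter than answer accuracy alone.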