🤖 AI Summary
This work addresses the lack of effective evaluation of cross-view consistency, spatiotemporal coherence, and relational reasoning for vision-language models in dynamic driving scenarios. To this end, the authors construct DriveSpatial, a multi-source, multi-view benchmark spanning five autonomous driving datasets, 20 tasks, and 15.6K human-verified question-answer pairs. Its questions are generated from a dynamic multi-relational scene graph that explicitly models object states, spatial relationships, interactions, camera visibility, and temporal correspondences. Evaluating 15 representative models reveals a substantial gap: even the strongest model trails human performance by 28.4 points, with cognitive scene construction identified as the key bottleneck. Diagnostics further show that language-only prompting is insufficient, whereas explicit bird's-eye-view (BEV) grounding consistently improves performance.
📝 Abstract
Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.
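To make the scene-graph idea concrete, here is a minimal Python sketch of how a per-frame graph with persistent track IDs could yield a cross-view, cross-time question. It is not the DriveSpatial pipeline: the class names, the `camera_handover_qa` helper, the question template, and the camera names are all hypothetical, and the BEV convention (x forward, y left, in metres) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class ObjectState:
    """One object in one frame (hypothetical schema, not the paper's)."""
    track_id: str                 # persistent ID: the temporal correspondence
    category: str
    xy: Tuple[float, float]       # assumed BEV position: x forward, y left (m)
    visible_in: FrozenSet[str]    # names of cameras that see the object


@dataclass
class Frame:
    """All object states at one timestamp."""
    t: int
    objects: Dict[str, ObjectState] = field(default_factory=dict)

    def add(self, obj: ObjectState) -> None:
        self.objects[obj.track_id] = obj


def lateral_relation(a: ObjectState, b: ObjectState) -> str:
    """Spatial relation of a relative to b along the assumed BEV y axis."""
    if a.xy[1] > b.xy[1]:
        return "left_of"
    if a.xy[1] < b.xy[1]:
        return "right_of"
    return "aligned_with"


def camera_handover_qa(prev: Frame, curr: Frame, cam: str) -> List[Tuple[str, str]]:
    """Build QA pairs for objects seen in `cam` at `prev` whose set of
    observing cameras has changed by `curr`."""
    qa = []
    for track_id in sorted(prev.objects):
        before = prev.objects[track_id]
        after = curr.objects.get(track_id)
        if after is None or cam not in before.visible_in:
            continue  # object left the scene, or was never seen in `cam`
        if not after.visible_in or after.visible_in == before.visible_in:
            continue  # no handover between cameras to ask about
        question = (
            f"The {before.category} visible in {cam} at frame {prev.t}: "
            f"which camera(s) show it at frame {curr.t}?"
        )
        qa.append((question, ", ".join(sorted(after.visible_in))))
    return qa


# Toy scene: a car drifts from the front camera into the front-right camera,
# while a pedestrian stays in the front camera.
f0, f1 = Frame(t=0), Frame(t=1)
f0.add(ObjectState("car_1", "car", (10.0, -2.0), frozenset({"CAM_FRONT"})))
f0.add(ObjectState("ped_1", "pedestrian", (8.0, 3.0), frozenset({"CAM_FRONT"})))
f1.add(ObjectState("car_1", "car", (4.0, -3.0), frozenset({"CAM_FRONT_RIGHT"})))
f1.add(ObjectState("ped_1", "pedestrian", (8.0, 3.0), frozenset({"CAM_FRONT"})))

qa_pairs = camera_handover_qa(f0, f1, "CAM_FRONT")
relation = lateral_relation(f0.objects["ped_1"], f0.objects["car_1"])
```

In this toy scene only the car produces a question, because answering it requires linking the same object across two cameras and two frames. That is the kind of cross-view, spatiotemporal dependency the abstract says scene-graph-derived QA pairs are meant to enforce.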