DriveSpatial: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

📅 2026-05-21

🤖 AI Summary

This work addresses the lack of effective evaluation of cross-view consistency, spatiotemporal coherence, and relational reasoning for vision-language models in dynamic driving scenarios. The authors construct a benchmark spanning five autonomous driving datasets, 20 tasks, and 15.6K human-verified question-answer pairs, drawing on multi-source, multi-view data. Questions are generated from a dynamic multi-relational scene graph that explicitly models object states, spatial relations, interactions, camera visibility, and temporal correspondences. Evaluation of 15 representative models reveals a substantial gap: the strongest model trails human performance by 28.4 points, with cognitive scene construction identified as the key bottleneck. Diagnostics further show that language-only prompting is insufficient, while explicit bird's-eye-view (BEV) grounding consistently improves performance.

📝 Abstract

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.
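To make the scene-graph idea concrete, here is a minimal Python sketch of how a per-frame graph with object states, camera visibility, and track identities could yield a cross-view question. It is an illustration under stated assumptions, not the authors' pipeline: the class and function names (`ObjectState`, `Frame`, `cross_view_pairs`, `make_qa`, `temporal_matches`), the camera names, the coordinates, and the question template are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class ObjectState:
    """One node of the (hypothetical) scene graph at a single timestamp."""
    track_id: str                 # stable ID, used for temporal correspondence
    category: str
    x: float                      # metres ahead of the ego vehicle (BEV frame)
    y: float                      # metres to the left of the ego vehicle
    visible_in: FrozenSet[str]    # cameras in which the object is visible


@dataclass
class Frame:
    timestamp: float
    objects: Dict[str, ObjectState] = field(default_factory=dict)


def bev_relation(a: ObjectState, b: ObjectState) -> str:
    """Coarse longitudinal relation of `a` relative to `b` in the BEV frame."""
    if a.x > b.x:
        return "ahead_of"
    if a.x < b.x:
        return "behind"
    return "level_with"


def cross_view_pairs(frame: Frame) -> List[Tuple[ObjectState, ObjectState]]:
    """Object pairs that share no camera, so relating them needs several views."""
    objs = sorted(frame.objects.values(), key=lambda o: o.track_id)
    return [
        (a, b)
        for i, a in enumerate(objs)
        for b in objs[i + 1:]
        if not (a.visible_in & b.visible_in)
    ]


def make_qa(frame: Frame) -> List[Tuple[str, str]]:
    """Turn each cross-view pair into a (question, answer) tuple."""
    qa = []
    for a, b in cross_view_pairs(frame):
        cam_a, cam_b = sorted(a.visible_in)[0], sorted(b.visible_in)[0]
        question = (
            f"Is the {a.category} in {cam_a} ahead of or behind "
            f"the {b.category} in {cam_b}?"
        )
        qa.append((question, bev_relation(a, b)))
    return qa


def temporal_matches(prev: Frame, curr: Frame) -> List[str]:
    """Track IDs present in both frames (temporal correspondences)."""
    return sorted(prev.objects.keys() & curr.objects.keys())


# Illustrative data only; values are made up for the example.
car = ObjectState("car_1", "car", 12.0, 0.5, frozenset({"CAM_FRONT"}))
truck = ObjectState("truck_2", "truck", -8.0, -3.0, frozenset({"CAM_BACK"}))
ped = ObjectState("ped_3", "pedestrian", 5.0, 4.0,
                  frozenset({"CAM_FRONT", "CAM_FRONT_LEFT"}))

frame0 = Frame(0.0, {o.track_id: o for o in (car, truck, ped)})
frame1 = Frame(0.5, {o.track_id: o for o in (car, ped)})

qa_pairs = make_qa(frame0)
tracked = temporal_matches(frame0, frame1)
```

In this toy setup the car and the pedestrian share a camera and are skipped, while pairs involving the truck require combining front and rear views, which is the kind of constraint the abstract describes as enforcing cross-view reasoning.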

Problem

Research questions and friction points this paper is trying to address.

spatiotemporal intelligence

autonomous driving

vision-language models

scene construction

multi-view reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatiotemporal reasoning

vision-language models

scene graph