DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

📅 2026-05-21

🤖 AI Summary
This work addresses the lack of effective evaluation of cross-view consistency, spatiotemporal coherence, and relational reasoning for vision-language models (VLMs) in dynamic driving scenarios. The authors construct a benchmark spanning five autonomous driving datasets, 20 tasks, and 15.6K human-verified question-answer pairs. Questions are generated from a dynamic multi-relational scene graph that explicitly models object states, spatial relations, interactions, camera visibility, and temporal correspondences, so that answering them requires reasoning across views and across time. Evaluating 15 representative VLMs reveals a substantial gap to human performance: even the strongest model trails humans by 28.4 points, and cognitive scene construction is identified as the key bottleneck. Diagnostics further show that language-only prompting is insufficient, while explicit bird's-eye-view (BEV) grounding consistently improves performance.
📝 Abstract
Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.
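To make the scene-graph idea concrete, the following is a minimal Python sketch of how QA pairs could be derived from a graph that stores object states, camera visibility, and track identities across frames. It is an illustration under stated assumptions, not the authors' pipeline: the class names (ObjectState, SceneGraph), the function temporal_qa, the BEV coordinate convention, the camera names, and the question template are all hypothetical.

```python
from dataclasses import dataclass, field
import math


@dataclass(frozen=True)
class ObjectState:
    """One object at one frame, in ego-centred BEV coordinates (assumed convention)."""
    track_id: str          # identity shared across frames (temporal correspondence)
    category: str
    x: float               # metres forward of the ego vehicle
    y: float               # metres to the left of the ego vehicle
    visible_in: frozenset  # names of the cameras that see the object


@dataclass
class SceneGraph:
    """Hypothetical container mapping frame index -> {track_id: ObjectState}."""
    frames: dict = field(default_factory=dict)

    def add(self, frame: int, state: ObjectState) -> None:
        self.frames.setdefault(frame, {})[state.track_id] = state

    def relation(self, frame: int, a: str, b: str) -> str:
        """Coarse spatial relation of a relative to b along the ego's forward axis."""
        sa, sb = self.frames[frame][a], self.frames[frame][b]
        return "ahead of" if sa.x > sb.x else "behind"

    def cross_view_tracks(self, frame: int) -> list:
        """Objects seen by more than one camera, i.e. candidates for cross-view questions."""
        return sorted(t for t, s in self.frames[frame].items() if len(s.visible_in) > 1)

    def distance_trend(self, track_id: str, t0: int, t1: int) -> str:
        """Compare the object's distance to the ego vehicle at two frames."""
        s0, s1 = self.frames[t0][track_id], self.frames[t1][track_id]
        d0, d1 = math.hypot(s0.x, s0.y), math.hypot(s1.x, s1.y)
        if d1 < d0:
            return "approaching"
        if d1 > d0:
            return "receding"
        return "steady"


def temporal_qa(graph: SceneGraph, track_id: str, t0: int, t1: int) -> tuple:
    """Build one (question, answer) pair from the graph; the template is illustrative."""
    category = graph.frames[t0][track_id].category
    question = (
        f"Between frame {t0} and frame {t1}, is the {category} "
        "approaching or receding from the ego vehicle?"
    )
    return question, graph.distance_trend(track_id, t0, t1)


# Toy scene with placeholder camera names.
graph = SceneGraph()
graph.add(0, ObjectState("car_1", "car", 20.0, 0.0, frozenset({"CAM_FRONT"})))
graph.add(1, ObjectState("car_1", "car", 12.0, 3.0, frozenset({"CAM_FRONT", "CAM_FRONT_LEFT"})))
graph.add(1, ObjectState("ped_7", "pedestrian", 5.0, -2.0, frozenset({"CAM_FRONT_RIGHT"})))

question, answer = temporal_qa(graph, "car_1", 0, 1)
```

Because the answer is computed from the graph rather than from any single image, a question like this can only be answered correctly by linking the same object across frames, which is the kind of reasoning the benchmark is described as enforcing.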
Problem

Research questions and friction points this paper is trying to address.

spatiotemporal intelligence
autonomous driving
vision-language models
scene construction
multi-view reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatiotemporal reasoning
vision-language models
scene graph
autonomous driving benchmark
BEV grounding
Hao Vo
University of Arkansas, USA
Khoa Vo
Postdoctoral Fellow at the University of Arkansas, USA
Vision Language Model, Computer Vision, Deep Learning
Phu Loc Nguyen
University of Arkansas, USA
Sieu Tran
University of Arkansas, USA
Duc Minh Nguyen
University of Arkansas, USA
Ngo Xuan Cuong
University of Arkansas, USA
Gladys Gawugah
University of Arkansas, USA
Sreevenkata Anjani Tishita Godavarthi
University of Arkansas, USA
Chase Rainwater
University of Arkansas
logistics, optimization, security
Nghi D. Q. Bui
Unknown affiliation
AI4Code, Software Engineering, Code Agent, AI4SE, RL
Anh Nguyen
University of Liverpool
Robotic Vision, Machine Learning, Robotics
Duy Minh Ho Nguyen
Max Planck Research School for Intelligent Systems
Ngan Le
University of Arkansas
Artificial Intelligence, Machine Learning, Computer Vision