STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

📅 2025-06-06
🤖 AI Summary
Existing autonomous driving benchmarks primarily target single-view semantic understanding and lack systematic evaluation of vision-language models' (VLMs) spatio-temporal reasoning over multi-view video and LiDAR sequences. Method: We introduce STSBench, the first spatio-temporal scenario benchmark for autonomous driving, featuring a 3D-perception-guided framework for multi-view, multi-frame scenario mining with human-in-the-loop validation, enabling end-to-end assessment of ego-vehicle decision-making and multi-agent traffic interactions. Leveraging NuScenes ground-truth annotations, we automate scenario extraction, generate structured multiple-choice questions, and conduct expert verification to construct the STSnu sub-benchmark. Contribution/Results: STSnu comprises 43 scenario categories and 971 rigorously validated questions. Empirical analysis reveals critical deficiencies in current VLMs' modeling of traffic dynamics, establishing STSBench as a foundational benchmark and empirical basis for advancing spatio-temporal reasoning in autonomous driving.

📝 Abstract
We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the NuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.
Problem

Research questions and friction points this paper is trying to address.

Evaluates VLMs' spatio-temporal reasoning in autonomous driving scenarios
Assesses ego-vehicle actions and traffic participant interactions
Identifies gaps in models' understanding of traffic dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically mines traffic scenarios from datasets
Provides user interface for human verification
Generates multiple-choice questions for evaluation
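The mine-then-question pipeline above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the class `Track`, the functions `classify_scenario` and `make_mcq`, the 1.0 m thresholds, and the question wording are all assumptions standing in for STSBench's actual 3D-perception-based scenario mining.

```python
from dataclasses import dataclass
import random

# Hypothetical, simplified stand-in for a NuScenes-style ground-truth track:
# a list of (x, y) positions of one agent in the ego frame, one per keyframe.
@dataclass
class Track:
    agent_id: str
    positions: list  # [(x, y), ...]

def classify_scenario(track: Track) -> str:
    """Toy rule-based scenario mining: label a track by its net motion.

    Real mining would use 3D boxes, headings, and lane context; the
    1.0 m thresholds here are arbitrary illustrative values.
    """
    (x0, y0), (x1, y1) = track.positions[0], track.positions[-1]
    dx, dy = x1 - x0, y1 - y0
    if abs(dx) < 1.0 and abs(dy) < 1.0:
        return "stationary"
    if dx > 1.0:
        return "moving away from ego"
    return "approaching ego"

def make_mcq(track: Track, rng: random.Random) -> dict:
    """Turn a mined scenario into a structured multiple-choice question."""
    correct = classify_scenario(track)
    options = ["stationary", "moving away from ego", "approaching ego"]
    rng.shuffle(options)  # avoid positional answer bias
    return {
        "question": f"Over the observed frames, what is agent {track.agent_id} doing?",
        "options": options,
        "answer_index": options.index(correct),
    }

t = Track("car_42", [(10.0, 0.0), (14.0, 0.2), (18.0, 0.5)])
q = make_mcq(t, random.Random(0))
print(q["options"][q["answer_index"]])  # prints the mined label for this track
```

In the paper's actual framework, candidate scenarios mined this way are then shown to human annotators through a verification UI before the questions enter the benchmark.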
Christian Fruhwirth-Reisinger
Institute of Visual Computing, Graz University of Technology; Christian Doppler Laboratory for Embedded Machine Learning
Dušan Malić
Institute of Visual Computing, Graz University of Technology; Christian Doppler Laboratory for Embedded Machine Learning
Wei Lin
Institute for Machine Learning, Johannes Kepler University Linz
David Schinagl
Institute of Visual Computing, Graz University of Technology; Christian Doppler Laboratory for Embedded Machine Learning
Samuel Schulter
Amazon AGI
Computer Vision, Machine Learning
Horst Possegger
Senior Scientist, Graz University of Technology
Computer Vision, Machine Learning, Visual Perception, Pattern Recognition