ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) lack systematic evaluation in autonomous driving scenarios, particularly concerning safety, spatial reasoning, and cross-regional generalization. To address this gap, this work introduces ScenePilot-4K, a large-scale first-person driving dataset, and proposes ScenePilot-Bench, the first multidimensional benchmark framework tailored for safety-critical VLM assessment. The framework evaluates model capabilities along four axes—scene understanding, spatial perception, motion planning, and language generation—leveraging multi-granularity annotations (including risk levels, key agents, ego-vehicle trajectories, and camera parameters), safety-aware metrics, and GPT-Score. Comprehensive evaluations reveal the performance boundaries and critical shortcomings of current VLMs in driving-related reasoning, establishing a reliable foundation and clear direction for future research.

📝 Abstract
In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and language generation, with safety-aware metrics, GPT-Score, and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps in driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.
Problem

Research questions and friction points this paper is trying to address.

vision-language models
autonomous driving
benchmark
scene understanding
safety-critical evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
autonomous driving
large-scale benchmark
multi-granularity annotation
safety-aware evaluation
Yujin Wang
Ph.D. Student, Tongji University
Yutong Zheng
Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, Austin, TX, 78712, USA
Wenxian Fan
School of Transportation Science and Engineering, Beihang University, Beijing, 100191, China
Tianyi Wang
School of Vehicle and Mobility, Tsinghua University, Beijing, 100084, China
Hongqing Chu
College of Electronic and Information Engineering, Tongji University, Shanghai, 201804, China
Daxin Tian
Professor, Beihang University
vehicular networks
Bingzhao Gao
Professor, School of Automotive Studies, Tongji University
Jianqiang Wang
Associate Professor of Library and Information Studies, University at Buffalo
Information Retrieval, e-discovery
Hong Chen
Distinguished Professor, College of Electronic & Information Engineering, Tongji University
Model Predictive Control, Learning Control, Automotive Control, Automated Driving