ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) lack systematic evaluation in autonomous driving scenarios, particularly concerning safety, spatial reasoning, and cross-regional generalization. To address this gap, this work introduces ScenePilot-4K, a large-scale first-person driving dataset, and proposes ScenePilot-Bench, the first multidimensional benchmark framework tailored for safety-critical VLM assessment. The framework evaluates model capabilities along four axes—scene understanding, spatial perception, motion planning, and language generation—leveraging multi-granularity annotations (including risk levels, key agents, ego-vehicle trajectories, and camera parameters), safety-aware metrics, and GPT-Score. Comprehensive evaluations reveal the performance boundaries and critical shortcomings of current VLMs in driving-related reasoning, establishing a reliable foundation and clear direction for future research.

📝 Abstract
In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and language generation, with safety-aware metrics, GPT-Score, and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps in driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.
Problem

Research questions and friction points this paper is trying to address.

vision-language models
autonomous driving
benchmark
scene understanding
safety-critical evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
autonomous driving
large-scale benchmark
multi-granularity annotation
safety-aware evaluation
Yujin Wang
Ph.D. Student, Tongji University
Yutong Zheng
Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, Austin, TX, 78712, USA
Wenxian Fan
School of Transportation Science and Engineering, Beihang University, Beijing, 100191, China
Tianyi Wang
School of Vehicle and Mobility, Tsinghua University, Beijing, 100084, China
Hongqing Chu
College of Electronic and Information Engineering, Tongji University, Shanghai, 201804, China
Daxin Tian
Professor, Beihang University
vehicular networks
Bingzhao Gao
Professor, School of Automotive Studies, Tongji University
Jianqiang Wang
Associate Professor of Library and Information Studies, University at Buffalo
Information Retrieval, e-discovery
Hong Chen
Distinguished Professor, College of Electronic & Information Engineering, Tongji University
Model Predictive Control, Learning Control, Automotive Control, Automated Driving