🤖 AI Summary
This work addresses the scarcity of interaction-rich scenarios and the lack of precise multimodal alignment in existing autonomous driving datasets, both of which hinder the development of vision-language-action (VLA) models. To overcome these limitations, the authors introduce the Interaction-Enhanced Driving Dataset (IEDD), built with a scalable pipeline that mines millions of interaction-rich clips from naturalistic driving videos, and propose a trajectory-based method to quantitatively characterize the interactions. They further construct IEDD-VQA, a companion dataset of synthetic bird's-eye-view videos in which semantic actions are strictly aligned with structured language descriptions. The dataset supports both training and evaluation of VLA models, and a benchmark across ten prominent vision-language models demonstrates its reusability for model fine-tuning and for assessing reasoning capabilities.
📝 Abstract
The evolution of autonomous driving toward full automation demands robust interactive capabilities; however, the development of Vision-Language-Action (VLA) models is constrained by the sparsity of interactive scenarios and inadequate multimodal alignment in existing data. To this end, this paper proposes the Interactive Enhanced Driving Dataset (IEDD). We develop a scalable pipeline that mines millions of interactive segments from naturalistic driving data based on interactive trajectories, and design metrics to quantify the interaction process. Furthermore, the IEDD-VQA dataset is constructed by generating synthetic Bird's Eye View (BEV) videos in which semantic actions are strictly aligned with structured language. Benchmark results for ten mainstream Vision Language Models (VLMs) demonstrate the dataset's reuse value for assessing and fine-tuning the reasoning capabilities of autonomous driving models.
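The abstract does not specify the interaction metrics used in the mining pipeline. As a purely illustrative sketch (not the authors' method), a trajectory-based interaction check between an ego vehicle and another agent might flag a clip as interaction-rich when the minimum spatial gap and an approximate time-to-closest-approach both fall below assumed thresholds, as below; the function name, thresholds, and toy trajectories are all hypothetical.

```python
# Hypothetical illustration only: the paper's actual interaction metrics are not
# given in the abstract. This sketch flags a clip as "interaction-rich" when two
# sampled trajectories come close in both space and time.
import numpy as np

def interaction_score(ego_xy: np.ndarray, other_xy: np.ndarray, dt: float = 0.1,
                      gap_thresh: float = 10.0, ttc_thresh: float = 3.0) -> dict:
    """ego_xy, other_xy: (T, 2) arrays of positions sampled every dt seconds."""
    gaps = np.linalg.norm(ego_xy - other_xy, axis=1)   # per-frame distance (m)
    closing = -np.gradient(gaps, dt)                   # closing speed (m/s)
    # Time-to-closest-approach, only meaningful while the agents are converging.
    with np.errstate(divide="ignore", invalid="ignore"):
        ttc = np.where(closing > 1e-3, gaps / closing, np.inf)
    return {
        "min_gap_m": float(gaps.min()),
        "min_ttc_s": float(ttc.min()),
        "interactive": bool(gaps.min() < gap_thresh and ttc.min() < ttc_thresh),
    }

# Toy example: ego drives straight while another agent merges in from the left.
t = np.arange(0, 5, 0.1)
ego = np.stack([10.0 * t, np.zeros_like(t)], axis=1)
other = np.stack([8.0 * t + 15.0, np.maximum(4.0 - 1.5 * t, 0.0)], axis=1)
print(interaction_score(ego, other))
```

In practice, a pipeline of this kind would sweep such a score over all agent pairs in each naturalistic driving segment and keep only clips that exceed the interaction criteria, which is consistent with, but not necessarily identical to, the mining procedure described in the abstract.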