🤖 AI Summary
This work addresses the limitations of existing vision-language models (VLMs) in autonomous driving, namely their reliance on 2D perception and the lack of causal modeling between 3D semantics and actions. To bridge this gap, we introduce the first full-stack vision-language-3D dataset tailored for autonomous driving, coupled with a counterfactual-driven synthetic annotation paradigm. Methodologically, we propose two complementary frameworks: Omni-L, which achieves language-3D alignment through multimodal alignment training, and Omni-Q, which supports query-driven 3D trajectory generation through joint 3D trajectory-language modeling and high-fidelity synthetic supervision. Evaluated on the DriveLM visual question answering and nuScenes open-loop planning benchmarks, our approach significantly outperforms state-of-the-art methods. It delivers denser, semantically richer 3D supervision signals and, for the first time, enables VLMs to evolve from 2D perception systems into fully 3D-reasoning autonomous driving agents.
📝 Abstract
Advances in vision-language models (VLMs) have sparked growing interest in leveraging their strong reasoning capabilities for autonomous driving. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to how human drivers consider alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.