🤖 AI Summary
This work addresses the challenge of extending vision-language models (VLMs) from 2D perception to end-to-end 3D scene understanding for autonomous driving. Methodologically, it introduces OmniDrive, a holistic vision-language dataset built on a counterfactual reasoning paradigm for 3D driving tasks: a synthetic annotation pipeline evaluates alternative trajectories and their outcomes, yielding large-scale, high-quality supervision that bridges planning trajectories and language-based reasoning. On top of this dataset, two complementary OmniDrive-Agent frameworks, Omni-L and Omni-Q, are explored to disentangle the contributions of vision-language alignment and 3D perception in LLM-agent design. Evaluated on the DriveLM question-answering benchmark and nuScenes open-loop planning, the approach delivers significant improvements over state-of-the-art methods while providing denser, more interpretable, and more robust decision-level supervision, pointing to a practical recipe for leveraging VLMs in safety-critical autonomous driving systems.
📝 Abstract
Advances in vision-language models (VLMs) have spurred growing interest in leveraging their strong reasoning capabilities for autonomous driving. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, much as human drivers consider alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.
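
The counterfactual annotation idea can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not the paper's actual pipeline: it rolls out a few alternative maneuvers, checks each for collisions and drivable-area violations, and emits a language label per outcome, the kind of trajectory-to-text supervision the abstract describes. All names (`Scene`, `rollout`, `annotate`, the action set, and the thresholds) are invented for this example.

```python
# Hypothetical sketch of counterfactual trajectory annotation
# (not OmniDrive's actual code or data format).
from dataclasses import dataclass
import numpy as np

@dataclass
class Scene:
    ego_xy: np.ndarray     # (2,) current ego position
    obstacles: np.ndarray  # (N, 2) obstacle centers in the same frame
    half_width: float      # drivable corridor half-width around y = 0

# Candidate counterfactual maneuvers: name -> final lateral offset (m).
ACTIONS = {"keep_lane": 0.0, "veer_left": 2.5, "veer_right": -1.8}

def rollout(scene: Scene, offset: float, horizon: int = 6) -> np.ndarray:
    """Constant-speed rollout with a linear lateral drift (a stand-in for
    a real kinematic model); returns (horizon, 2) future waypoints."""
    steps = np.arange(1, horizon + 1, dtype=float)
    xs = scene.ego_xy[0] + 2.0 * steps              # ~2 m forward per step
    ys = scene.ego_xy[1] + offset * steps / horizon
    return np.stack([xs, ys], axis=1)

def annotate(scene: Scene) -> list[str]:
    """Emit one language label per counterfactual: a dense, decision-level
    supervision signal linking trajectories and text."""
    labels = []
    for name, offset in ACTIONS.items():
        traj = rollout(scene, offset)
        # Distance from every waypoint to every obstacle.
        dists = np.linalg.norm(traj[:, None, :] - scene.obstacles[None], axis=-1)
        if (dists < 1.0).any():                     # 1 m safety radius
            labels.append(f"'{name}' is unsafe: it intersects another agent.")
        elif (np.abs(traj[:, 1]) > scene.half_width).any():
            labels.append(f"'{name}' is unsafe: it leaves the drivable area.")
        else:
            labels.append(f"'{name}' is feasible with a safe margin.")
    return labels

scene = Scene(ego_xy=np.zeros(2),
              obstacles=np.array([[8.0, 0.5]]),
              half_width=2.0)
for line in annotate(scene):
    print(line)
```

Run as-is, this scene yields one label per maneuver (collision ahead for `keep_lane`, a corridor violation for `veer_left`, and a feasible `veer_right`), illustrating how each counterfactual outcome becomes a natural-language training target rather than a single expert trajectory.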