OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

📅 2024-05-02
📈 Citations: 48
Influential: 6
🤖 AI Summary
This work addresses the limitations of existing vision-language models (VLMs) in autonomous driving, namely their reliance on 2D perception and the lack of causal modeling between 3D semantics and driving actions. To bridge this gap, we introduce a holistic vision-language-3D dataset tailored for autonomous driving, coupled with a counterfactual-driven synthetic annotation paradigm. Methodologically, we propose two complementary frameworks: Omni-L, which strengthens vision-language alignment through multimodal alignment training, and Omni-Q, which supports query-driven 3D perception and trajectory generation with joint trajectory-language modeling and synthetic supervision. Evaluated on the DriveLM visual question answering benchmark and nuScenes open-loop planning, our approach significantly outperforms state-of-the-art methods. The counterfactual annotations deliver denser, semantically richer 3D supervision signals and help VLMs evolve from 2D perception systems into 3D-reasoning autonomous driving agents.
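The counterfactual annotation idea can be pictured as: enumerate alternative ego actions, roll each one forward, and check the outcome against simple rules before turning it into a language label. Below is a minimal sketch of such a loop, assuming a toy kinematic rollout, a drivable-area test, and known obstacle positions; all helper names, models, and thresholds are illustrative and not the paper's actual pipeline.

```python
import numpy as np

def rollout(state, steer, accel, horizon=3.0, dt=0.5):
    """Roll out a toy kinematic model for a candidate (steer, accel) action."""
    x, y, yaw, v = state
    traj = []
    for _ in range(int(horizon / dt)):
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        yaw += steer * dt                 # treat steer as a yaw rate for simplicity
        v = max(0.0, v + accel * dt)
        traj.append((x, y))
    return np.array(traj)

def check_outcome(traj, drivable, obstacles, radius=1.5):
    """Label a counterfactual outcome with simple rule checks."""
    off_road = any(not drivable(x, y) for x, y in traj)
    collision = any(np.linalg.norm(traj - np.asarray(o), axis=1).min() < radius
                    for o in obstacles)
    return {"off_road": off_road, "collision": collision}

# Enumerate counterfactual (steer, accel) actions around the logged ego state.
ego = (0.0, 0.0, 0.0, 8.0)                    # x, y, yaw, speed [m/s]
drivable = lambda x, y: abs(y) < 3.5          # toy corridor as the drivable area
obstacles = [(20.0, 0.5)]                     # a stopped vehicle ahead
for action in [(-0.2, 0.0), (0.0, 0.0), (0.2, 0.0), (0.0, -3.0)]:
    print(action, check_outcome(rollout(ego, *action), drivable, obstacles))
```

Each flagged outcome (keep lane collides, hard left leaves the corridor, braking stays safe, and so on) can then be paired with explanatory text, which is what provides the denser supervision the summary refers to.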

📝 Abstract
The advances in vision-language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.
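Since the abstract reports results on nuScenes open-loop planning, it may help to recall how that benchmark scores a planner: predicted ego waypoints are compared against the logged future trajectory with an L2 displacement error (and a collision rate) at 1 s, 2 s, and 3 s horizons. The sketch below covers the L2 part only, using one common averaging convention; it is an illustration of the metric, not the paper's evaluation code.

```python
import numpy as np

def l2_errors(pred, gt, dt=0.5, horizons=(1.0, 2.0, 3.0)):
    """Average L2 displacement error (metres) up to each horizon.

    pred, gt: arrays of shape (N, T, 2) with waypoints sampled every `dt` seconds.
    """
    err = np.linalg.norm(pred - gt, axis=-1)                # (N, T) per-step error
    return {h: float(err[:, : int(h / dt)].mean()) for h in horizons}

# Toy example: two samples, 6 waypoints (3 s at 2 Hz), a constant 0.3 m lateral offset.
gt = np.zeros((2, 6, 2))
pred = gt + np.array([0.3, 0.0])
print(l2_errors(pred, gt))   # ≈ {1.0: 0.3, 2.0: 0.3, 3.0: 0.3}
```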
Problem

Research questions and friction points this paper is trying to address.

Extending vision-language models from 2D to 3D understanding for autonomous driving
Aligning agent models with 3D driving tasks via counterfactual reasoning
Bridging planning trajectories and language reasoning with synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns agent models with 3D driving tasks
Uses counterfactual reasoning for decision-making
Generates synthetic data for dense supervision (see the sketch after this list)
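To make "dense supervision" concrete: each checked counterfactual can be templated into a question-answer pair that ties a planning outcome to language, in the spirit of DriveLM-style QA. The snippet below is a purely hypothetical template; the field names and wording are illustrative, not the dataset's schema.

```python
def make_qa(action_desc, outcome):
    """Turn a counterfactual action and its rule-check result into a QA pair."""
    question = f"What would happen if the ego vehicle were to {action_desc}?"
    if outcome["collision"]:
        answer = f"It would likely collide with a nearby agent, so {action_desc} is unsafe."
    elif outcome["off_road"]:
        answer = f"It would leave the drivable area, so {action_desc} is unsafe."
    else:
        answer = f"The maneuver appears safe; {action_desc} keeps the ego vehicle on a clear path."
    return {"question": question, "answer": answer}

print(make_qa("change to the left lane", {"collision": False, "off_road": True}))
```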