DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

📅 2024-12-10
🏛️ arXiv.org
📈 Citations: 15
✨ Influential: 1
📄 PDF
🤖 AI Summary
Existing data-driven autonomous driving approaches are typically constrained to single datasets and specific tasks, exhibiting limited generalization. To address this, the authors propose DriveMM, a unified large multimodal model for autonomous driving that jointly handles heterogeneous inputs (e.g., images and multi-view videos) and supports perception, prediction, and planning end to end. Methodologically, DriveMM couples a vision encoder with an LLM, applies curriculum-based pre-training for varied visual signals and basic perception, then augments and standardizes diverse AD datasets for unified instruction fine-tuning across tasks. Evaluations span six public benchmarks plus zero-shot transfer to an unseen dataset, where DriveMM achieves state-of-the-art performance. Code and models are publicly released.
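The staged recipe the summary describes (curriculum pre-training, then unified fine-tuning over standardized AD datasets) can be pictured as a schedule that unfreezes progressively more modules as the data mix broadens. This is a minimal illustrative sketch; the stage names, dataset labels, and module names are assumptions, not taken from the paper's released code.

```python
# Hypothetical sketch of a DriveMM-style curriculum schedule.
# Stage contents and module names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str        # curriculum stage label
    datasets: list   # data sources mixed in at this stage
    trainable: set   # modules unfrozen during this stage

CURRICULUM = [
    # Start with simple image-text alignment, training only the projector.
    Stage("alignment", ["image-caption"], {"projector"}),
    # Broaden to general visual comprehension, unfreezing the LLM.
    Stage("pretrain", ["image-vqa", "multiview-video-qa"], {"projector", "llm"}),
    # Finish with mixed AD instruction tuning across all task types.
    Stage("ad-tuning", ["perception", "prediction", "planning"],
          {"projector", "llm", "vision"}),
]

def schedule(curriculum):
    """Yield (stage name, dataset, trainable modules) steps in order."""
    for stage in curriculum:
        for ds in stage.datasets:
            yield stage.name, ds, stage.trainable

steps = list(schedule(CURRICULUM))
```

A real training loop would iterate batches inside each step and toggle `requires_grad` on the named modules; the sketch only captures the ordering that makes this a curriculum.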

๐Ÿ“ Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited generalization in current autonomous driving models
Integrates diverse data inputs for comprehensive AD tasks
Enhances performance across multiple benchmarks and unseen datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

General large multimodal model for diverse data
Curriculum pre-training for varied visual signals
Augmented and standardized datasets for all-in-one AD fine-tuning
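The "augment and standardize" contribution amounts to mapping heterogeneous, dataset-specific records into one shared instruction-tuning schema so a single model can train on all of them. A hypothetical sketch, with field names and prompt wording invented for illustration (they are not the paper's actual schema):

```python
# Hypothetical standardization of raw AD samples into a unified
# conversation record; keys and prompts are illustrative assumptions.

PROMPTS = {
    "perception": "Describe the important objects around the ego vehicle.",
    "prediction": "Predict the future motion of the highlighted agent.",
    "planning":   "Propose a safe driving action for the ego vehicle.",
}

def standardize(sample: dict) -> dict:
    """Map a dataset-specific sample to one instruction-tuning record."""
    return {
        # Visual input may be a single image or a multi-view video clip.
        "visual": sample["frames"],
        "conversations": [
            {"role": "user", "content": PROMPTS[sample["task"]]},
            {"role": "assistant", "content": sample["answer"]},
        ],
    }

record = standardize({
    "task": "planning",
    "frames": ["cam_front.jpg"],
    "answer": "Slow down and yield.",
})
```

Once every source dataset flows through a mapping like this, perception, prediction, and planning samples become interchangeable training examples for the same LMM, which is what enables the all-in-one fine-tuning stage.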