DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

📅 2024-12-10
🏛️ arXiv.org
📈 Citations: 15
✨ Influential: 1
📄 PDF
🤖 AI Summary
Existing data-driven autonomous driving approaches are typically constrained to single datasets and specific tasks, exhibiting limited generalization. To address this, the authors propose DriveMM, a unified large multimodal model for autonomous driving that jointly handles heterogeneous inputs (e.g., images and multi-view videos) and supports perception, prediction, and planning end to end. Methodologically, DriveMM couples a vision encoder with an LLM, applies curriculum-based pre-training for varied visual signals and basic perception, then augments and standardizes diverse AD datasets for unified instruction fine-tuning across tasks. Evaluations span six public benchmarks plus zero-shot transfer to an unseen dataset, where DriveMM achieves state-of-the-art performance. Code and models are publicly released.
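The staged recipe the summary describes (curriculum pre-training, then unified fine-tuning over standardized AD datasets) can be pictured as a schedule that unfreezes progressively more modules as the data mix broadens. This is a minimal illustrative sketch; the stage names, dataset labels, and module names are assumptions, not taken from the paper's released code.

```python
# Hypothetical sketch of a DriveMM-style curriculum schedule.
# Stage contents and module names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str        # curriculum stage label
    datasets: list   # data sources mixed in at this stage
    trainable: set   # modules unfrozen during this stage

CURRICULUM = [
    # Start with simple image-text alignment, training only the projector.
    Stage("alignment", ["image-caption"], {"projector"}),
    # Broaden to general visual comprehension, unfreezing the LLM.
    Stage("pretrain", ["image-vqa", "multiview-video-qa"], {"projector", "llm"}),
    # Finish with mixed AD instruction tuning across all task types.
    Stage("ad-tuning", ["perception", "prediction", "planning"],
          {"projector", "llm", "vision"}),
]

def schedule(curriculum):
    """Yield (stage name, dataset, trainable modules) steps in order."""
    for stage in curriculum:
        for ds in stage.datasets:
            yield stage.name, ds, stage.trainable

steps = list(schedule(CURRICULUM))
```

A real training loop would iterate batches inside each step and toggle `requires_grad` on the named modules; the sketch only captures the ordering that makes this a curriculum.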

๐Ÿ“ Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited generalization in current autonomous driving models
Integrates diverse data inputs for comprehensive AD tasks
Enhances performance across multiple benchmarks and unseen datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

General large multimodal model for diverse data
Curriculum pre-training for varied visual signals
Augmented and standardized datasets for all-in-one AD fine-tuning
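The "augment and standardize" contribution amounts to mapping heterogeneous, dataset-specific records into one shared instruction-tuning schema so a single model can train on all of them. A hypothetical sketch, with field names and prompt wording invented for illustration (they are not the paper's actual schema):

```python
# Hypothetical standardization of raw AD samples into a unified
# conversation record; keys and prompts are illustrative assumptions.

PROMPTS = {
    "perception": "Describe the important objects around the ego vehicle.",
    "prediction": "Predict the future motion of the highlighted agent.",
    "planning":   "Propose a safe driving action for the ego vehicle.",
}

def standardize(sample: dict) -> dict:
    """Map a dataset-specific sample to one instruction-tuning record."""
    return {
        # Visual input may be a single image or a multi-view video clip.
        "visual": sample["frames"],
        "conversations": [
            {"role": "user", "content": PROMPTS[sample["task"]]},
            {"role": "assistant", "content": sample["answer"]},
        ],
    }

record = standardize({
    "task": "planning",
    "frames": ["cam_front.jpg"],
    "answer": "Slow down and yield.",
})
```

Once every source dataset flows through a mapping like this, perception, prediction, and planning samples become interchangeable training examples for the same LMM, which is what enables the all-in-one fine-tuning stage.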