🤖 AI Summary
Existing data-driven autonomous driving approaches are typically constrained to single datasets and specific tasks, exhibiting limited generalization capability. To address this, we propose DriveMM, a general large multimodal model for autonomous driving that jointly handles heterogeneous inputs (e.g., images and multi-view videos) and supports perception, prediction, and planning end to end. Methodologically, DriveMM pairs a large language model with a vision encoder, using curriculum pre-training to handle varied visual signals and basic perception, followed by fine-tuning on augmented and standardized AD datasets, yielding an all-in-one LMM for autonomous driving. To assess general capability and generalization, the model is evaluated on six public benchmarks and via zero-shot transfer to an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. The code and models are publicly released.
📄 Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting overall capability and the ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess general capability and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.