Distilling Multi-modal Large Language Models for Autonomous Driving

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between poor generalization on long-tail scenarios and low computational efficiency in autonomous driving planning, this paper proposes DiMA: a framework that pioneers distilling world knowledge from multimodal large language models (MLLMs) into a lightweight vision-based end-to-end planner. Key contributions include: (1) a jointly trained semantic-aligned scene encoder; (2) an LLM-optional reasoning mechanism that decouples knowledge transfer from real-time inference; and (3) structured surrogate tasks to align visual–linguistic representations. Evaluated on the nuScenes planning benchmark, DiMA achieves state-of-the-art performance—reducing overall L2 trajectory error by 37% and collision rate by 80%. Notably, on long-tail scenarios, trajectory error further decreases by 44%, demonstrating superior generalization without compromising inference efficiency.

📝 Abstract
Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
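The paper itself provides no code; as a rough illustration of the joint training idea described above (a planning loss on the vision branch plus a feature-alignment distillation term tying the shared scene encoder to the MLLM's representations), one might sketch the objective as follows. All function names, the cosine-similarity alignment term, and the weighting scheme are assumptions for illustration, not DiMA's actual surrogate tasks.

```python
import numpy as np

def l2_trajectory_loss(pred, gt):
    # Mean L2 distance between predicted and ground-truth waypoints
    # (the planning metric reported on nuScenes).
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def alignment_loss(scene_feat, llm_feat):
    # Hypothetical distillation surrogate: 1 - cosine similarity between
    # the shared scene-encoder feature and the MLLM's scene embedding.
    a = scene_feat / np.linalg.norm(scene_feat)
    b = llm_feat / np.linalg.norm(llm_feat)
    return float(1.0 - np.dot(a, b))

def joint_loss(pred_traj, gt_traj, scene_feat, llm_feat, lam=0.5):
    # Joint objective: planning loss plus a weighted distillation term.
    # The MLLM branch (and hence llm_feat) is only needed at training
    # time; at inference the vision-based planner runs alone.
    return l2_trajectory_loss(pred_traj, gt_traj) + lam * alignment_loss(
        scene_feat, llm_feat
    )
```

Because the distillation term acts only on the shared encoder during training, dropping it at inference leaves the vision-based planner's runtime cost unchanged, which is the efficiency property the abstract emphasizes.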
Problem

Research questions and friction points this paper is trying to address.

Autonomous Vehicles
Long-tail Scenarios
Route Planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiMA
Visual Planning
Large Language Models