Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of high GPU memory consumption and inference latency in large models for autonomous driving, as well as the limited efficacy of conventional fine-tuning in enhancing small-model performance. The authors propose a multi-teacher knowledge distillation framework that decomposes the driving task into three stages—perception, reasoning, and planning—and employs a layer-specific attention mechanism to extract fine-grained distillation signals. To mitigate gradient conflicts among teachers of heterogeneous capabilities, an asymmetric gradient projection strategy is introduced. By integrating single-teacher models tailored to distinct capabilities, the method substantially compresses and accelerates vision-language models. Experiments demonstrate that the distilled InternVL3-1B model reduces memory usage by approximately 42× and achieves an 11.4× throughput improvement, outperforming its 78B counterpart on DriveBench overall and surpassing GPT-5.1 in planning-specific metrics.

📝 Abstract
Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
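The abstract does not spell out how the asymmetric gradient projection works. Assuming a PCGrad-style rule in which only the non-primary teacher's gradient is projected when the two conflict (negative cosine similarity), while the primary teacher's gradient is left untouched (hence "asymmetric"), a minimal sketch might look like:

```python
import numpy as np

def asymmetric_grad_projection(g_primary, g_secondary):
    """Combine two teachers' gradients (flattened vectors).

    Hypothetical sketch: if the secondary teacher's gradient conflicts
    with the primary teacher's (negative dot product), subtract its
    component along the primary gradient; the primary gradient is
    never modified, making the projection asymmetric.
    """
    dot = np.dot(g_secondary, g_primary)
    if dot < 0:  # gradients conflict
        g_secondary = g_secondary - dot / (np.dot(g_primary, g_primary) + 1e-12) * g_primary
    return g_primary + g_secondary

# Toy example with conflicting 2-D gradients from two teachers:
g_a = np.array([1.0, 0.0])   # primary teacher
g_b = np.array([-1.0, 1.0])  # secondary teacher, conflicts with g_a
combined = asymmetric_grad_projection(g_a, g_b)
# The combined update no longer opposes the primary teacher's direction.
```

The function names and the choice of which teacher counts as "primary" are illustrative assumptions, not details taken from the paper.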
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
vision-language models
knowledge distillation
model efficiency
capability gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-teacher distillation
asymmetric gradient projection
layer-specific attention
vision-language models
autonomous driving
Weitong Lian
Zhejiang University, Hangzhou, China
Zecong Tang
Zhejiang University, Hangzhou, China
Haoran Li
University of Science and Technology of China
3D Generation, 3D Editing, 3D Understanding
Tianjian Gao
Zhejiang University, Hangzhou, China
Yifei Wang
Zhejiang University, Hangzhou, China
Zixu Wang
Technical University of Munich & Infineon Technologies AG
Deep Learning, LLM, Software Engineering, Autonomous Driving
Lingyi Meng
Zhejiang University, Hangzhou, China
Tengju Ru
Zhejiang University, Hangzhou, China
Zhejun Cui
Zhejiang University, Hangzhou, China
Yichen Zhu
Zhejiang University, Hangzhou, China
Hangshuo Cao
Zhejiang University, Hangzhou, China
Qi Kang
Tongji University
Computational Intelligence, Artificial Intelligence, Machine Learning
Tianxing Chen
The University of Hong Kong, Hong Kong, China
Yusen Qin
D-Robotics, Shenzhen, China
Kaixuan Wang
The University of Hong Kong, Hong Kong, China
Yu Zhang
Associate Professor, Zhejiang University
SLAM, 3D Vision, Robotics