🤖 AI Summary
This work addresses the challenges of high GPU memory consumption and inference latency in large models for autonomous driving, as well as the limited efficacy of conventional fine-tuning in enhancing small-model performance. The authors propose a multi-teacher knowledge distillation framework that decomposes the driving task into three stages—perception, reasoning, and planning—and employs a layer-specific attention mechanism to extract fine-grained distillation signals. To mitigate gradient conflicts among teachers of heterogeneous capabilities, an asymmetric gradient projection strategy is introduced. By integrating single-teacher models tailored to distinct capabilities, the method substantially compresses and accelerates vision-language models. Experiments demonstrate that the distilled InternVL3-1B model uses roughly 42× less GPU memory and achieves an 11.4× throughput improvement, outperforming its 78B counterpart on overall DriveBench score and surpassing GPT-5.1 on planning-specific metrics.
📝 Abstract
Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42× less GPU memory and ~11.4× higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
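The abstract's "asymmetric gradient projection" can be illustrated with a minimal sketch. The paper's exact formulation is not given here, so the following is an assumption: when a secondary teacher's gradient conflicts with the primary teacher's gradient (negative inner product), only the secondary gradient is projected onto the plane orthogonal to the primary one, while the primary gradient is left untouched (the asymmetry). The function name and the primary/secondary roles are illustrative, not from the paper.

```python
import numpy as np

def asymmetric_grad_projection(g_primary, g_secondary):
    """Combine two teachers' gradients for the student update.

    If the secondary teacher's gradient conflicts with the primary
    teacher's (negative dot product), remove the conflicting component
    from the secondary gradient only; the primary gradient is never
    modified -- hence 'asymmetric'. This is a hypothetical sketch of
    the idea, not the paper's exact algorithm.
    """
    dot = np.dot(g_primary, g_secondary)
    if dot < 0:  # gradients point in conflicting directions
        # Project out the component of g_secondary along g_primary.
        g_secondary = g_secondary - (dot / np.dot(g_primary, g_primary)) * g_primary
    return g_primary + g_secondary

# Example: the secondary teacher pulls directly against the primary one.
g_p = np.array([1.0, 0.0])
g_s = np.array([-1.0, 1.0])
combined = asymmetric_grad_projection(g_p, g_s)  # conflict removed: [1.0, 1.0]
```

After projection, the conflicting component of the weaker signal is discarded, so the stronger teacher's direction is preserved exactly while the non-conflicting part of the other teacher still contributes.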