🤖 AI Summary
To address excessive reasoning-induced latency and computational overhead in vision-language models (VLMs) for autonomous driving, this paper proposes the first early-exit framework that integrates causal inference with driving-domain priors. Unlike conventional heuristic exit strategies, the method constructs a hierarchical causal graph to model dependencies among reasoning steps, and jointly employs dynamic confidence estimation and domain-adaptive exit decision-making to terminate redundant computation once sufficient semantic evidence has been accumulated. Evaluated on the Waymo and CODA benchmarks, the framework reduces inference latency by up to 57.58% while improving object detection accuracy by up to 44%. Real-world deployment on the Autoware Universe platform confirms consistently low latency (an average reduction of 51.2%) and improved robustness. The core contribution is the integration of causal reasoning into VLM early-exit decisions, enabling task-aware, interpretable, and adaptively efficient inference for autonomous driving.
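The exit mechanism described above can be illustrated with a minimal sketch: run the model's layers sequentially, score the intermediate prediction after each one, and stop as soon as the confidence crosses a threshold. Everything here (the `layers` list, the `classify` head, the fixed `threshold`) is illustrative; the paper's actual exit criterion is learned via causal inference over driving-domain priors, which this toy loop does not reproduce.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_inference(hidden, layers, classify, threshold=0.9):
    """Run layers in order; exit once the intermediate prediction is
    confident enough. Returns (exit_layer_index, class_probabilities)."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        probs = classify(hidden)
        if max(probs) >= threshold:  # sufficient evidence: skip remaining layers
            return i, probs
    return len(layers) - 1, probs    # fell through: use the final layer

# Toy usage: each "layer" strengthens the evidence for class 0,
# so the loop exits before reaching the last of the 6 layers.
layers = [lambda h: h + 1.0] * 6
classify = lambda h: softmax([h, 0.0])
exit_layer, probs = early_exit_inference(0.0, layers, classify)
```

With this toy setup the confidence for class 0 grows layer by layer, so inference stops partway through the stack, which is the latency saving the framework exploits.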
📝 Abstract
With the rapid advancement of autonomous driving, deploying Vision-Language Models (VLMs) to enhance perception and decision-making has become increasingly common. However, the real-time application of VLMs is hindered by high latency and computational overhead, limiting their effectiveness in time-critical driving scenarios. This challenge is particularly evident when VLMs exhibit over-inference, continuing to process unnecessary layers even after confident predictions have been reached. To address this inefficiency, we propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers. We evaluate our method on large-scale real-world autonomous driving datasets, including Waymo and the corner-case-focused CODA, as well as on a real vehicle running the Autoware Universe platform. Extensive experiments across multiple VLMs show that our method significantly reduces latency, by up to 57.58%, and improves object detection accuracy, by up to 44%.