🤖 AI Summary
This work addresses the vulnerability of end-to-end autonomous driving systems in safety-critical long-tail scenarios, such as construction zones and complex yield situations, by introducing the CLAP framework. CLAP leverages, for the first time, the inherent clustering of roadblocks in the latent space of vision-language-action (VLA) models. By combining supervised contrastive learning with direction-regularized soft prompt optimization, CLAP improves performance in challenging scenarios while keeping the VLA backbone frozen. Location-aware prompts are retrieved on demand via V2X communication to provide targeted adaptation. Evaluated on the NAVSIM benchmark, CLAP reduces planning error in difficult scenarios by 24% across multiple state-of-the-art VLA models while preserving performance in routine driving conditions.
📝 Abstract
End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that handles challenging long-tail scenes without relying on further data scaling or model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations about the VLA latent space: (i) at the VLA's hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this with a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal-frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces planning error on challenging scenarios by 24% with no regression on normal frames.
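As a rough illustration of the first stage, the sketch below computes a standard supervised contrastive loss over a batch of latent vectors from one roadblock, with label 1 for hard frames and 0 for normal frames. The abstract does not give the exact loss or how the hard-scene direction is extracted, so everything here is an assumption: `supcon_loss` and `hard_direction` are hypothetical names, the temperature is an illustrative default, and the direction is taken to be the normalized difference of class means.

```python
import numpy as np


def supcon_loss(z, labels, temperature=0.1):
    """Standard supervised contrastive loss over one batch.

    z:      (n, d) array of latent vectors (assumed: per-frame hidden states).
    labels: (n,) integer array, e.g. 1 = hard frame, 0 = normal frame.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # shift for numerical stability

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Log-softmax of each pair over all *other* samples in the batch.
    denom = (np.exp(sim) * not_self).sum(axis=1, keepdims=True)
    log_prob = sim - np.log(denom)

    # Positives: same label, excluding the anchor itself.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0  # skip anchors with no positive in the batch
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())


def hard_direction(z, labels):
    """Assumed definition: unit vector from the normal-frame mean to the hard-frame mean."""
    d = z[labels == 1].mean(axis=0) - z[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)
```

The loss is low when hard and normal frames already form separate groups and high when they are intermixed, which is the situation described in observation (ii). A second stage could then use the direction from `hard_direction` to regularize prompt updates, but the form of that regularizer is not specified in the abstract and is not sketched here.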