V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling

📅 2025-06-26
🤖 AI Summary
Existing cooperative autonomous driving systems face fundamental challenges in robust perception and decision-making for the rare, diverse, and visually degraded long-tail scenarios prevalent in urban environments. To address this, we propose V2X-REALM, a vision-language model (VLM)-based framework for robust cooperative autonomous driving. The method introduces three key innovations: (1) a prompt-driven pipeline for long-tail scenario generation and evaluation; (2) a gated multi-scenario adaptive attention module for dynamic feature recalibration; and (3) a multi-task scenario-aware contrastive learning objective that enhances cross-scenario discriminability and multimodal semantic alignment. By integrating prompt engineering, contrastive learning, and adaptive attention, the approach improves perceptual robustness, semantic reasoning, planning accuracy, and safety under adverse driving conditions, including occlusion, low illumination, and weather degradation. Experiments show consistent gains across multiple benchmarks, establishing a scalable technical pathway toward end-to-end cooperative autonomous driving.
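
The generation pipeline itself is not spelled out in this summary, but the first innovation can be pictured as a text-guided image editor that turns clear-weather vehicle- and infrastructure-side frames into long-tail variants. Below is a minimal sketch using InstructPix2Pix from Hugging Face diffusers as a stand-in generator; the model choice and the `SCENARIO_PROMPTS` templates are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch of prompt-driven long-tail scenario synthesis.
# InstructPix2Pix is a stand-in text-guided editor; the paper's actual
# generator and prompt templates are not specified here.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Illustrative prompt templates for long-tail conditions (assumed, not published).
SCENARIO_PROMPTS = {
    "snow": "make the scene heavily snow-covered with falling snow",
    "fog": "add dense fog that sharply reduces visibility",
    "night": "turn the scene into a dark night with low illumination",
}

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

def synthesize_long_tail(frame: Image.Image, scenario: str) -> Image.Image:
    """Edit a clear-weather frame (vehicle- or infrastructure-side view)
    into the requested long-tail condition via a natural-language prompt."""
    return pipe(
        prompt=SCENARIO_PROMPTS[scenario],
        image=frame,
        num_inference_steps=20,
        image_guidance_scale=1.5,  # keeps the edit faithful to scene geometry
    ).images[0]
```

Each synthesized frame would then be scored and filtered (the "evaluation" half of the pipeline) before being added to the training set.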

📝 Abstract
Ensuring robust planning and decision-making under rare, diverse, and visually degraded long-tail scenarios remains a fundamental challenge for autonomous driving in urban environments. This issue becomes more critical in cooperative settings, where vehicles and infrastructure jointly perceive and reason across complex environments. To address this challenge, we propose V2X-REALM, a vision-language model (VLM)-based framework with adaptive multimodal learning for robust cooperative autonomous driving under long-tail scenarios. V2X-REALM introduces three core innovations: (i) a prompt-driven long-tail scenario generation and evaluation pipeline that leverages foundation models to synthesize realistic long-tail conditions such as snow and fog across vehicle- and infrastructure-side views, enriching training diversity efficiently; (ii) a gated multi-scenario adaptive attention module that modulates the visual stream using scenario priors to recalibrate ambiguous or corrupted features; and (iii) a multi-task scenario-aware contrastive learning objective that improves multimodal alignment and promotes cross-scenario feature separability. Extensive experiments demonstrate that V2X-REALM significantly outperforms existing baselines in robustness, semantic reasoning, safety, and planning accuracy under complex, challenging driving conditions, advancing the scalability of end-to-end cooperative autonomous driving.
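
To make the gated multi-scenario adaptive attention concrete, here is a minimal PyTorch sketch, assuming the scenario prior arrives as an integer label over a fixed set of scenario types; the embedding size, head count, and gating form are our assumptions rather than the published architecture.

```python
# Minimal sketch of gated scenario-adaptive attention; shapes and the
# gating form are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class GatedScenarioAttention(nn.Module):
    def __init__(self, dim: int, num_scenarios: int, num_heads: int = 8):
        super().__init__()
        self.scenario_embed = nn.Embedding(num_scenarios, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual_tokens: torch.Tensor, scenario_id: torch.Tensor):
        # visual_tokens: (B, N, D) fused vehicle/infrastructure features
        # scenario_id:   (B,)      scenario label or predicted scenario prior
        prior = self.scenario_embed(scenario_id).unsqueeze(1)       # (B, 1, D)
        # Condition every token on the scenario prior, then let the
        # conditioned tokens recalibrate each other via self-attention.
        cond = visual_tokens + prior
        recalib, _ = self.attn(cond, cond, cond)
        # A per-token gate decides how much recalibration to apply.
        g = self.gate(torch.cat([visual_tokens, recalib], dim=-1))  # (B, N, D)
        return visual_tokens + g * recalib

# Example: out = GatedScenarioAttention(dim=256, num_scenarios=6)(feats, labels)
```

The additive, gated residual lets the block fall back to the unmodified visual stream (gate near zero) when the scenario prior is uninformative, which matches the stated goal of recalibrating only ambiguous or corrupted features.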
Problem

Research questions and friction points this paper is trying to address.

Robust planning under rare long-tail driving scenarios
Cooperative perception in complex urban environments
Multimodal learning for degraded visual conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-driven long-tail scenario generation pipeline
Gated multi-scenario adaptive attention module
Multi-task scenario-aware contrastive learning objective (sketched below)
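
A plausible shape for that contrastive objective, assuming it pairs a CLIP-style image-text alignment term with a supervised contrastive term over scenario labels; the temperature and loss weights are placeholders, and the paper's exact formulation may differ.

```python
# Illustrative sketch of a scenario-aware contrastive objective, assuming
# paired image/text embeddings and integer scenario labels per sample.
import torch
import torch.nn.functional as F

def scenario_contrastive_loss(img, txt, scenario, tau=0.07,
                              w_align=1.0, w_scn=0.5):
    # img, txt: (B, D) embeddings; scenario: (B,) integer labels.
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    B = img.size(0)

    # (a) Multimodal alignment: symmetric InfoNCE between paired image/text.
    logits = img @ txt.t() / tau                       # (B, B)
    targets = torch.arange(B, device=img.device)
    align = 0.5 * (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets))

    # (b) Cross-scenario separability: supervised contrastive term pulling
    # same-scenario features together and pushing other scenarios apart.
    sim = img @ img.t() / tau
    pos = scenario.unsqueeze(0).eq(scenario.unsqueeze(1)).float()
    pos.fill_diagonal_(0)                              # exclude self-pairs
    diag = torch.eye(B, dtype=torch.bool, device=img.device)
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(diag, float("-inf")), dim=1, keepdim=True)
    scn = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)

    return w_align * align + w_scn * scn.mean()
```

Term (a) enforces the multimodal semantic alignment named above, while term (b) promotes cross-scenario feature separability; how the paper weights and schedules the two tasks is not specified here.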