CoC-VLA: Delving into Adversarial Domain Transfer for Explainable Autonomous Driving via Chain-of-Causality Visual-Language-Action Model

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Weak generalization to long-tail complex scenarios, such as subtle human behaviors, traffic accidents, and non-compliant driving, remains a critical challenge in autonomous driving. To address this, we propose a vision-language-action (VLA)-driven adversarial domain adaptation framework for end-to-end simulation-to-real transfer. Methodologically, we are the first to integrate VLA models into this cross-domain setting, coupling chain-of-causality reasoning with a learnable textual adapter to enable fine-grained modeling of driving logic. We further design a novel adversarial training mechanism and a tailored backpropagation strategy that jointly leverage the long-tail coverage of synthetic data and the fidelity of real-world data. Experiments demonstrate substantial improvements in the student model's generalization, interpretability, and complex decision-making performance in real-road scenarios. Our results validate the effectiveness of causality-guided, VLA-based domain adaptation for tackling long-tail challenges in autonomous driving.

📝 Abstract

Autonomous driving is a prominent application of artificial intelligence. Recent approaches have shifted from focusing solely on common scenarios to addressing complex, long-tail situations such as subtle human behaviors, traffic accidents, and non-compliant driving patterns. Given the demonstrated capabilities of large language models (LLMs) in understanding visual and natural language inputs and following instructions, recent methods have integrated LLMs into autonomous driving systems to enhance reasoning, interpretability, and performance across diverse scenarios. However, existing methods typically rely either on real-world data, which is suitable for industrial deployment, or on simulation data tailored to rare or hard-case scenarios. Few approaches effectively integrate the complementary advantages of both data sources. To address this limitation, we propose CoC-VLA, a novel VLM-guided, end-to-end adversarial transfer framework for autonomous driving that transfers long-tail handling capabilities from simulation to real-world deployment. The framework comprises a teacher VLM, a student VLM, and a discriminator. The teacher and student share a base architecture, termed the Chain-of-Causality Visual-Language Model (CoC VLM), which integrates temporal information via an end-to-end text adapter. This architecture supports chain-of-thought reasoning to infer complex driving logic. The teacher and student VLMs are pre-trained separately on simulated and real-world datasets, respectively. The discriminator is then trained adversarially, using a novel backpropagation strategy, so that the student VLM acquires the long-tail handling capabilities learned in simulation and carries them over to real-world environments.
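The teacher, student, and discriminator arrangement described above can be illustrated with a generic GAN-style feature-alignment objective. This is a minimal sketch under stated assumptions, not the paper's method: the abstract does not specify the discriminator's inputs, the loss functions, or the novel backpropagation strategy, so the linear discriminator, the binary cross-entropy losses, and every name below (`discriminator_score`, `discriminator_loss`, `student_adversarial_loss`) are hypothetical. Feature vectors stand in for whatever representations the teacher and student VLMs would produce.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_score(features, weights, bias):
    # Hypothetical linear discriminator: estimated probability that
    # `features` came from the teacher (simulation-trained) model.
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)


def bce(p, label, eps=1e-12):
    # Binary cross-entropy for one prediction, clamped for numerical safety.
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def discriminator_loss(teacher_feats, student_feats, weights, bias):
    # Discriminator objective: label teacher features 1, student features 0.
    losses = [bce(discriminator_score(f, weights, bias), 1.0) for f in teacher_feats]
    losses += [bce(discriminator_score(f, weights, bias), 0.0) for f in student_feats]
    return sum(losses) / len(losses)


def student_adversarial_loss(student_feats, weights, bias):
    # Student objective (assumed, standard non-saturating GAN form):
    # make student features look like teacher features to the discriminator.
    losses = [bce(discriminator_score(f, weights, bias), 1.0) for f in student_feats]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    teacher = [[1.0, 2.0], [0.5, 1.5]]    # placeholder teacher features
    student = [[-1.0, 0.5], [-0.5, 0.0]]  # placeholder student features
    w, b = [0.8, 0.3], 0.0                # placeholder discriminator parameters
    print("discriminator loss:", discriminator_loss(teacher, student, w, b))
    print("student adversarial loss:", student_adversarial_loss(student, w, b))
```

In a full training loop the two losses would be minimized in alternation, with the discriminator updated on the first and the student on the second. How gradients are routed between the student and the discriminator is exactly where the paper's own backpropagation strategy would replace this generic scheme.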

Problem

Research questions and friction points this paper is trying to address.

Integrating simulation and real-world data for autonomous driving systems

Transferring long-tail scenario handling from simulation to real deployment

Enhancing reasoning and interpretability in complex driving situations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial transfer framework for autonomous driving

Chain-of-causality visual-language model architecture

Simulation-to-real knowledge transfer via discriminator
