VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

πŸ“… 2026-01-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the insufficient robustness of autonomous driving systems in long-tail, safety-critical scenarios, primarily caused by the severe scarcity of such rare situations in real-world data. To tackle this challenge, the authors propose VILTA, a novel framework that integrates vision-language models (VLMs) directly into the closed-loop reinforcement learning training pipeline. By leveraging the VLM’s real-time comprehension of dynamic environments, VILTA enables fine-grained editing of surrounding agents’ future trajectories, facilitating end-to-end generation of adversarial scenarios. This approach overcomes the generalization limitations inherent in conventional two-stage generation paradigms and significantly enhances policy robustness and safety across diverse, highly realistic long-tail scenarios.

πŸ“ Abstract
The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions, including safety-critical scenario generation and closed-loop learning, often rely on rule-based heuristics, resampling methods, and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such a two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents' future trajectories. This direct-editing approach fully leverages the VLM's powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extend beyond the scope of traditional methods. We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly in its ability to navigate critical long-tail events.
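To make the "direct, fine-grained editing" interface concrete, here is a minimal sketch of one closed-loop adversarial step. All names (`vlm_propose_edit`, `apply_edit`, the waypoint format) are our own illustrative assumptions, not the paper's API, and the VLM is replaced by a random stand-in: the point is only the loop structure, where an adversary inspects the scene and edits one surrounding agent's future trajectory before the ego policy trains on the resulting rollout.

```python
import random

def vlm_propose_edit(scene_summary, trajectories, rng):
    """Stand-in for the VLM adversary (hypothetical interface).

    The real system would feed rendered observations to a vision-language
    model; here we simply pick a surrounding agent and a lateral shift to
    illustrate the direct trajectory-editing step.
    """
    agent_id = rng.choice(sorted(trajectories))
    lateral_shift = rng.uniform(-1.5, 1.5)  # metres toward the ego lane
    return agent_id, lateral_shift

def apply_edit(trajectories, agent_id, lateral_shift):
    """Return a copy of the scene with one agent's future waypoints shifted."""
    edited = {a: list(wps) for a, wps in trajectories.items()}
    edited[agent_id] = [(x, y + lateral_shift) for x, y in edited[agent_id]]
    return edited

def closed_loop_step(trajectories, rng):
    """One adversarial step: the adversary edits the scene; the edited
    scene would then be rolled out to train the driving policy."""
    agent_id, shift = vlm_propose_edit("two vehicles ahead", trajectories, rng)
    return apply_edit(trajectories, agent_id, shift)

rng = random.Random(0)
scene = {"npc_1": [(0.0, 3.5), (5.0, 3.5)],
         "npc_2": [(0.0, -3.5), (5.0, -3.5)]}
adversarial_scene = closed_loop_step(scene, rng)
```

In the paper's framework the edit is chosen strategically from the VLM's reading of the scene rather than sampled at random, which is what lets the curriculum stay both plausible and adversarial.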
Problem

Research questions and friction points this paper is trying to address.

long-tail problem
autonomous driving
safety-critical scenarios
driving policy robustness
scenario generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language Model
Adversarial Training
Autonomous Driving
Trajectory Editing
Long-tail Scenarios