Learning from Mistakes: Post-Training for Driving VLA with Takeover Data

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing end-to-end vision-language-action (VLA) autonomous driving models, which are typically trained offline on static datasets and struggle with distribution shifts. Current post-training approaches rely solely on data collected after human takeovers, lacking mechanisms for proactive exploration and safety margins. To overcome these challenges, we propose TakeVLA, a novel framework that introduces pre-takeover language supervision to cultivate preventive driving behavior. TakeVLA further incorporates a “scene dreaming” mechanism that reconstructs takeover scenarios for reinforcement-based fine-tuning, enabling active exploration and learning from errors. By integrating pre-takeover linguistic annotations, scene reconstruction, and reinforcement learning, our method achieves a 4.93-point improvement in driving score over the SimLingo baseline on the Bench2Drive benchmark and increases the average time-to-collision (TTC) by 11.76%, significantly enhancing both safety and driving performance.

📝 Abstract
Current Vision-Language-Action (VLA) paradigms in end-to-end autonomous driving rely on offline training from static datasets, leaving them vulnerable to distribution shift. Recent post-training methods use takeover data to mitigate this by augmenting the dataset with high-quality expert takeover samples, yet they suffer from two key limitations: supervision restricted to the period after the takeover moment leads to policies with limited safety margins, and passive preference optimization lacks the active exploration needed for optimal performance. In this paper, we propose TakeVLA, a novel VLA post-training framework that overcomes these shortcomings through two complementary innovations. First, we introduce pre-takeover language supervision, which allows the VLA to learn from mistakes proactively. By explicitly teaching the model what to do in error-prone situations, we cultivate a precautionary mindset that anticipates hazards early and substantially enlarges safety margins. Second, we propose Scenario Dreaming, a reinforcement fine-tuning paradigm that operates in reconstructed takeover scenarios, encouraging active exploration beyond mere preference fitting. Experiments on the Bench2Drive benchmark demonstrate that TakeVLA achieves state-of-the-art closed-loop performance, surpassing the strong VLA baseline SimLingo by 4.93 points in driving score, with an enhanced safety margin as evidenced by an 11.76% increase in average time-to-collision (TTC).
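The pre-takeover supervision idea from the abstract can be sketched as follows: frames in a window *before* a human takeover are paired with a corrective language label, so the model is supervised on the lead-up to the error rather than only on the post-takeover correction. Everything here (the window length, the `Frame` structure, the label text) is an illustrative assumption, not the paper's actual implementation.

```python
# Hypothetical sketch of pre-takeover language supervision: annotate the
# frames preceding a takeover with a corrective instruction, leaving
# earlier frames untouched. All names and constants are assumptions.

from dataclasses import dataclass

PRE_TAKEOVER_WINDOW_S = 3.0  # assumed look-back window before the takeover


@dataclass
class Frame:
    timestamp: float          # seconds since episode start
    language_label: str = ""  # supervision target for the VLA's language head


def annotate_pre_takeover(frames, takeover_time, advice):
    """Attach a corrective language label to every frame inside the
    pre-takeover window [takeover_time - window, takeover_time)."""
    for f in frames:
        if takeover_time - PRE_TAKEOVER_WINDOW_S <= f.timestamp < takeover_time:
            f.language_label = advice
    return frames


frames = [Frame(t) for t in (0.0, 1.0, 2.5, 3.5, 4.5)]
annotated = annotate_pre_takeover(frames, takeover_time=5.0,
                                  advice="slow down: pedestrian may cross")
labeled = [f.timestamp for f in annotated if f.language_label]
# With a 3 s window ending at t=5.0, frames at t=2.5, 3.5, 4.5 get labeled.
```

These labeled pre-takeover frames would then join the ordinary training mix, which is how the abstract's "precautionary mindset" could be instilled through standard supervised language targets.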
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
distribution shift
takeover data
safety margin
preference optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

pre-takeover language supervision
Scenario Dreaming
Vision-Language-Action (VLA)
reinforcement fine-tuning
takeover data