SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios

📅 2026-02-09
🤖 AI Summary
This work addresses the performance degradation of autonomous driving systems in long-tail scenarios, where a disconnect between high-level semantic reasoning and low-level reactive control often leads to suboptimal behavior. To bridge this gap, the authors propose a fine-grained language instruction interface that leverages the commonsense reasoning capabilities of vision-language models (VLMs) to generate precise linguistic guidance, which is then integrated into a vision-language-action (VLA) driving policy. This tight alignment between high-level reasoning and low-level control is achieved through language-augmented driving data and an instruction generation mechanism, significantly enhancing model robustness in complex and rare scenarios. Experimental results demonstrate that the proposed approach improves overall closed-loop driving scores by 4.77 points and achieves an 8.04-point gain on the long-tail subset, outperforming current state-of-the-art methods.

📝 Abstract
A fundamental challenge in autonomous driving is the integration of high-level, semantic reasoning for long-tail events with low-level, reactive control for robust driving. While large vision-language models (VLMs) trained on web-scale data offer powerful common-sense reasoning, they lack the grounded experience necessary for safe vehicle control. We posit that an effective autonomous agent should leverage the world knowledge of VLMs to guide a steerable driving policy toward robust control in driving scenarios. To this end, we propose SteerVLA, which leverages the reasoning capabilities of VLMs to produce fine-grained language instructions that steer a vision-language-action (VLA) driving policy. Key to our method is this rich language interface between the high-level VLM and low-level VLA, which allows the high-level policy to more effectively ground its reasoning in the control outputs of the low-level policy. To provide fine-grained language supervision aligned with vehicle control, we leverage a VLM to augment existing driving data with detailed language annotations, which we find to be essential for effective reasoning and steerability. We evaluate SteerVLA on a challenging closed-loop benchmark, where it outperforms state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset. The project website is available at: https://steervla.github.io/.
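To make the two-level design concrete, here is a minimal illustrative sketch (not the authors' code; all function names and rules are hypothetical stand-ins) of the interface the abstract describes: a high-level VLM reasoner emits a fine-grained language instruction, and a low-level VLA policy conditions its control output on that instruction alongside the camera observation.

```python
# Illustrative sketch only: stand-ins for SteerVLA's two levels.
# In the real system both functions would be learned models; here they
# are hand-written rules to show the language interface between them.
from dataclasses import dataclass

@dataclass
class Control:
    steer: float      # [-1, 1], negative = left
    throttle: float   # [0, 1]
    brake: float      # [0, 1]

def high_level_vlm(scene_description: str) -> str:
    """Stand-in for the VLM reasoner: maps a scene to a fine-grained
    language instruction for the low-level policy."""
    if "pedestrian" in scene_description:
        return "slow down and yield to the pedestrian"
    return "maintain lane and current speed"

def low_level_vla(observation: dict, instruction: str) -> Control:
    """Stand-in for the steerable VLA policy: control is conditioned
    on both the observation and the instruction text."""
    if "slow down" in instruction or "yield" in instruction:
        return Control(steer=0.0, throttle=0.0, brake=0.6)
    return Control(steer=0.0, throttle=0.4, brake=0.0)

# One step of the two-level loop.
obs = {"image": None}  # placeholder camera frame
instruction = high_level_vlm("pedestrian stepping off the curb ahead")
cmd = low_level_vla(obs, instruction)
print(instruction, cmd.brake)
```

The key point of the design, per the abstract, is that the instruction string is a rich interface: the high-level model's commonsense reasoning is grounded in something the low-level policy was trained (via language-augmented driving data) to act on.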
Problem

Research questions and friction points this paper addresses.

autonomous driving
long-tail scenarios
vision-language models
driving policy
semantic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action Models
Long-Tail Driving Scenarios
Language-Guided Control
Autonomous Driving
Steerable Policy