🤖 AI Summary
This work addresses two limitations in end-to-end autonomous driving: existing vision-language-action (VLA) models are trained with sparse action supervision and therefore fail to fully exploit their scene understanding capabilities, while world models based on pixel-level reconstruction neglect semantic representation learning. The authors propose LVDrive, a latent visual representation-enhanced VLA architecture that predicts future scenes in a high-level latent space, supervised by a pretrained vision backbone, and jointly models future scenes and motion trajectories within a unified embedding space. A two-stage trajectory decoding strategy explicitly leverages the learned future latent representations, and the joint formulation replaces inefficient autoregressive generation with a single forward pass for future-aware decision-making. Evaluated on the Bench2Drive benchmark, the approach significantly outperforms both purely action-supervised models and image-reconstruction-based world models in closed-loop driving performance.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation-enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to enable future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action-supervised methods and image-reconstruction-based world model approaches.
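To make the idea of latent-space supervision more concrete, the following is a minimal sketch of how a training objective might combine trajectory supervision with an auxiliary future-latent term. It is not the authors' implementation: the abstract does not specify the loss functions, so the cosine distance, the L1 waypoint error, the `latent_weight` parameter, and all function names here are assumptions chosen for illustration. The target tokens stand in for features that a frozen pretrained vision backbone would extract from the actual future frame.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def latent_future_loss(pred_tokens, target_tokens):
    """Mean cosine distance between predicted and target future-scene tokens.

    Assumption: target_tokens are features of the real future frame produced
    by a frozen pretrained vision backbone. No pixels are reconstructed.
    """
    dists = [cosine_distance(p, t) for p, t in zip(pred_tokens, target_tokens)]
    return sum(dists) / len(dists)


def trajectory_loss(pred_waypoints, gt_waypoints):
    """Mean L1 error over (x, y) waypoints (an assumed choice of metric)."""
    total = sum(
        abs(px - gx) + abs(py - gy)
        for (px, py), (gx, gy) in zip(pred_waypoints, gt_waypoints)
    )
    return total / len(gt_waypoints)


def total_loss(pred_tokens, target_tokens, coarse_traj, refined_traj,
               gt_traj, latent_weight=1.0):
    """Hypothetical combined objective.

    Both trajectory stages are supervised by the ground-truth waypoints,
    and the latent term adds dense supervision on the predicted future scene.
    """
    return (
        trajectory_loss(coarse_traj, gt_traj)
        + trajectory_loss(refined_traj, gt_traj)
        + latent_weight * latent_future_loss(pred_tokens, target_tokens)
    )


if __name__ == "__main__":
    # Toy values: two 2-D future tokens and a two-waypoint trajectory.
    pred_tokens = [[1.0, 0.0], [0.0, 1.0]]
    target_tokens = [[1.0, 0.0], [1.0, 0.0]]
    gt_traj = [(0.0, 1.0), (0.0, 2.0)]
    coarse_traj = [(0.5, 1.0), (0.5, 2.0)]
    refined_traj = [(0.0, 1.0), (0.0, 2.0)]
    print(total_loss(pred_tokens, target_tokens,
                     coarse_traj, refined_traj, gt_traj))
```

In this sketch the latent term is what distinguishes the objective from pure action supervision: every predicted future token receives a gradient signal, whereas the trajectory terms only supervise a handful of waypoints. How the refined trajectory is actually conditioned on the predicted latents is left out here, since the abstract does not describe that mechanism.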