🤖 AI Summary
This work proposes GeRo (Generative Scenario Rollouts), a plug-and-play framework for Vision-Language-Action (VLA) models. It targets a limitation of existing end-to-end autonomous driving systems: they rely on imitation learning from sparse trajectory annotations and struggle to support long-horizon, multi-agent scenarios with language-guided reasoning. GeRo combines language-conditioned generation with an autoregressive rollout strategy: given multi-view images, a scenario description, and ego-action questions, it jointly generates future latent tokens and textual responses, enabling temporally consistent, language-aligned multi-step reasoning. A rollout-consistency loss, supervised by ground truth or pseudo-labels, mitigates prediction drift and preserves text-action alignment. On Bench2Drive, GeRo improves driving score by +15.7 and success rate by +26.2 and, combined with reinforcement learning, achieves state-of-the-art closed-loop and open-loop performance with strong zero-shot robustness.
📝 Abstract
Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.
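The autoregressive rollout and the rollout-consistency loss described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the form of the loss, so a mean squared error over latents is assumed here, and `rollout`, `rollout_consistency_loss`, and `toy_step` are hypothetical names standing in for the VLA model's latent-token prediction.

```python
from typing import Callable, List, Sequence

Latent = List[float]  # stand-in for a latent token vector (assumption)


def rollout(step_fn: Callable[[Latent], Latent], z0: Latent, horizon: int) -> List[Latent]:
    """Autoregressive rollout: each predicted latent is fed back as the next input."""
    latents: List[Latent] = []
    z = list(z0)
    for _ in range(horizon):
        z = step_fn(z)
        latents.append(z)
    return latents


def rollout_consistency_loss(predicted: Sequence[Latent], targets: Sequence[Latent]) -> float:
    """Assumed loss form: squared error between rolled-out latents and their
    targets (ground truth or pseudo-labels), averaged over steps and dimensions."""
    if len(predicted) != len(targets) or len(predicted) == 0:
        raise ValueError("predicted and targets must be non-empty and equally long")
    total, count = 0.0, 0
    for p, t in zip(predicted, targets):
        if len(p) != len(t):
            raise ValueError("latent dimensions must match at every step")
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def toy_step(z: Latent) -> Latent:
    """Hypothetical one-step predictor; a real system would call the VLA model here."""
    return [0.9 * v for v in z]


if __name__ == "__main__":
    predicted = rollout(toy_step, [1.0, 2.0], horizon=3)
    targets = [[1.0, 2.0]] * 3  # placeholder targets for illustration only
    print(rollout_consistency_loss(predicted, targets))
```

Because each step consumes the previous prediction, errors compound over the horizon. Penalizing the whole rolled-out sequence against its targets, rather than a single step, is one way to express the drift mitigation the abstract attributes to the rollout-consistency loss.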