Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the disconnect between environmental evolution reasoning and action planning in existing autonomous driving approaches, which often leads to inconsistencies between open-loop predictions and actual decision-making. To bridge this gap, the authors propose a unified vision-language-action (VLA) model that alternately generates future video frames and planning trajectories, thereby establishing a closed-loop interaction between world modeling and control. For the first time, scene prediction and motion planning are tightly interleaved within a single generative framework, augmented with monocular depth estimation to enhance geometric awareness. This integration significantly improves long-horizon predictive capabilities. Evaluated on the NAVSIM benchmark, the model achieves high-fidelity future frame generation while demonstrating competitive performance in closed-loop planning tasks.
📝 Abstract
Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.
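The core idea in the abstract — alternating between imagining the next frame and planning the next ego action, so each decision is conditioned on the imagined future — can be sketched in a few lines. The sketch below is illustrative only: the frame predictor and planner are toy stand-ins (simple arithmetic updates on scalar "frames"), not the paper's actual VLA model, and all names (`InterleavedRollout`, `predict_frame`, `plan_action`) are hypothetical.

```python
# Hedged sketch of an interleaved world-model/planning rollout.
# The predictors are toy stand-ins, not the paper's VLA model.
from dataclasses import dataclass, field


@dataclass
class InterleavedRollout:
    horizon: int = 4
    frames: list = field(default_factory=list)
    actions: list = field(default_factory=list)

    def predict_frame(self, frame, action):
        # Stand-in world model: the imagined next frame depends on both
        # the current frame AND the just-planned action (closed-loop coupling).
        return frame + action

    def plan_action(self, frame):
        # Stand-in planner: the next ego action is conditioned on the
        # most recently imagined frame, not on a pre-computed full rollout.
        return 0.5 * frame

    def rollout(self, initial_frame):
        frame = initial_frame
        for _ in range(self.horizon):
            action = self.plan_action(frame)          # plan on imagined state
            self.actions.append(action)
            frame = self.predict_frame(frame, action)  # imagine next state
            self.frames.append(frame)
        return self.actions, self.frames
```

The contrast with "predict first, plan afterwards" pipelines is that here the loop body interleaves the two calls, so drift in the imagined future is immediately visible to the planner at every step.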
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
world modeling
trajectory planning
closed-loop planning
dynamic traffic scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

interleaved world modeling
closed-loop planning
vision-language-action (VLA)
monocular depth
autonomous driving
Qiqi Liu
Fudan University
Huan Xu
Li Auto Inc.
Jingyu Li
Fudan University
Bin Sun
Li Auto Inc.
Zhihui Hao
Li Auto Inc.
Dangen She
Li Auto Inc.
Xiatian Zhu
University of Surrey
Machine Learning
Computer Vision
Li Zhang
Professor, Fudan University & Shanghai Innovation Institute
computer vision
autonomous driving
world model
embodied AI