🤖 AI Summary
To address the poor generalization of autonomous driving systems in long-tail scenarios, which stems from insufficient world knowledge and weak visual causal reasoning, this paper proposes an integrated understanding-generation-planning framework. It jointly processes multi-frame visual inputs and natural language instructions to simultaneously generate interpretable reasoning chains, physically consistent trajectories, and coherent future video predictions. Methodologically, the paper introduces a unified three-in-one collaborative architecture and a four-stage progressive training paradigm. For the first time, the approach integrates the semantic reasoning capacity of pre-trained vision-language models (VLMs) with the visual dynamic modeling capability of video diffusion models, while augmenting the planning module with chain-of-thought (CoT) reasoning. Leveraging a mixture-of-experts structure and joint fine-tuning on diverse autonomous driving datasets, the method achieves state-of-the-art performance across perception, reasoning, and decision-making tasks, significantly improving generalization and robustness in rare traffic scenarios.
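To make the three-in-one interface concrete, here is a minimal sketch of how a unified model could consume multi-frame observations plus an instruction and emit all three outputs (CoT token logits, a waypoint trajectory, and a future-video latent) through a soft mixture-of-experts layer. Every module, dimension, vocabulary size, and head in this sketch is an illustrative assumption; the paper does not publish this code, and UniUGP's actual architecture will differ.

```python
import torch
import torch.nn as nn

class UniUGPSketch(nn.Module):
    """Illustrative three-in-one interface: understanding (CoT logits),
    planning (waypoints), and generation (future-video latents).
    All names and sizes are assumptions, not the paper's implementation."""

    def __init__(self, d_model=512, horizon=6, n_experts=4, vocab=32000):
        super().__init__()
        self.horizon = horizon
        # Stand-in for a pre-trained VLM backbone (assumption).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Soft mixture-of-experts over shared features (assumption).
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        # Task heads: reasoning tokens, (x, y) waypoints, one video latent frame.
        self.cot_head = nn.Linear(d_model, vocab)
        self.plan_head = nn.Linear(d_model, horizon * 2)
        self.video_head = nn.Linear(d_model, 4 * 8 * 8)

    def forward(self, frames, instruction):
        # frames: (B, T, d) pre-encoded visual tokens;
        # instruction: (B, L, d) embedded language tokens (both assumptions).
        h = self.backbone(torch.cat([frames, instruction], dim=1))
        gates = self.router(h).softmax(dim=-1)                          # (B, S, E)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, S, D, E)
        h = torch.einsum("bsde,bse->bsd", expert_out, gates)            # gate-weighted mix
        pooled = h.mean(dim=1)
        return {
            "cot_logits": self.cot_head(h),                             # per-token reasoning
            "trajectory": self.plan_head(pooled).view(-1, self.horizon, 2),
            "video_latent": self.video_head(pooled),
        }

# Usage with random stand-in embeddings.
model = UniUGPSketch()
out = model(torch.randn(2, 8, 512), torch.randn(2, 16, 512))
print({k: v.shape for k, v in out.items()})
```

A single shared backbone feeding three heads is one plausible reading of "three-in-one collaborative architecture"; the soft-gated expert mixture stands in for whatever expert routing the paper actually uses.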
📝 Abstract
Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world-model-based methods lack the reasoning capabilities of large language models. In this paper, we construct multiple specialized datasets that provide reasoning and planning annotations for complex scenarios. We then propose a unified Understanding-Generation-Planning framework, UniUGP, which synergizes scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP exploits both visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets together with the proposed specialized ones. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
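The abstract's four-stage training strategy can be pictured as a staged schedule that unfreezes different modules and weights different task losses per stage. The sketch below, reusing `UniUGPSketch` from above, shows one hypothetical staging; the stage names, module prefixes, loss weights, and placeholder losses are all assumptions, since the paper's actual schedule, datasets, and objectives are not reproduced here.

```python
import torch

# Hypothetical four-stage progression (assumption): understanding first,
# then generation, then planning, then joint fine-tuning of everything.
STAGES = [
    {"name": "1_understanding", "train": ["backbone", "cot_head"], "weights": {"cot_logits": 1.0}},
    {"name": "2_generation",    "train": ["video_head"],           "weights": {"video_latent": 1.0}},
    {"name": "3_planning",      "train": ["plan_head"],            "weights": {"trajectory": 1.0}},
    {"name": "4_joint",         "train": ["all"],
     "weights": {"cot_logits": 1.0, "video_latent": 1.0, "trajectory": 1.0}},
]

def placeholder_loss(pred):
    # Stand-in for the real per-task losses (e.g. cross-entropy for CoT
    # tokens, L2 for waypoints, a diffusion objective for video) -- assumption.
    return pred.float().pow(2).mean()

def run_stage(model, stage, batches, lr=1e-4):
    """Freeze everything except the stage's modules, then optimize its weighted losses."""
    for name, p in model.named_parameters():
        p.requires_grad = ("all" in stage["train"]
                           or any(name.startswith(m) for m in stage["train"]))
    optim = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for frames, instruction in batches:
        out = model(frames, instruction)
        loss = sum(w * placeholder_loss(out[k]) for k, w in stage["weights"].items())
        optim.zero_grad()
        loss.backward()
        optim.step()

# Run all four stages on random stand-in data.
model = UniUGPSketch()
data = [(torch.randn(2, 8, 512), torch.randn(2, 16, 512))]
for stage in STAGES:
    run_stage(model, stage, data)
```

Freezing by parameter-name prefix is just one simple way to realize "progressively builds these capabilities"; gradients still flow through frozen modules to whichever head is being trained in each stage.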