🤖 AI Summary
To address the poor generalization of autonomous driving systems in long-tail scenarios, which stems from insufficient world knowledge and weak visual causal reasoning, this paper proposes an integrated understanding-generation-planning framework. It jointly processes multi-frame visual inputs and natural language instructions to simultaneously generate interpretable reasoning chains, physically consistent trajectories, and coherent future video predictions. Methodologically, the paper introduces a unified three-in-one collaborative architecture and a four-stage progressive training paradigm. For the first time, the approach integrates the semantic reasoning capacity of pre-trained vision-language models (VLMs) with the visual dynamic modeling capability of video diffusion models, while augmenting the planning module with chain-of-thought (CoT) reasoning. Leveraging a mixture-of-experts structure and joint fine-tuning on diverse autonomous driving datasets, the method achieves state-of-the-art performance across perception, reasoning, and decision-making tasks, significantly improving generalization and robustness in rare traffic scenarios.
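To make the three-in-one interface concrete, here is a minimal sketch of how a unified model could consume multi-frame observations plus an instruction and emit all three outputs (CoT token logits, a waypoint trajectory, and a future-video latent) through a soft mixture-of-experts layer. Every module, dimension, vocabulary size, and head in this sketch is an illustrative assumption; the paper does not publish this code, and UniUGP's actual architecture will differ.

```python
import torch
import torch.nn as nn

class UniUGPSketch(nn.Module):
    """Illustrative three-in-one interface: understanding (CoT logits),
    planning (waypoints), and generation (future-video latents).
    All names and sizes are assumptions, not the paper's implementation."""

    def __init__(self, d_model=512, horizon=6, n_experts=4, vocab=32000):
        super().__init__()
        self.horizon = horizon
        # Stand-in for a pre-trained VLM backbone (assumption).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Soft mixture-of-experts over shared features (assumption).
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        # Task heads: reasoning tokens, (x, y) waypoints, one video latent frame.
        self.cot_head = nn.Linear(d_model, vocab)
        self.plan_head = nn.Linear(d_model, horizon * 2)
        self.video_head = nn.Linear(d_model, 4 * 8 * 8)

    def forward(self, frames, instruction):
        # frames: (B, T, d) pre-encoded visual tokens;
        # instruction: (B, L, d) embedded language tokens (both assumptions).
        h = self.backbone(torch.cat([frames, instruction], dim=1))
        gates = self.router(h).softmax(dim=-1)                          # (B, S, E)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, S, D, E)
        h = torch.einsum("bsde,bse->bsd", expert_out, gates)            # gate-weighted mix
        pooled = h.mean(dim=1)
        return {
            "cot_logits": self.cot_head(h),                             # per-token reasoning
            "trajectory": self.plan_head(pooled).view(-1, self.horizon, 2),
            "video_latent": self.video_head(pooled),
        }

# Usage with random stand-in embeddings.
model = UniUGPSketch()
out = model(torch.randn(2, 8, 512), torch.randn(2, 16, 512))
print({k: v.shape for k, v in out.items()})
```

A single shared backbone feeding three heads is one plausible reading of "three-in-one collaborative architecture"; the soft-gated expert mixture stands in for whatever expert routing the paper actually uses.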
📝 Abstract
Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world-model-based methods lack the reasoning capabilities of large language models. In this paper, we construct multiple specialized datasets that provide reasoning and planning annotations for complex scenarios. We then propose a unified Understanding-Generation-Planning framework, UniUGP, which synergizes scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP exploits both visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets together with the proposed specialized ones. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
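The abstract's four-stage training strategy can be pictured as a staged schedule that unfreezes different modules and weights different task losses per stage. The sketch below, reusing `UniUGPSketch` from above, shows one hypothetical staging; the stage names, module prefixes, loss weights, and placeholder losses are all assumptions, since the paper's actual schedule, datasets, and objectives are not reproduced here.

```python
import torch

# Hypothetical four-stage progression (assumption): understanding first,
# then generation, then planning, then joint fine-tuning of everything.
STAGES = [
    {"name": "1_understanding", "train": ["backbone", "cot_head"], "weights": {"cot_logits": 1.0}},
    {"name": "2_generation",    "train": ["video_head"],           "weights": {"video_latent": 1.0}},
    {"name": "3_planning",      "train": ["plan_head"],            "weights": {"trajectory": 1.0}},
    {"name": "4_joint",         "train": ["all"],
     "weights": {"cot_logits": 1.0, "video_latent": 1.0, "trajectory": 1.0}},
]

def placeholder_loss(pred):
    # Stand-in for the real per-task losses (e.g. cross-entropy for CoT
    # tokens, L2 for waypoints, a diffusion objective for video) -- assumption.
    return pred.float().pow(2).mean()

def run_stage(model, stage, batches, lr=1e-4):
    """Freeze everything except the stage's modules, then optimize its weighted losses."""
    for name, p in model.named_parameters():
        p.requires_grad = ("all" in stage["train"]
                           or any(name.startswith(m) for m in stage["train"]))
    optim = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for frames, instruction in batches:
        out = model(frames, instruction)
        loss = sum(w * placeholder_loss(out[k]) for k, w in stage["weights"].items())
        optim.zero_grad()
        loss.backward()
        optim.step()

# Run all four stages on random stand-in data.
model = UniUGPSketch()
data = [(torch.randn(2, 8, 512), torch.randn(2, 16, 512))]
for stage in STAGES:
    run_stage(model, stage, data)
```

Freezing by parameter-name prefix is just one simple way to realize "progressively builds these capabilities"; gradients still flow through frozen modules to whichever head is being trained in each stage.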