UniUGP: Unifying Understanding, Generation, and Planning for End-to-End Autonomous Driving

📅 2025-12-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the poor generalization of autonomous driving systems in long-tail scenarios, which stems from insufficient world knowledge and weak visual causal reasoning, this paper proposes an integrated understanding-generation-planning framework. The framework jointly processes multi-frame visual inputs and natural language instructions to generate interpretable reasoning chains, physically consistent trajectories, and coherent future video predictions. Methodologically, the authors introduce a unified three-in-one collaborative architecture and a four-stage progressive training paradigm. The approach integrates the semantic reasoning capacity of pre-trained vision-language models (VLMs) with the visual dynamics modeling of video diffusion models, while augmenting the planning module with chain-of-thought (CoT) reasoning. Leveraging a mixture-of-experts structure and joint fine-tuning on diverse autonomous driving datasets, the method achieves state-of-the-art performance across perception, reasoning, and decision-making tasks, with notably improved generalization and robustness in rare traffic scenarios.
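As a rough illustration of the three-in-one design described in the summary, the sketch below shows how a single forward pass could tie understanding, generation, and planning to one shared scene representation. The class and module names (UniUGPSketch, vlm, video_head, traj_head) and their interfaces are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the three-in-one forward pass described above.
# All names here are illustrative placeholders, not the paper's components.
import torch.nn as nn

class UniUGPSketch(nn.Module):
    def __init__(self, vlm, video_head, traj_head):
        super().__init__()
        self.vlm = vlm                # pre-trained vision-language backbone
        self.video_head = video_head  # video diffusion model conditioned on scene latents
        self.traj_head = traj_head    # planning head decoding ego waypoints

    def forward(self, frames, instruction_ids):
        # Understanding: fuse multi-frame observations with the language instruction,
        # producing chain-of-thought tokens plus latent scene features.
        cot_tokens, scene_latents = self.vlm(frames, instruction_ids)
        # Generation: predict a coherent future video from the shared latents.
        future_video = self.video_head(scene_latents)
        # Planning: decode a physically consistent trajectory from the same latents,
        # so all three tasks are grounded in one representation.
        trajectory = self.traj_head(scene_latents)
        return cot_tokens, future_video, trajectory
```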

📝 Abstract
Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
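One way to read the four-stage training strategy mentioned in the abstract is as a schedule that unfreezes different heads over progressively richer data. The stage names, trainable-module splits, and data descriptions below are assumptions for illustration; the paper's actual stages and losses may differ.

```python
# Hypothetical outline of a four-stage progressive training schedule;
# stage names, module groupings, and data sources are assumed, not from the paper.
STAGES = [
    ("understanding", ["vlm_adapter"], "driving QA / captioning data"),
    ("generation", ["video_head"], "unlabeled driving video clips"),
    ("planning", ["traj_head"], "trajectory-annotated driving logs"),
    ("joint", ["vlm_adapter", "video_head", "traj_head"],
     "mixed AD datasets plus the specialized reasoning/planning sets"),
]

def run_stage(model, trainable_prefixes, dataloader, optimizer):
    """Freeze all parameters except the modules assigned to this stage, then fine-tune."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    for batch in dataloader:
        loss = model.training_step(batch)  # assumed per-stage loss (CoT, diffusion, or trajectory)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```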
Problem

Research questions and friction points this paper is trying to address.

Addresses autonomous driving limitations in long-tail scenarios
Unifies reasoning, generation, and planning for end-to-end driving
Enhances planning with visual dynamics and semantic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework synergizes reasoning, generation, and planning
Integrates pre-trained VLMs and video generation models for visual dynamics
Uses a hybrid expert architecture with four-stage progressive training (see the sketch below)
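The "hybrid expert architecture" bullet implies some form of expert routing. Below is a generic soft-routing mixture-of-experts layer of the kind such a design could build on; it is a textbook MoE sketch under that assumption, not the paper's actual expert structure.

```python
# Generic soft-routing mixture-of-experts layer; the expert count and FFN shape
# are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class TokenMoE(nn.Module):
    def __init__(self, dim, num_experts=3):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # per-token routing scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (batch, tokens, dim)
        weights = self.gate(x).softmax(dim=-1)   # (batch, tokens, num_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (expert_outs * weights.unsqueeze(2)).sum(dim=-1)          # weighted expert mix
```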
🔎 Similar Papers
No similar papers found.
Hao Lu
ByteDance Seed
Ziyang Liu
Research Fellow, Harvard Medical School; PhD, Tsinghua University
AI4Bio, Graph Embedding, Large Language Model
Guangfeng Jiang
ByteDance Seed
Yuanfei Luo
ByteDance Seed
Sheng Chen
ByteDance Seed
Yangang Zhang
ByteDance Seed
Ying-Cong Chen
Hong Kong University of Science and Technology (Guangzhou)
Computer Vision and Pattern Recognition