UniMM-V2X: MoE-Enhanced Multi-Level Fusion for End-to-End Cooperative Autonomous Driving

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end multi-agent autonomous driving approaches focus primarily on perception-level collaboration, neglecting consistency with motion planning and control, and fail to fully integrate bird’s-eye-view (BEV) representation with cross-agent interaction. This paper proposes UniMM-V2X, an end-to-end, jointly optimized multi-agent cooperative framework. First, it introduces a dynamic query-sharing mechanism within a unified BEV space, enabling hierarchical coordination across perception, prediction, and planning. Second, it integrates a Mixture-of-Experts (MoE) architecture into both the encoder and the decoder, enabling task-adaptive feature representation and diverse motion modeling via multi-level fusion. Third, extensive experiments on DAIR-V2X demonstrate state-of-the-art performance: a 39.7% improvement in perception accuracy over UniV2X, a 7.2% reduction in trajectory prediction error, and a 33.2% gain in planning performance.

📝 Abstract
Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision-making of standalone intelligence. While recent multi-agent approaches introduce cooperation, they often focus merely on perception-level tasks, overlooking alignment with downstream planning and control, or fall short of leveraging the full capacity of recently emerging end-to-end autonomous driving. In this paper, we present UniMM-V2X, a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi-level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision-making. To adapt to diverse downstream tasks and further enhance the quality of multi-level fusion, we incorporate a Mixture-of-Experts (MoE) architecture to dynamically enhance the BEV representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR-V2X dataset demonstrate that our approach achieves state-of-the-art (SOTA) performance, with a 39.7% improvement in perception accuracy, a 7.2% reduction in prediction error, and a 33.2% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE-enhanced multi-level cooperative paradigm.
Problem

Research questions and friction points this paper is trying to address.

How to enable hierarchical cooperation across perception, prediction, and planning for autonomous driving
How to unify perception-level and prediction-level cooperation through a multi-level fusion strategy
How to enhance BEV representations for diverse downstream tasks using a Mixture-of-Experts architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

A Mixture-of-Experts architecture dynamically enhances BEV representations
A multi-level fusion strategy unifies perception and prediction cooperation via shared queries
Extending MoE into the decoder captures diverse motion patterns
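The paper does not publish implementation details here, but the MoE idea it describes, routing each BEV feature to a small subset of expert networks chosen by a learned gate, can be sketched as follows. This is a minimal, illustrative top-k MoE over flattened BEV grid cells; the expert and gate weights, dimensions, and top-k value are all hypothetical stand-ins, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Minimal top-k Mixture-of-Experts over per-cell BEV features.

    Each expert is a plain linear map, standing in for the expert
    sub-networks of an MoE-enhanced encoder/decoder block.
    """
    def __init__(self, dim, num_experts=4, top_k=2):
        self.top_k = top_k
        self.experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(num_experts)]
        self.gate = rng.standard_normal((dim, num_experts)) / np.sqrt(dim)

    def __call__(self, x):
        # x: (num_cells, dim) -- flattened BEV grid features
        logits = x @ self.gate                        # (cells, experts)
        topk = np.argsort(logits, axis=-1)[:, -self.top_k:]
        out = np.zeros_like(x)
        for i, cell in enumerate(x):
            idx = topk[i]
            w = softmax(logits[i, idx])               # renormalize over top-k
            for weight, e in zip(w, idx):
                out[i] += weight * (cell @ self.experts[e])
        return out

moe = MoELayer(dim=8)
bev = rng.standard_normal((16, 8))   # 16 BEV cells, 8-dim features
fused = moe(bev)
print(fused.shape)  # (16, 8)
```

Because the gate is conditioned on each cell's features, different BEV regions (e.g. those dominated by perception cues vs. motion cues) can activate different experts, which is the task-adaptive behavior the summary attributes to the MoE-enhanced fusion.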
Ziyi Song
Department of Electronic Engineering, Tsinghua University
Chen Xia
Department of Electronic Engineering, Tsinghua University
Chenbing Wang
Department of Electronic Engineering, Tsinghua University
Haibao Yu
The University of Hong Kong
Sheng Zhou
Department of Electronic Engineering, Tsinghua University; State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University
Zhisheng Niu
Professor of Electronic Engineering, Tsinghua University
Green Communication · Radio Resource Management · Queueing Theory