🤖 AI Summary
Existing end-to-end multi-agent autonomous driving approaches focus primarily on perception-level collaboration while neglecting consistency with motion planning and control, and fail to fully integrate bird’s-eye-view (BEV) representations with cross-agent interaction. This paper proposes the first end-to-end, jointly optimized multi-agent collaborative framework. First, it introduces a dynamic query-sharing mechanism within a unified BEV space, enabling hierarchical coordination across perception, prediction, and planning. Second, it pioneers the integration of a Mixture-of-Experts (MoE) architecture into both the encoder and decoder, enabling task-adaptive feature representation and diverse motion modeling via multi-level fusion. Third, extensive experiments on DAIR-V2X demonstrate state-of-the-art performance: a 39.7% improvement in perception accuracy over UniV2X, a 7.2% reduction in trajectory prediction error, and a 33.2% gain in planning performance.
📝 Abstract
Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision-making of standalone intelligence. While recent multi-agent approaches introduce cooperation, they often focus solely on perception-level tasks, overlooking alignment with downstream planning and control, or fall short of leveraging the full capacity of recently emerging end-to-end autonomous driving. In this paper, we present UniMM-V2X, a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi-level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision-making. To adapt to diverse downstream tasks and further enhance the quality of multi-level fusion, we incorporate a Mixture-of-Experts (MoE) architecture that dynamically enhances the BEV representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR-V2X dataset demonstrate that our approach achieves state-of-the-art (SOTA) performance, with a 39.7% improvement in perception accuracy, a 7.2% reduction in prediction error, and a 33.2% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE-enhanced multi-level cooperative paradigm.
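To make the MoE idea concrete, the sketch below shows a generic top-k Mixture-of-Experts layer applied to BEV query vectors: a learned gate scores the experts per token, the top-k experts are selected, and their outputs are mixed with renormalized gate weights. This is only an illustrative sketch of standard MoE routing; the expert count, top-k rule, feature dimension, and all names here are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Illustrative top-k MoE layer: linear experts plus a linear gate."""

    def __init__(self, dim, num_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # One linear expert per slot (dim -> dim), scaled for stability.
        self.experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(num_experts)]
        # Gate maps each token to a score per expert.
        self.gate = rng.standard_normal((dim, num_experts)) / np.sqrt(dim)

    def __call__(self, tokens):                       # tokens: (N, dim)
        scores = softmax(tokens @ self.gate)          # (N, num_experts)
        out = np.zeros_like(tokens)
        for i, tok in enumerate(tokens):
            top = np.argsort(scores[i])[-self.top_k:]   # chosen experts
            w = scores[i, top] / scores[i, top].sum()   # renormalize
            for weight, e in zip(w, top):
                out[i] += weight * (tok @ self.experts[e])
        return out

# Toy input: 6 BEV query vectors of dimension 32 (shapes are arbitrary).
bev_tokens = np.random.default_rng(1).standard_normal((6, 32))
moe = MoELayer(dim=32)
fused = moe(bev_tokens)
print(fused.shape)  # (6, 32)
```

In the paper's framing such a layer would sit in both the encoder (task-adaptive BEV refinement) and the decoder (diverse motion modeling); this sketch only demonstrates the routing mechanism itself.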