AI Summary
To address the strong modality heterogeneity and fusion difficulty in multimodal 3D understanding, this paper proposes a Mixture-of-Experts (MoE)-based Transformer architecture. Methodologically, it introduces specialized expert networks for unimodal processing and cross-modal interaction, employs a Top-1 gating mechanism for efficient feature routing, and designs a multimodal information aggregation module; additionally, it adopts a progressive pretraining strategy guided by 2D semantic priors to improve initialization quality and cross-modal adaptation. The core contribution lies in the systematic integration of MoE into 3D multimodal fusion, significantly enhancing the model's capacity to capture modality-specific characteristics and discrepancies. The approach achieves state-of-the-art performance across four mainstream 3D understanding benchmarks, notably outperforming prior work by 6.1 mIoU on Multi3DRefer, thereby demonstrating both effectiveness and generalizability.
Abstract
Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, which struggles to handle the significant heterogeneity and complexity across modalities, leading to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. The core idea is to deploy a set of specialized "expert" networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better exploit the complementary information hidden in the visual features. An information aggregation module is put forward to further enhance fusion performance. Top-1 gating routes each feature to a single expert within the expert groups, ensuring high efficiency. We further propose a progressive pre-training strategy to better leverage semantic and 2D priors, equipping the network with a good initialization. Our MoE3D achieves competitive performance across four prevalent 3D understanding tasks. Notably, it surpasses the top-performing counterpart by 6.1 mIoU on Multi3DRefer.
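To make the Top-1 gating idea concrete, the following is a minimal NumPy sketch of a Top-1 gated mixture-of-experts layer, not the paper's implementation: the expert count, feature dimension, linear experts, and the scaling of each expert's output by its gate probability are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class Top1MoE:
    """Illustrative Top-1 gated MoE layer (hypothetical, not MoE3D itself).

    Each "expert" here is a single linear map; the gate routes every
    token to exactly one expert, so only one expert runs per token.
    """
    def __init__(self, dim, num_experts):
        self.gate_w = rng.standard_normal((dim, num_experts)) * 0.02
        self.experts = [rng.standard_normal((dim, dim)) * 0.02
                        for _ in range(num_experts)]

    def __call__(self, tokens):          # tokens: (n, dim)
        logits = tokens @ self.gate_w    # (n, num_experts) gating scores
        probs = softmax(logits)
        choice = probs.argmax(axis=-1)   # Top-1 routing decision per token
        out = np.zeros_like(tokens)
        for e, w in enumerate(self.experts):
            mask = choice == e           # tokens routed to expert e
            if mask.any():
                # weight the expert output by its gate probability,
                # as in common Top-1 (switch-style) MoE formulations
                out[mask] = (tokens[mask] @ w) * probs[mask, e:e + 1]
        return out, choice

moe = Top1MoE(dim=16, num_experts=4)
x = rng.standard_normal((8, 16))
y, routing = moe(x)
```

Because only the selected expert's linear map is applied to each token, the per-token compute stays roughly constant as experts are added, which is the efficiency property the abstract attributes to Top-1 gating.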