🤖 AI Summary
Existing 3D multimodal large language models (MLLMs) suffer from two key limitations: incomplete modality coverage and weak query–modality coupling. The first arises from reliance on only one or a few 3D modalities (e.g., RGB or point clouds), yielding incomplete scene representations; the second from uniform token processing that cannot adapt to query-specific modality requirements. This work introduces the first unified sparse Mixture-of-Experts (MoE) architecture tailored for 3D understanding, supporting five modalities: RGB, depth, bird's-eye-view (BEV) maps, point clouds, and voxels. Its core innovation is a learnable, token-level routing mechanism that enables query-driven, dynamic selection of modality-specific experts, departing from fixed-weight fusion paradigms. Combining multimodal tokenization, cross-modal gated routing, and joint 3D representation embedding, the model achieves state-of-the-art performance on multiple standard 3D understanding benchmarks, outperforming both unimodal and conventional multimodal approaches, while improving inference speed by 37% and reducing modality redundancy by 52%.
📝 Abstract
Recent advancements in multimodal large language models (MLLMs) have demonstrated considerable potential for comprehensive 3D scene understanding. However, existing approaches typically utilize only one or a limited subset of 3D modalities, resulting in incomplete representations of 3D scenes and reduced interpretive accuracy. Furthermore, different types of queries inherently depend on distinct modalities, indicating that uniform processing of all modality tokens may fail to effectively capture query-specific context. To address these challenges, we propose Uni3D-MoE, a sparse Mixture-of-Experts (MoE)-based 3D MLLM designed to enable adaptive 3D multimodal fusion. Specifically, Uni3D-MoE integrates a comprehensive set of 3D modalities, including multi-view RGB and depth images, bird's-eye-view (BEV) maps, point clouds, and voxel representations. At its core, our framework employs a learnable routing mechanism within the sparse MoE-based large language model, dynamically selecting appropriate experts at the token level. Each expert specializes in processing multimodal tokens based on learned modality preferences, thus facilitating flexible collaboration tailored to diverse task-specific requirements. Extensive evaluations on standard 3D scene understanding benchmarks and specialized datasets demonstrate the efficacy of Uni3D-MoE.
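The token-level routing described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions, the linear gate, and the single-matrix "experts" are all illustrative assumptions; it only shows the sparse-gating pattern, where each multimodal token (RGB, depth, BEV, point-cloud, or voxel) is dispatched to its top-k experts and their outputs are mixed with renormalized gate weights.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2
n_tokens = 6  # e.g., a mix of RGB, depth, BEV, point-cloud, and voxel tokens

# Hypothetical parameters: a linear gate plus one linear layer per expert.
gate_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_model))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens):
    """Route each token to its top-k experts and mix their outputs."""
    logits = tokens @ gate_w                       # (n_tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]  # chosen experts per token
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        sel = top[t]
        # Renormalize the gate over the selected experts only (sparse gating):
        # the other experts are skipped entirely for this token.
        w = softmax(logits[t, sel])
        for gate, e in zip(w, sel):
            out[t] += gate * (tokens[t] @ expert_w[e])
    return out, top

tokens = rng.normal(size=(n_tokens, d_model))
out, routes = moe_forward(tokens)
```

In a trained model the gate learns modality preferences, so tokens from different modalities end up routed to different expert subsets depending on the query; here the routing is random because the parameters are untrained.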