Polysemous Language Gaussian Splatting via Matching-based Mask Lifting

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key bottlenecks in 3D Gaussian Splatting (3DGS) scenes—lack of plug-and-play capability for 2D open-vocabulary understanding, insufficient unambiguous semantic representation, and cross-view semantic inconsistency—this paper proposes MUSplat, a training-free framework. MUSplat integrates multi-granularity mask-guided semantic enhancement with cross-view consistency matching, semantic-entropy-driven boundary optimization, geometry-aware opacity adaptation, and robust text feature distillation from vision-language models. It achieves the first open-vocabulary 3D semantic modeling without fine-tuning. Leveraging only pre-trained 2D segmentation models to generate initial masks, MUSplat supports multi-concept semantic expression. Evaluated on open-vocabulary 3D object selection and semantic segmentation, it significantly outperforms mainstream supervised baselines. Scene adaptation time is reduced from hours to minutes, striking an effective balance between efficiency and semantic accuracy.
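The summary mentions semantic-entropy-driven boundary optimization. The paper does not spell out the formula here, but a natural reading is Shannon entropy over each Gaussian's group-membership distribution: boundary points that are torn between several object groups score high and become candidates for refinement. The function and array layout below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def semantic_entropy(member_probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy of each Gaussian's group-membership distribution.

    member_probs: (N, K) array; row i holds Gaussian i's (possibly
    unnormalised) affinities to K candidate object groups. High entropy
    flags ambiguous boundary Gaussians for pruning or re-assignment.
    """
    # Normalise rows to a probability distribution, then apply H = -sum(p log p).
    p = member_probs / (member_probs.sum(axis=1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=1)

# A confidently assigned Gaussian scores low; a near-uniform one scores high:
probs = np.array([[0.98, 0.01, 0.01],   # clearly in group 0
                  [0.34, 0.33, 0.33]])  # ambiguous boundary point
H = semantic_entropy(probs)
```

Thresholding `H` (or ranking by it) gives the set of boundary Gaussians whose assignment the pipeline would revisit, e.g. with the geometric opacity cue mentioned in the summary.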

📝 Abstract
Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. However, mainstream methods suffer from three key flaws: (i) their reliance on costly per-scene retraining prevents plug-and-play application; (ii) their restrictive monosemous design fails to represent complex, multi-concept semantics; and (iii) their vulnerability to cross-view semantic inconsistencies corrupts the final semantic representation. To overcome these limitations, we introduce MUSplat, a training-free framework that abandons feature optimization entirely. Leveraging a pre-trained 2D segmentation model, our pipeline generates and lifts multi-granularity 2D masks into 3D, where we estimate a foreground probability for each Gaussian point to form initial object groups. We then optimize the ambiguous boundaries of these initial groups using semantic entropy and geometric opacity. Subsequently, by interpreting each object's appearance across its most representative viewpoints, a Vision-Language Model (VLM) distills robust textual features that reconcile visual inconsistencies, enabling open-vocabulary querying via semantic matching. By eliminating the costly per-scene training process, MUSplat reduces scene adaptation time from hours to mere minutes. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, MUSplat outperforms established training-based frameworks while simultaneously addressing their monosemous limitations.
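The abstract's mask-lifting step ("estimate a foreground probability for each Gaussian point") can be read as a multi-view voting scheme: project each Gaussian centre into every view and count how often it lands inside that view's 2D mask. The sketch below assumes a simple pinhole camera model and ignores occlusion handling; all names and the exact scheme are illustrative, not the paper's implementation.

```python
import numpy as np

def foreground_probability(centers, masks, cams):
    """Fraction of views in which each Gaussian's projected centre falls
    inside the object's 2D mask (a simple per-view voting scheme).

    centers: (N, 3) Gaussian centres in world coordinates
    masks:   list of (H, W) boolean masks, one per view
    cams:    list of (K, R, t) pinhole intrinsics/extrinsics per view
    """
    votes = np.zeros(centers.shape[0])
    for mask, (K, R, t) in zip(masks, cams):
        cam_pts = centers @ R.T + t            # world -> camera frame
        z = cam_pts[:, 2]
        uvw = cam_pts @ K.T                    # camera -> homogeneous pixels
        u = (uvw[:, 0] / z).round().astype(int)
        v = (uvw[:, 1] / z).round().astype(int)
        H, W = mask.shape
        # Only count points in front of the camera and inside the image.
        valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        votes[valid] += mask[v[valid], u[valid]]
    return votes / max(len(masks), 1)

# Toy check: identity camera, 4x4 all-foreground mask.
centers = np.array([[2.0, 2.0, 1.0],    # projects inside the mask
                    [10.0, 10.0, 1.0]]) # projects outside the image
cams = [(np.eye(3), np.eye(3), np.zeros(3))]
fg = foreground_probability(centers, [np.ones((4, 4), bool)], cams)
```

Thresholding `fg` yields the initial object group whose boundary the semantic-entropy and opacity steps then refine.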
Problem

Research questions and friction points this paper is trying to address.

Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting scenes
Overcoming reliance on costly per-scene retraining for 3D semantic representation
Resolving cross-view semantic inconsistencies in 3D open-vocabulary segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework using pre-trained 2D segmentation
Multi-granularity mask lifting with semantic entropy optimization
Vision-language model distills robust textual features
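The final "open-vocabulary querying via semantic matching" step amounts to ranking the distilled per-object features against a text embedding by cosine similarity. The sketch below uses plain arrays in place of real CLIP-style embeddings; the function name and threshold value are assumptions for illustration.

```python
import numpy as np

def query_objects(text_feat, object_feats, threshold=0.25):
    """Rank 3D object groups against a text query by cosine similarity.

    text_feat:    (D,) embedding of the query text (in practice from a
                  VLM text encoder; a toy vector stands in here)
    object_feats: (M, D) distilled feature vector per object group
    Returns (indices of groups above `threshold`, best match first;
             all similarity scores).
    """
    t = text_feat / np.linalg.norm(text_feat)
    o = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    sims = o @ t                       # cosine similarity per object group
    hits = np.where(sims > threshold)[0]
    return hits[np.argsort(-sims[hits])], sims

# Toy check: two of three groups align with the query direction.
text = np.array([1.0, 0.0])
objects = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
ranked, sims = query_objects(text, objects)
```

The selected groups map directly back to sets of Gaussians, which is what makes the pipeline's object selection and segmentation outputs queryable by free-form text.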