Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding

📅 2025-06-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the degradation of multi-view fusion performance in open-vocabulary 3D scene understanding caused by inherent noise in vision-language models (VLMs), this paper proposes MVOV3D, a training-free method that combines precise region-level image and text features from CLIP encoders with 3D geometric priors for denoised multi-view fusion. Departing from conventional supervised fine-tuning paradigms, MVOV3D uses 3D geometric structure to guide the weighted aggregation of 2D multi-view features, suppressing VLM-induced noise while preserving generalizability. MVOV3D establishes a new state of the art in open-vocabulary 3D semantic segmentation, reaching 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 and outperforming leading trained 3D networks by a significant margin.
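The geometry-guided weighted aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' exact formulation: the shapes, the visibility mask, and the cosine-style geometric weights (e.g. agreement between a point's normal and the viewing ray) are all assumptions made for the example.

```python
import numpy as np

def fuse_multiview(feats: np.ndarray, vis: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Fuse per-view 2D features onto 3D points with geometric weighting.

    feats:   (V, N, D) per-view, per-point 2D features (zeros where unseen).
    vis:     (V, N) boolean visibility mask (is point n seen in view v?).
    weights: (V, N) geometric-prior weights, e.g. cosine of the angle between
             the point normal and the viewing ray, down-weighting noisy views.
    Returns  (N, D) fused, L2-normalized per-point features.
    """
    w = weights * vis                          # zero out invisible views
    denom = w.sum(axis=0, keepdims=True)       # (1, N) total weight per point
    denom = np.where(denom > 0, denom, 1.0)    # avoid division by zero
    fused = (w[..., None] * feats).sum(axis=0) / denom.T  # (N, D) weighted mean
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    return fused / np.where(norms > 0, norms, 1.0)
```

Because the weighting is a simple masked average, views that see a point obliquely (low geometric weight) contribute less, which is one plausible way to suppress the per-view VLM noise the paper targets.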

📝 Abstract
Recent open-vocabulary 3D scene understanding approaches mainly focus on training 3D networks through contrastive learning with point-text pairs or by distilling 2D features into 3D models via point-pixel alignment. While these methods show considerable performance on benchmarks with limited vocabularies, they struggle to handle diverse object categories, as the limited amount of 3D data upper-bounds the training of strong open-vocabulary 3D models. We observe that 2D multi-view fusion methods are better suited to understanding diverse concepts in 3D scenes. However, inherent noise in vision-language models leads multi-view fusion to sub-optimal performance. To this end, we introduce MVOV3D, a novel approach aimed at unleashing the potential of 2D multi-view fusion for open-vocabulary 3D scene understanding. We focus on reducing this inherent noise without training, thereby preserving generalizability while enhancing open-world capabilities. Specifically, MVOV3D improves multi-view 2D features by leveraging precise region-level image and text features encoded by CLIP encoders, and incorporates 3D geometric priors to optimize multi-view fusion. Extensive experiments on various datasets demonstrate the effectiveness of our method. Notably, MVOV3D achieves a new record of 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 for challenging open-vocabulary semantic segmentation, outperforming current leading trained 3D networks by a significant margin.
Problem

Research questions and friction points this paper is trying to address.

Reducing inherent noise in vision-language models for 3D scenes
Enhancing multi-view fusion without training for open-vocabulary understanding
Improving 2D feature integration with 3D geometric priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Denoising multi-view fusion with region-level CLIP image and text features
Incorporating 3D geometric priors to optimize multi-view fusion
Enhancing open-vocabulary capability without model training
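Once per-point features are fused, the open-vocabulary step typically assigns each point the category whose CLIP text embedding is most similar. A minimal sketch of that assignment, assuming pre-computed text embeddings for a set of category prompts (the shapes and normalization are illustrative, not taken from the paper):

```python
import numpy as np

def assign_labels(point_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    """Assign each point the category with maximal cosine similarity.

    point_feats: (N, D) fused per-point features.
    text_feats:  (C, D) CLIP text embeddings of the category prompts.
    Returns      (N,) index of the best-matching category per point.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return (p @ t.T).argmax(axis=1)  # cosine similarity, then argmax over C
```

Because the category set only enters through the text embeddings, swapping in a new vocabulary requires no retraining, which is the property the training-free design preserves.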