Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary 3D scene understanding faces challenges in fusing knowledge from heterogeneous foundation models and aligning their disparate representations. Method: We propose the first framework to synergistically integrate cross-modal models, including CLIP, DINOv2, and Stable Diffusion, via a deterministic uncertainty estimation mechanism that adaptively distills and harmonizes multi-model 2D features. It unifies semantic priors from vision-language models (VLMs) with geometric awareness from spatially aware visual models, leveraging cross-modal alignment, multi-model collaborative distillation, and joint 3D point cloud–text embedding learning to establish a unified representation space. Contribution/Results: Our approach achieves significant gains in open-vocabulary 3D semantic segmentation on ScanNetV2 and Matterport3D, demonstrating strong cross-domain alignment and superior spatial perception performance.

📝 Abstract
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models (such as CLIP, DINOv2, and Stable Diffusion) into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. The code will be available at https://github.com/TyroneLi/CUA_O3D.
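The core mechanism the abstract describes, adaptively weighting the distillation from each 2D teacher by a learned, model-specific uncertainty, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the dictionary-of-teachers interface, and the use of the standard homoscedastic-uncertainty weighting form (exp(-s_k) * error + s_k) are assumptions; the projection heads that map 3D point features into each teacher's embedding space are omitted.

```python
import numpy as np

def uncertainty_weighted_distill_loss(point_feats, teacher_feats, log_vars):
    """Hypothetical sketch of uncertainty-aware multi-teacher distillation.

    point_feats   : dict teacher_name -> (N, D_k) array of 3D point features
                    already projected into teacher k's embedding space.
    teacher_feats : dict teacher_name -> (N, D_k) array of 2D features from
                    teacher k (e.g. CLIP, DINOv2, Stable Diffusion) lifted
                    onto the same N points.
    log_vars      : dict teacher_name -> scalar log-variance s_k, which in
                    the real system would be predicted per model/point.

    Each teacher's term follows the common uncertainty-weighting form
    exp(-s_k) * ||f_k - t_k||^2 + s_k, so teachers the model deems
    unreliable (large s_k) are down-weighted, while the +s_k term
    prevents the trivial solution of inflating every uncertainty.
    """
    total = 0.0
    for name, f in point_feats.items():
        t = teacher_feats[name]
        s = log_vars[name]
        mse = np.mean((f - t) ** 2)      # alignment error to teacher k
        total += np.exp(-s) * mse + s    # uncertainty-weighted contribution
    return total
```

With s_k = 0 for all teachers this reduces to a plain sum of per-teacher MSE terms; raising s_k for a noisy teacher shrinks its gradient contribution, which is the harmonization effect the method relies on.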
Problem

Research questions and friction points this paper is trying to address.

Integrate multiple foundation models for 3D scene understanding
Adaptively distill heterogeneous 2D feature embeddings
Reconcile semantic and geometric knowledge uncertainties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates multiple foundation models for 3D understanding
Uses deterministic uncertainty estimation for feature harmonization
Combines semantic and geometric knowledge from VLMs