🤖 AI Summary
Existing open-vocabulary 3D semantic segmentation methods treat multi-view images merely as intermediaries for transferring text-aligned features (e.g., from CLIP), neglecting their intrinsic semantics and inter-view correspondences, which limits performance. To address this, we propose PGOV3D, a partial-to-global curriculum learning framework: the model is first pre-trained on semantically dense partial point clouds back-projected from multi-view RGB-D frames, then fine-tuned on complete scene-level point clouds. Open-vocabulary pseudo-labels generated with a multi-modal large language model and a 2D segmentation foundation model supervise the first stage, and pseudo-labels from the pre-trained model bridge the partial-to-global semantic gap in the second. An auxiliary inter-frame consistency module explicitly enforces cross-view feature alignment. PGOV3D achieves competitive performance on the ScanNet, ScanNet200, and S3DIS benchmarks, demonstrating the effectiveness of fully leveraging both the semantic and geometric correlations inherent in multi-view imagery.
📝 Abstract
Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by projecting text-aligned features (e.g., from CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. The key innovation lies in a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary learning, we leverage a multi-modal large language model (MLLM) and a 2D segmentation foundation model to generate open-vocabulary labels for each viewpoint, offering rich and well-aligned supervision. An auxiliary inter-frame consistency module is introduced to enforce feature consistency across varying viewpoints and to enhance spatial understanding. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex. We aggregate the partial vocabularies associated with each scene and generate pseudo labels using the pre-trained model, effectively bridging the semantic gap between dense partial observations and large-scale 3D environments. Extensive experiments on the ScanNet, ScanNet200, and S3DIS benchmarks demonstrate that PGOV3D achieves competitive performance in open-vocabulary 3D semantic segmentation.
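To make the pixel-wise depth projection step concrete, here is a minimal sketch (not the authors' code) of back-projecting one RGB-D frame into a world-space partial point cloud. It assumes a standard pinhole intrinsic matrix `K` and a camera-to-world pose `T`; all names are illustrative.

```python
import numpy as np

def backproject_rgbd(depth, K, T, max_depth=10.0):
    """Back-project one RGB-D frame into a world-space partial point cloud.

    depth : (H, W) depth map in meters
    K     : (3, 3) pinhole camera intrinsics
    T     : (4, 4) camera-to-world extrinsic pose
    Returns an (N, 3) array of world coordinates for valid pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = (depth > 0) & (depth < max_depth)       # drop missing / far returns
    z = depth[valid]
    # Invert the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous coords
    pts_world = (T @ pts_cam.T).T[:, :3]            # camera frame -> world frame
    return pts_world
```

Aggregating such per-frame clouds across neighboring viewpoints, with each point carrying the per-pixel open-vocabulary label from the MLLM and 2D segmentation model, is how the dense partial scenes of the first stage can be assembled.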
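The abstract does not detail the internals of the inter-frame consistency module. One plausible realization of cross-view feature alignment is to penalize feature disagreement between corresponding points observed from two viewpoints; the sketch below assumes correspondences are already available (e.g., pixels that back-project to the same 3D point) and uses a cosine objective. The function and argument names are hypothetical, not the paper's API.

```python
import torch
import torch.nn.functional as F

def inter_frame_consistency_loss(feats_a, feats_b, corr_a, corr_b):
    """Encourage matching points seen from two viewpoints to share features.

    feats_a : (Na, C) per-point features from view A
    feats_b : (Nb, C) per-point features from view B
    corr_a, corr_b : (M,) index tensors; corr_a[i] in view A matches
                     corr_b[i] in view B (e.g., same back-projected 3D point)
    """
    fa = F.normalize(feats_a[corr_a], dim=-1)   # unit-norm matched features
    fb = F.normalize(feats_b[corr_b], dim=-1)
    # 1 - cosine similarity, averaged over all correspondences
    return (1.0 - (fa * fb).sum(dim=-1)).mean()
```

A term like this would typically be added to the segmentation objective with a small weight, e.g. `loss = seg_loss + lam * inter_frame_consistency_loss(...)`.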