Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

📅 2024-08-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost of LiDAR annotation and the scarcity of supervised data in vision-only 3D semantic scene completion (SSC), this paper proposes the first semi-supervised framework that leverages 2D vision foundation models (SAM and CLIP) to guide 3D reconstruction via cross-modal geometric and semantic priors. The method integrates feature distillation, multi-scale 3D decoding, optimized pseudo-labeling, and consistency regularization, and is compatible with mainstream architectures such as 2D→3D lifting and 3D→2D Transformers. On SemanticKITTI and NYUv2, it achieves up to 85% of fully supervised performance using only 10% labeled data, substantially reducing annotation dependency. The core contribution is a pioneering cross-modal guidance paradigm that harnesses 2D foundation models to drive 3D SSC, demonstrating strong generalization and practical deployability and advancing vision-only 3D spatial understanding toward real-world applicability.

📝 Abstract
Accurate prediction of 3D semantic occupancy from 2D visual images is vital in enabling autonomous agents to comprehend their surroundings for planning and navigation. State-of-the-art methods typically employ fully supervised approaches, necessitating a huge labeled dataset acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotating process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, facilitating a more efficient training process. Our framework exhibits notable properties: (1) Generalizability, applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated through experiments on SemanticKITTI and NYUv2, wherein our method achieves up to 85% of the fully-supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption in camera-based systems for 3D semantic occupancy prediction.
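The abstract's semi-supervised recipe hinges on training with sparse labels by generating pseudo-labels for unlabeled voxels and supervising only where the teacher is confident. The paper does not publish this exact code; the snippet below is a minimal NumPy sketch of the general technique (confidence-thresholded pseudo-labeling with a masked cross-entropy loss), with all function names and the threshold value being illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pseudo_labels(teacher_probs, threshold=0.9):
    """Keep only voxels where the teacher's top class probability
    exceeds the confidence threshold; mark the rest as ignored (-1).
    teacher_probs: (N, C) per-voxel class probabilities (illustrative)."""
    conf = teacher_probs.max(axis=-1)
    labels = teacher_probs.argmax(axis=-1)
    labels[conf < threshold] = -1  # low-confidence voxels get no supervision
    return labels

def masked_cross_entropy(student_probs, labels, eps=1e-8):
    """Cross-entropy averaged over pseudo-labeled voxels only."""
    mask = labels >= 0
    if not mask.any():
        return 0.0
    picked = student_probs[mask, labels[mask]]  # prob of the pseudo-class
    return float(-np.mean(np.log(picked + eps)))

# Toy example: 4 voxels, 3 semantic classes.
teacher = np.array([[0.95, 0.03, 0.02],   # confident -> pseudo-label 0
                    [0.50, 0.30, 0.20],   # uncertain -> ignored
                    [0.05, 0.92, 0.03],   # confident -> pseudo-label 1
                    [0.33, 0.33, 0.34]])  # uncertain -> ignored
student = teacher.copy()

labels = pseudo_labels(teacher)   # [0, -1, 1, -1]
loss = masked_cross_entropy(student, labels)
```

In the full framework the teacher's predictions would additionally be refined by SAM/CLIP-derived geometric and semantic cues, and a consistency term would compare student predictions across augmented views; the masking idea above is the common core of such pipelines.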
Problem

Research questions and friction points this paper is trying to address.

Autonomous Robots
2D to 3D Understanding
Efficient Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-supervised learning
2D to 3D conversion
data efficiency
Duc-Hai Pham
VinAI Research, Vietnam
Duc Dung Nguyen
Ho Chi Minh City University of Technology (HCMUT)
H. Pham
VinAI Research, Vietnam
Ho Lai Tuan
VinAI Research, Vietnam
P. Nguyen
VinAI Research, Vietnam
Khoi Nguyen
VinAI Research, Vietnam
Rang Nguyen
VinAI Research, Vietnam