🤖 AI Summary
To address the high cost of LiDAR annotation and the scarcity of labeled data in vision-only 3D semantic scene completion (SSC), this paper proposes the first semi-supervised framework that leverages 2D vision foundation models (SAM and CLIP) to guide 3D reconstruction via cross-modal geometric and semantic priors. The method integrates feature distillation, multi-scale 3D decoding, optimized pseudo-labeling, and consistency regularization, and is compatible with mainstream architectures such as 2D→3D lifting and 3D→2D Transformers. On SemanticKITTI and NYUv2, it achieves up to 85% of fully supervised performance using only 10% labeled data, substantially reducing annotation dependency. The core contribution is a cross-modal guidance paradigm that harnesses 2D foundation models to drive 3D SSC, demonstrating strong generalization and practical deployability and advancing vision-only 3D spatial understanding toward real-world applicability.
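As a rough illustration of the pseudo-labeling and consistency-regularization components named above, here is a minimal PyTorch-style sketch of one semi-supervised SSC training step. Everything in it is an assumption made for illustration: the `student`/`teacher` models, the confidence threshold, the loss weights, and the weak/strong augmentation scheme are hypothetical placeholders, not the paper's actual interface.

```python
# Hypothetical sketch of one semi-supervised SSC training step:
# supervised loss on labeled voxels, a pseudo-label loss on unlabeled
# scans (confidence-thresholded teacher predictions), and a
# consistency term between student and teacher distributions.
import torch
import torch.nn.functional as F

def semi_supervised_step(student, teacher, labeled, unlabeled,
                         conf_thresh=0.9, w_pseudo=1.0, w_consist=0.1,
                         ignore_index=255):
    imgs_l, voxel_gt = labeled            # images + voxel-wise labels
    imgs_u, imgs_u_aug = unlabeled        # weak/strong views of unlabeled scans

    # Supervised term on the small labeled split (e.g., 10% of scans).
    logits_l = student(imgs_l)            # (B, C, X, Y, Z) class logits
    loss_sup = F.cross_entropy(logits_l, voxel_gt, ignore_index=ignore_index)

    # Teacher produces pseudo-labels on the weakly augmented view.
    with torch.no_grad():
        probs_u = teacher(imgs_u).softmax(dim=1)
        conf, pseudo = probs_u.max(dim=1)
        pseudo[conf < conf_thresh] = ignore_index  # drop low-confidence voxels

    # Student is trained on the strongly augmented view.
    logits_u = student(imgs_u_aug)
    loss_pseudo = F.cross_entropy(logits_u, pseudo, ignore_index=ignore_index)

    # Consistency regularization: student matches the teacher distribution.
    loss_consist = F.kl_div(logits_u.log_softmax(dim=1), probs_u,
                            reduction="batchmean")

    return loss_sup + w_pseudo * loss_pseudo + w_consist * loss_consist
```

In a mean-teacher style setup the teacher weights would be an exponential moving average of the student's; the threshold and loss weights here are illustrative defaults, not values reported by the paper.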
📝 Abstract
Accurate prediction of 3D semantic occupancy from 2D visual images is vital for enabling autonomous agents to comprehend their surroundings for planning and navigation. State-of-the-art methods typically employ fully supervised approaches, which require large labeled datasets acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotation process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, enabling a more efficient training process. Our framework exhibits two notable properties: (1) Generalizability: it is applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness: in experiments on SemanticKITTI and NYUv2, our method achieves up to 85% of the fully supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption of camera-based systems for 3D semantic occupancy prediction.
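To make the idea of "geometric and semantic cues" from 2D foundation models concrete, below is a hedged sketch of one plausible ingredient: distilling a frozen 2D foundation-model feature map into 3D voxel features through camera projection. The function name, the pinhole projection, and the cosine-distillation loss are all assumptions for illustration, not the paper's published method.

```python
# Hypothetical sketch: distilling 2D foundation-model features into
# 3D voxel features via camera projection. The projection math and
# the frozen 2D backbone's feature map are placeholder assumptions.
import torch
import torch.nn.functional as F

def distill_2d_to_3d(voxel_feats, voxel_centers, feats_2d, cam_K, cam_T):
    """voxel_feats: (N, D) learned 3D features; voxel_centers: (N, 3)
    world coordinates; feats_2d: (D, H, W) frozen 2D feature map;
    cam_K: (3, 3) intrinsics; cam_T: (4, 4) world-to-camera extrinsics."""
    # Project voxel centers into the image plane (pinhole model).
    cam = (cam_T[:3, :3] @ voxel_centers.T + cam_T[:3, 3:]).T   # (N, 3)
    uv = (cam_K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)                  # (N, 2) pixels

    # Keep voxels that land inside the image with positive depth.
    D, H, W = feats_2d.shape
    valid = ((cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W)
             & (uv[:, 1] >= 0) & (uv[:, 1] < H))

    # Bilinearly sample 2D features at the projected locations.
    grid = uv[valid].clone()
    grid[:, 0] = grid[:, 0] / (W - 1) * 2 - 1   # normalize to [-1, 1]
    grid[:, 1] = grid[:, 1] / (H - 1) * 2 - 1
    target = F.grid_sample(feats_2d[None], grid[None, :, None, :],
                           align_corners=True)[0, :, :, 0].T    # (M, D)

    # Cosine-similarity distillation pulls 3D features toward the
    # frozen 2D teacher features for visible voxels.
    return 1 - F.cosine_similarity(voxel_feats[valid], target, dim=1).mean()
```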