🤖 AI Summary
This paper addresses unsupervised single-image Semantic Scene Completion (SSC), i.e., jointly inferring dense 3D geometry and semantics from a single RGB image without any semantic or geometric ground truth. To this end, it proposes SceneDINO, the first SSC framework to combine self-supervised visual representation learning (based on DINO) with multi-view consistency constraints for fully self-supervised training. Given a single image, SceneDINO predicts 3D geometry and 3D DINO features in one feed-forward pass. A 3D feature distillation mechanism then turns these features into unsupervised 3D semantics. Experiments show that SceneDINO achieves state-of-the-art segmentation accuracy in both 3D and 2D unsupervised scene understanding. Under linear probing, its 3D features match the segmentation accuracy of a current supervised SSC approach, and the method also exhibits domain generalization and multi-view consistency.
📝 Abstract
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single-image 3D scene understanding.
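Linear probing, as mentioned in the abstract, generally means fitting a single linear classifier on top of frozen features and reporting its accuracy. The abstract does not describe the exact probing protocol, so the following is only a minimal generic sketch under stated assumptions: the features are synthetic 2D points standing in for frozen per-voxel features, and the names `linear_probe` and `predict` as well as the learning rate and step count are illustrative choices, not taken from SceneDINO.

```python
import numpy as np

def linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Fit one linear layer (softmax regression) on frozen features.

    features: (N, D) array of frozen features (hypothetical stand-in).
    labels:   (N,) integer class ids in [0, num_classes).
    Returns the weight matrix W of shape (D, C) and the bias b of shape (C,).
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad  # only the linear layer is updated
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    """Assign each feature vector to the class with the highest logit."""
    return (features @ W + b).argmax(axis=1)

# Synthetic, well-separated clusters used in place of real frozen features.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])
labels = rng.integers(0, 3, size=300)
features = centers[labels] + rng.normal(scale=0.5, size=(300, 2))

W, b = linear_probe(features, labels, num_classes=3)
accuracy = (predict(features, W, b) == labels).mean()
```

Because the feature extractor stays frozen and only `W` and `b` are trained, the resulting accuracy reflects how linearly separable the classes already are in feature space, which is why this kind of probe is commonly used as a measure of feature quality.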