Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

📅 2025-07-08
🤖 AI Summary
This paper addresses unsupervised single-image Semantic Scene Completion (SSC): jointly inferring dense 3D geometry and semantics from a single RGB image without any semantic or geometric ground-truth annotations. The authors propose SceneDINO, which combines self-supervised visual representation learning (DINO features) with multi-view consistency constraints for fully self-supervised training. A 3D feature distillation step transfers unsupervised 2D semantic knowledge into 3D space, and inference runs in a single feed-forward pass. SceneDINO reaches state-of-the-art segmentation accuracy in both 3D and 2D unsupervised scene understanding. Under linear probing, its 3D features match the segmentation accuracy of a current supervised SSC approach, and the model also demonstrates domain generalization and multi-view consistency.

📝 Abstract
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.

Problem

Research questions and friction points this paper is trying to address.

Infer 3D geometry and semantics from single images without supervision
Achieve unsupervised 3D semantic segmentation using self-supervised learning
Improve domain generalization and multi-view consistency in scene understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

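Trains SSC with multi-view consistency self-supervision only, without semantic or geometric ground truth
Infers 3D geometry and expressive 3D DINO features from a single image in one feed-forward pass
Novel 3D feature distillation that yields unsupervised 3D semantics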
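
The abstract evaluates the learned 3D features by linear probing, i.e., training only a linear classifier on top of frozen features. The sketch below illustrates that general idea and is not the paper's implementation: the features are synthetic stand-ins for per-voxel 3D features, and the names `linear_probe` and `predict`, the standardization step, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np


def linear_probe(features, labels, num_classes, lr=0.2, steps=300):
    """Fit one linear layer (softmax regression) on frozen features.

    The features are never updated; only the weights and bias are trained.
    Features are standardized with training statistics (an assumption made
    here to keep plain gradient descent stable).
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8
    x = (features - mean) / std
    n, d = x.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = x @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (x.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias, mean, std


def predict(features, weights, bias, mean, std):
    """Classify features with the trained linear layer."""
    return (((features - mean) / std) @ weights + bias).argmax(axis=1)


# Synthetic stand-in for frozen per-voxel features (hypothetical data):
# three well-separated class clusters in an 8-dimensional feature space.
rng = np.random.default_rng(0)
num_classes, dim = 3, 8
centers = rng.normal(size=(num_classes, dim)) * 4.0
labels = rng.integers(0, num_classes, size=600)
features = centers[labels] + rng.normal(size=(600, dim))

# Train the probe on the first 400 samples, evaluate on the remaining 200.
weights, bias, mean, std = linear_probe(features[:400], labels[:400], num_classes)
predictions = predict(features[400:], weights, bias, mean, std)
accuracy = (predictions == labels[400:]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the classifier has no capacity beyond a single linear map, high probe accuracy suggests that the frozen features already separate the semantic classes, which is why linear probing is a common way to compare representations.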