🤖 AI Summary
Existing indoor 3D semantic occupancy prediction methods rely heavily on precise camera parameters and large-scale, pixel-accurate 3D annotations, making them costly and impractical to scale. Method: This paper introduces the first fully self-supervised framework trained exclusively on unlabeled indoor internet videos (e.g., YouTube house tours) without any camera parameters. It distills semantic knowledge from a 2D vision foundation model (VFM) into 3D space via superpixel-guided aggregation, enabling end-to-end learning without geometric priors. Contribution/Results: The authors release YouTube-Occ, the first large-scale self-supervised indoor occupancy dataset derived from web videos. Their method achieves state-of-the-art zero-shot transfer performance on NYUv2 and OccScanNet, demonstrating for the first time that high-fidelity 3D semantic occupancy learning is feasible using only uncurated online video, thereby drastically reducing data acquisition and annotation overhead.
📝 Abstract
3D semantic occupancy prediction was long considered to require precise geometric relationships for effective training. In complex indoor environments, however, large-scale data collection with fine-grained annotations is impractical due to cumbersome acquisition setups and privacy concerns. In this paper, we demonstrate that spatially accurate 3D training can be achieved using only indoor Internet data, without any prior knowledge of camera intrinsics or extrinsics. In our framework, we collect a web dataset, YouTube-Occ, comprising house tour videos from YouTube that provide abundant real house scenes for 3D representation learning. Building on this web dataset, we establish a fully self-supervised model that leverages accessible 2D prior knowledge to reach powerful 3D indoor perception. Specifically, we harness the strengths of flourishing vision foundation models, distilling their 2D region-level knowledge into the occupancy network by grouping similar pixels into superpixels. Experimental results show that our method achieves state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet).
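To make the superpixel-guided distillation idea concrete, here is a minimal sketch, not the paper's actual implementation. It assumes a frozen 2D VFM that yields per-pixel features, an occupancy network whose 3D features have been projected or rendered back to image space, and SLIC as the superpixel algorithm; the names `superpixel_pool`, `distillation_loss`, `vfm_features`, and `rendered_features` are illustrative placeholders.

```python
# Sketch of superpixel-guided 2D-to-3D feature distillation (assumptions:
# frozen VFM features, occupancy-network features rendered to image space,
# SLIC superpixels). Not the paper's released code.
import torch
import torch.nn.functional as F
from skimage.segmentation import slic


def superpixel_pool(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average a (C, H, W) feature map over each superpixel in labels (H, W)."""
    c, _, _ = features.shape
    flat = features.reshape(c, -1)                  # (C, H*W)
    idx = labels.reshape(-1)                        # (H*W,)
    n = int(idx.max()) + 1
    pooled = torch.zeros(n, c, device=features.device)
    pooled.index_add_(0, idx, flat.t())             # sum features per superpixel
    counts = torch.bincount(idx, minlength=n).clamp(min=1).unsqueeze(1)
    return pooled / counts                          # (N_superpixels, C)


def distillation_loss(image, vfm_features, rendered_features, n_segments=200):
    """Align occupancy-network features with VFM features at the region level.

    image: (H, W, 3) uint8 numpy array; vfm_features / rendered_features:
    (C, H, W) tensors from the frozen 2D VFM and the occupancy network.
    """
    # Group visually similar pixels into superpixels (SLIC here; the specific
    # grouping algorithm is our assumption for this sketch).
    labels = torch.from_numpy(
        slic(image, n_segments=n_segments, compactness=10, start_label=0)
    ).long().to(vfm_features.device)
    teacher = superpixel_pool(vfm_features.detach(), labels)  # frozen target
    student = superpixel_pool(rendered_features, labels)
    # Cosine-similarity distillation: pull each region's student feature
    # toward the corresponding frozen VFM feature.
    return (1.0 - F.cosine_similarity(student, teacher, dim=1)).mean()
```

Pooling both feature maps over the same superpixels gives region-level targets that are more robust to per-pixel noise than dense pixel-wise distillation, which matches the motivation for grouping similar pixels described above.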