Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

📅 2025-04-09
🤖 AI Summary
To address the limited generalizability of 3D self-supervised features and the difficulty of evaluating them, this paper proposes Masked Scene Modeling (MSM), the first framework whose linear probes on frozen, off-the-shelf features match fully supervised performance. Methodologically, MSM introduces a hierarchical 3D masking objective tailored to multi-scale architectures, combining multi-resolution feature sampling with depth-aware feature masking and reconstruction. It also establishes the first evaluation protocol for the semantic capability of point-level 3D representations, featuring multi-scale, fine-grained benchmarks. On standard benchmarks including ScanObjectNN, MSM’s linear probe exceeds prior self-supervised methods by over 12% in classification accuracy and reaches parity with fully supervised models. This demonstrates substantial improvements in both the transferability and practical utility of 3D self-supervised features, advancing the state of unsupervised representation learning in 3D vision.

📝 Abstract
Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments not only demonstrate that our method achieves competitive performance to supervised models, but also surpasses existing self-supervised approaches by a large margin. The model and training code can be found at our Github repository (https://github.com/phermosilla/msm).
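The abstract's "multi-resolution feature sampling of hierarchical models" can be illustrated with a minimal numpy sketch: for each point, gather its nearest feature vector from every resolution level of a hierarchical backbone and concatenate them into one point-level representation suitable for linear probing. This is a simplified illustration under assumed shapes, not the paper's implementation; the function name and nearest-neighbor interpolation scheme are hypothetical.

```python
import numpy as np

def sample_multires_features(points, level_coords, level_feats):
    """Hypothetical sketch: build point-level representations by
    concatenating, per point, the nearest feature vector from each
    resolution level of a hierarchical (multi-scale) backbone.

    points:       (N, 3) query point coordinates
    level_coords: list of (M_l, 3) cell coordinates per level
    level_feats:  list of (M_l, C_l) features per level
    returns:      (N, sum_l C_l) concatenated features
    """
    per_level = []
    for coords, feats in zip(level_coords, level_feats):
        # nearest-neighbor assignment from each point to the level's cells
        dists = np.linalg.norm(points[:, None, :] - coords[None, :, :], axis=-1)
        nearest = dists.argmin(axis=1)
        per_level.append(feats[nearest])
    return np.concatenate(per_level, axis=1)
```

The concatenated vectors could then be fed directly to a linear classifier or a nearest-neighbor retriever, matching the evaluation protocol the abstract describes.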
Problem

Research questions and friction points this paper is trying to address.

Bridging the performance gap between supervised and self-supervised 3D scene understanding
Evaluating self-supervised feature quality for 3D scenes with a new protocol
Introducing Masked Scene Modeling for native 3D self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Scene Modeling for 3D self-supervised learning
Multi-resolution feature sampling for rich representations
Bottom-up deep feature reconstruction of masked patches
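The masked-reconstruction idea above can be sketched as: randomly mask a subset of scene patches, then train the model to predict deep target features at exactly those masked positions. A minimal numpy illustration follows; the helper names, the mean-squared-error objective, and the random masking scheme are assumptions for illustration, not the paper's exact loss or hierarchical masking procedure.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Hypothetical helper: boolean mask selecting a random subset
    of patches to hide from the model."""
    mask = np.zeros(num_patches, dtype=bool)
    hidden = rng.choice(num_patches, int(num_patches * mask_ratio), replace=False)
    mask[hidden] = True
    return mask

def masked_feature_loss(pred_feats, target_feats, mask):
    """Hypothetical sketch of a masked-feature reconstruction loss:
    mean-squared error between predicted and target deep features,
    computed only over the masked patches."""
    sq_err = (pred_feats - target_feats) ** 2  # (num_patches, C)
    return sq_err[mask].mean()

# Usage sketch: 10 patches with 4-dim deep features, half masked.
rng = np.random.default_rng(0)
mask = random_patch_mask(10, 0.5, rng)
pred = rng.standard_normal((10, 4))
target = rng.standard_normal((10, 4))
loss = masked_feature_loss(pred, target, mask)
```

In the paper's formulation this reconstruction is applied bottom-up across the levels of a hierarchical 3D model; the sketch shows only the single-level core of such an objective.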