🤖 AI Summary
Existing 3D multimodal masked autoencoders (MAEs) require joint input of 2D images and 3D point clouds; however, point clouds inherently encode multi-view geometric information, so explicit 2D supervision is not only inefficient but can also be detrimental to pure 3D geometric representation learning. To address this, we propose 3D-to-Multi-View MAE: a self-supervised framework that takes only masked point clouds as input, encodes them with a point cloud Transformer, and employs a cross-modal decoder to jointly reconstruct the original point cloud and depth maps rendered from multiple poses, achieving for the first time end-to-end 3D-only encoding for multi-view depth rendering. The method integrates geometric structure understanding with cross-view semantic alignment while eliminating any reliance on 2D imagery. Extensive experiments demonstrate state-of-the-art performance on downstream tasks, including classification, few-shot learning, part segmentation, and object detection, with consistent gains exceeding 5% on multiple metrics.
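The input side of the pipeline described above can be sketched as MAE-style patch masking on a point cloud. The snippet below is a minimal illustration, not the paper's implementation: patch centers are drawn by uniform random sampling (where the actual method would typically use farthest point sampling with k-NN grouping), and the function names and parameters are hypothetical.

```python
import numpy as np

def mask_point_patches(points, n_patches=64, mask_ratio=0.6, seed=0):
    """Group a point cloud into patches and randomly mask most of them.

    A schematic sketch of MAE-style input masking: only the visible
    patches would be fed to the point cloud Transformer encoder.
    """
    rng = np.random.default_rng(seed)
    # choose patch centers by random sampling (a stand-in for FPS)
    centers = points[rng.choice(len(points), n_patches, replace=False)]
    # assign every point to its nearest center to form patches
    dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
    assignment = dists.argmin(axis=1)
    # mask a fraction of the patches; the decoder must reconstruct them
    n_masked = int(mask_ratio * n_patches)
    masked_ids = rng.choice(n_patches, n_masked, replace=False)
    visible = points[~np.isin(assignment, masked_ids)]
    return visible, masked_ids

# example: mask a random 1024-point cloud
pts = np.random.default_rng(1).normal(size=(1024, 3))
visible_pts, masked_ids = mask_point_patches(pts)
```

With a 0.6 mask ratio, roughly 60% of the patches (and their points) are hidden from the encoder, which is what makes the reconstruction objective non-trivial.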
📝 Abstract
In recent years, the field of 3D self-supervised learning has witnessed significant progress, including the emergence of multi-modality Masked AutoEncoder (MAE) methods that leverage both 2D images and 3D point clouds for pre-training. However, a notable limitation of these approaches is that they do not fully utilize the multi-view attributes inherent in 3D point clouds, which are crucial for a deeper understanding of 3D structures. Building upon this insight, we introduce a novel 3D to multi-view masked autoencoder that fully harnesses the multi-modal attributes of 3D point clouds. Specifically, our method uses the encoded tokens from 3D masked point clouds to generate the original point clouds and multi-view depth images across various poses. This approach not only enriches the model's comprehension of geometric structures but also leverages the inherent multi-modal properties of point clouds. Our experiments demonstrate the effectiveness of the proposed method across diverse tasks and settings. Remarkably, our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks, including 3D object classification, few-shot learning, part segmentation, and 3D object detection. Code will be available at: https://github.com/Zhimin-C/Multiview-MAE
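The multi-view depth images that serve as reconstruction targets can be obtained by rendering the point cloud from several poses. The sketch below shows one simple way to do this (orthographic projection with a z-buffer after rotating the cloud around the vertical axis); the paper's actual rendering pipeline, resolutions, and pose set may differ, and `render_depth_maps` is a hypothetical name.

```python
import numpy as np

def render_depth_maps(points, n_views=4, res=32):
    """Render orthographic depth maps of a point cloud from several yaw poses.

    A minimal sketch of generating multi-view depth targets for
    pre-training, assuming simple orthographic projection.
    """
    points = points - points.mean(axis=0)          # center the cloud
    points = points / (np.abs(points).max() + 1e-8)  # normalize to [-1, 1]
    maps = []
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views            # rotate around the y-axis
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        p = points @ R.T
        depth = np.full((res, res), np.inf)
        # map x, y in [-1, 1] to pixel indices
        u = np.clip(((p[:, 0] + 1) / 2 * (res - 1)).astype(int), 0, res - 1)
        v = np.clip(((p[:, 1] + 1) / 2 * (res - 1)).astype(int), 0, res - 1)
        for ui, vi, zi in zip(u, v, p[:, 2]):
            if zi < depth[vi, ui]:                 # z-buffer: keep nearest point
                depth[vi, ui] = zi
        depth[np.isinf(depth)] = 0.0               # empty pixels -> background
        maps.append(depth)
    return np.stack(maps)                          # shape (n_views, res, res)

# example: render 4 depth views of a random point cloud
pts = np.random.default_rng(0).normal(size=(1024, 3))
views = render_depth_maps(pts, n_views=4, res=32)
```

During pre-training, the decoder's predicted depth maps would be compared against targets like these, tying the 3D tokens to view-dependent 2D geometry without ever consuming real images.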