Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder

📅 2023-11-17
🏛️ arXiv.org
📈 Citations: 9 · Influential: 0
🤖 AI Summary
Existing 3D multimodal masked autoencoders (MAEs) require joint input of 2D images and 3D point clouds. Yet point clouds already encode multi-view geometric information, so explicit 2D supervision is not only inefficient but also detrimental to pure 3D geometric representation learning. To address this, the paper proposes a 3D-to-multi-view MAE: a self-supervised framework that takes only masked point clouds as input, encodes them with a point cloud Transformer, and uses a cross-modal decoder to jointly reconstruct the original point cloud and depth maps rendered from multiple poses, achieving what the authors present as the first end-to-end 3D-only encoding for multi-view depth rendering. The approach couples geometric structure understanding with cross-view semantic alignment while removing any reliance on 2D imagery, and it reports state-of-the-art results on downstream classification, few-shot learning, part segmentation, and object detection, with gains exceeding 5% on several metrics.
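To make the pipeline concrete, here is a minimal PyTorch sketch of the idea described above: patch tokens from a masked point cloud pass through a Transformer encoder, and a shared decoder reconstructs both the point patches and a stack of multi-view depth maps. All class and parameter names (MultiViewMAE, patch sizes, view counts, head designs) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultiViewMAE(nn.Module):
    """Hypothetical sketch: masked point-patch tokens -> Transformer encoder
    -> shared decoder with heads for point and multi-view depth reconstruction."""
    def __init__(self, num_patches=64, patch_points=32, dim=384,
                 num_views=8, depth_hw=(64, 64), mask_ratio=0.6):
        super().__init__()
        self.mask_ratio, self.dim = mask_ratio, dim
        # Patch embedding: each patch of 32 xyz points becomes one token.
        self.patch_embed = nn.Linear(patch_points * 3, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=6)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=2)
        # Two reconstruction heads: 3D point patches and per-view depth maps.
        self.point_head = nn.Linear(dim, patch_points * 3)
        h, w = depth_hw
        self.depth_head = nn.Linear(dim * num_patches, num_views * h * w)
        self.num_views, self.depth_hw = num_views, depth_hw

    def forward(self, patches):                    # (B, P, patch_points*3)
        B, P, _ = patches.shape
        tokens = self.patch_embed(patches) + self.pos_embed
        # Random masking: encode only a visible subset, as in standard MAE.
        n_keep = int(P * (1 - self.mask_ratio))
        keep = torch.rand(B, P, device=patches.device).argsort(dim=1)[:, :n_keep]
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, self.dim))
        latent = self.encoder(visible)
        # Re-insert mask tokens at masked positions, then decode jointly.
        full = self.mask_token.expand(B, P, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, self.dim), latent)
        dec = self.decoder(full + self.pos_embed)
        points = self.point_head(dec)              # (B, P, patch_points*3)
        depth = self.depth_head(dec.flatten(1))    # (B, V*H*W)
        return points, depth.view(B, self.num_views, *self.depth_hw)

model = MultiViewMAE()
pts, depths = model(torch.randn(2, 64, 32 * 3))
print(pts.shape, depths.shape)  # torch.Size([2, 64, 96]) torch.Size([2, 8, 64, 64])
```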
📝 Abstract
In recent years, the field of 3D self-supervised learning has witnessed significant progress, resulting in the emergence of Multi-Modality Masked AutoEncoders (MAE) methods that leverage both 2D images and 3D point clouds for pre-training. However, a notable limitation of these approaches is that they do not fully utilize the multi-view attributes inherent in 3D point clouds, which are crucial for a deeper understanding of 3D structures. Building upon this insight, we introduce a novel approach employing a 3D to multi-view masked autoencoder to fully harness the multi-modal attributes of 3D point clouds. To be specific, our method uses the encoded tokens from 3D masked point clouds to generate original point clouds and multi-view depth images across various poses. This approach not only enriches the model's comprehension of geometric structures but also leverages the inherent multi-modal properties of point clouds. Our experiments illustrate the effectiveness of the proposed method for different tasks and under different settings. Remarkably, our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks, including 3D object classification, few-shot learning, part segmentation, and 3D object detection. Code will be available at: https://github.com/Zhimin-C/Multiview-MAE
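The multi-view depth targets the abstract mentions can be produced by rendering the input cloud from several camera poses. Below is a hedged NumPy sketch of one simple way to do this (orthographic projection plus a z-buffer); the paper's actual renderer and pose sampling may differ, and render_depth is a hypothetical helper.

```python
import numpy as np

def render_depth(points, yaw, pitch=0.3, res=64):
    """Hypothetical target generation: rotate the cloud into a camera pose,
    orthographically project onto a res x res grid, and z-buffer the depths."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw around y
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch around x
    p = points @ (Rx @ Ry).T                                # (N, 3) in camera frame
    xy = (p[:, :2] - p[:, :2].min(0)) / (np.ptp(p[:, :2], 0) + 1e-8)  # to [0, 1]
    uv = np.clip((xy * (res - 1)).astype(int), 0, res - 1)  # pixel coordinates
    z = p[:, 2] - p[:, 2].min() + 1.0                       # positive depth values
    depth = np.full((res, res), np.inf)
    for (u, v), d in zip(uv, z):
        depth[v, u] = min(depth[v, u], d)                   # keep the closest point
    depth[np.isinf(depth)] = 0.0                            # 0 marks empty pixels
    return depth

cloud = np.random.randn(2048, 3)
views = [render_depth(cloud, yaw)
         for yaw in np.linspace(0, 2 * np.pi, 8, endpoint=False)]
print(views[0].shape)  # (64, 64)
```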
Problem

Research questions and friction points this paper is trying to address.

Inefficient use of 2D and 3D modalities in self-supervised learning
Reconstruction learning overly relies on visible 2D information
Need for better 3D geometric representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D to Multi-View Learner for point clouds
Multi-scale multi-head attention mechanism (a hedged sketch follows this list)
Two-stage self-training for 2D-3D alignment
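As a rough illustration of the multi-scale attention idea named above, the sketch below runs standard multi-head attention with keys and values pooled to several scales and fuses the results. This is an assumption about what "multi-scale multi-head attention" could look like, not the paper's definition.

```python
import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    """Hypothetical sketch: attend from full-resolution queries to keys/values
    average-pooled at several scales, then fuse the per-scale outputs."""
    def __init__(self, dim=384, heads=6, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in scales)
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, x):                      # x: (B, N, C)
        outs = []
        for s, attn in zip(self.scales, self.attn):
            # Pool keys/values to a coarser scale; queries stay full-resolution.
            kv = x if s == 1 else nn.functional.avg_pool1d(
                x.transpose(1, 2), kernel_size=s, stride=s).transpose(1, 2)
            out, _ = attn(x, kv, kv)           # cross-scale attention
            outs.append(out)
        return self.fuse(torch.cat(outs, dim=-1))

x = torch.randn(2, 64, 384)
print(MultiScaleAttention()(x).shape)  # torch.Size([2, 64, 384])
```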
👥 Authors
Zhimin Chen (Clemson University)
Yingwei Li (Research Scientist, Waymo · Computer Vision)
Longlong Jing (The City University of New York)
Liang Yang (The City University of New York)
Bing Li (Clemson University)