🤖 AI Summary
Existing 3D multimodal masked autoencoders (MAEs) require joint input of 2D images and 3D point clouds; however, point clouds inherently encode multi-view geometric information, so explicit 2D supervision is not only inefficient but can also be detrimental to pure 3D geometric representation learning. To address this, we propose 3D-to-Multi-View MAE: a self-supervised framework that takes only masked point clouds as input, encodes them with a point cloud Transformer, and employs a cross-modal decoder to jointly reconstruct the original point cloud and depth maps rendered from multiple poses, achieving for the first time end-to-end 3D-only encoding for multi-view depth rendering. The method integrates geometric structure understanding with cross-view semantic alignment while eliminating any reliance on 2D imagery. Extensive experiments demonstrate state-of-the-art performance on downstream tasks, including classification, few-shot learning, part segmentation, and object detection, with consistent gains exceeding 5% on multiple metrics.
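The input side of the pipeline described above can be sketched as MAE-style patch masking on a point cloud. The snippet below is a minimal illustration, not the paper's implementation: patch centers are drawn by uniform random sampling (where the actual method would typically use farthest point sampling with k-NN grouping), and the function names and parameters are hypothetical.

```python
import numpy as np

def mask_point_patches(points, n_patches=64, mask_ratio=0.6, seed=0):
    """Group a point cloud into patches and randomly mask most of them.

    A schematic sketch of MAE-style input masking: only the visible
    patches would be fed to the point cloud Transformer encoder.
    """
    rng = np.random.default_rng(seed)
    # choose patch centers by random sampling (a stand-in for FPS)
    centers = points[rng.choice(len(points), n_patches, replace=False)]
    # assign every point to its nearest center to form patches
    dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
    assignment = dists.argmin(axis=1)
    # mask a fraction of the patches; the decoder must reconstruct them
    n_masked = int(mask_ratio * n_patches)
    masked_ids = rng.choice(n_patches, n_masked, replace=False)
    visible = points[~np.isin(assignment, masked_ids)]
    return visible, masked_ids

# example: mask a random 1024-point cloud
pts = np.random.default_rng(1).normal(size=(1024, 3))
visible_pts, masked_ids = mask_point_patches(pts)
```

With a 0.6 mask ratio, roughly 60% of the patches (and their points) are hidden from the encoder, which is what makes the reconstruction objective non-trivial.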
📝 Abstract
In recent years, the field of 3D self-supervised learning has witnessed significant progress, including the emergence of multi-modality Masked AutoEncoder (MAE) methods that leverage both 2D images and 3D point clouds for pre-training. However, a notable limitation of these approaches is that they do not fully utilize the multi-view attributes inherent in 3D point clouds, which are crucial for a deeper understanding of 3D structures. Building upon this insight, we introduce a novel 3D to multi-view masked autoencoder that fully harnesses the multi-modal attributes of 3D point clouds. Specifically, our method uses the encoded tokens from 3D masked point clouds to generate the original point clouds and multi-view depth images across various poses. This approach not only enriches the model's comprehension of geometric structures but also leverages the inherent multi-modal properties of point clouds. Our experiments demonstrate the effectiveness of the proposed method across diverse tasks and settings. Remarkably, our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks, including 3D object classification, few-shot learning, part segmentation, and 3D object detection. Code will be available at: https://github.com/Zhimin-C/Multiview-MAE
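The multi-view depth images that serve as reconstruction targets can be obtained by rendering the point cloud from several poses. The sketch below shows one simple way to do this (orthographic projection with a z-buffer after rotating the cloud around the vertical axis); the paper's actual rendering pipeline, resolutions, and pose set may differ, and `render_depth_maps` is a hypothetical name.

```python
import numpy as np

def render_depth_maps(points, n_views=4, res=32):
    """Render orthographic depth maps of a point cloud from several yaw poses.

    A minimal sketch of generating multi-view depth targets for
    pre-training, assuming simple orthographic projection.
    """
    points = points - points.mean(axis=0)          # center the cloud
    points = points / (np.abs(points).max() + 1e-8)  # normalize to [-1, 1]
    maps = []
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views            # rotate around the y-axis
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        p = points @ R.T
        depth = np.full((res, res), np.inf)
        # map x, y in [-1, 1] to pixel indices
        u = np.clip(((p[:, 0] + 1) / 2 * (res - 1)).astype(int), 0, res - 1)
        v = np.clip(((p[:, 1] + 1) / 2 * (res - 1)).astype(int), 0, res - 1)
        for ui, vi, zi in zip(u, v, p[:, 2]):
            if zi < depth[vi, ui]:                 # z-buffer: keep nearest point
                depth[vi, ui] = zi
        depth[np.isinf(depth)] = 0.0               # empty pixels -> background
        maps.append(depth)
    return np.stack(maps)                          # shape (n_views, res, res)

# example: render 4 depth views of a random point cloud
pts = np.random.default_rng(0).normal(size=(1024, 3))
views = render_depth_maps(pts, n_views=4, res=32)
```

During pre-training, the decoder's predicted depth maps would be compared against targets like these, tying the 3D tokens to view-dependent 2D geometry without ever consuming real images.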