P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders

📅 2024-08-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
This paper targets two bottlenecks in 3D perception: the scarcity of real point clouds, which hinders scalable pre-training, and the high time complexity of existing token embedding methods (e.g., O(n²) for k-NN). It proposes a hybrid self-supervised pre-training framework with three components. First, a pseudo-point-cloud generation strategy uses a large-scale depth estimation model to lift images into pseudo-3D data. Second, a learnable token embedding with linear time complexity O(n) removes the computationally expensive geometric neighborhood search. Third, a 3D masked autoencoder is trained with 2D image reconstruction as the proxy task. Evaluated on benchmarks including ModelNet40, the method achieves state-of-the-art performance in 3D classification and few-shot learning. It accelerates pre-training by 3.2× and speeds up downstream fine-tuning convergence by 41%, improving both data and computational efficiency.
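The pseudo-point-cloud step can be illustrated with a minimal sketch. It assumes a pinhole camera model and a depth map that has already been predicted by some depth estimator; the function name `lift_depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative choices, not details taken from the paper.

```python
import numpy as np

def lift_depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Assumes a pinhole camera; fx, fy, cx, cy are hypothetical intrinsics.
    """
    h, w = depth.shape
    # Pixel grids: u indexes columns, v indexes rows, both of shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Keep only pixels with finite, positive depth.
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    return points[valid]

# Toy usage: a 2x3 depth map with one invalid (zero-depth) pixel.
depth = np.full((2, 3), 2.0)
depth[1, 2] = 0.0
cloud = lift_depth_to_points(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

Each valid pixel yields one 3D point, so an image with a predicted depth map becomes a pseudo point cloud without any 3D sensor.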

📝 Abstract
3D pre-training is crucial to 3D perception tasks. However, limited by the difficulty of collecting clean 3D data, 3D pre-training has consistently faced data scaling challenges. Inspired by semi-supervised learning, which leverages limited labeled data together with a large amount of unlabeled data, we propose a novel self-supervised pre-training framework that utilizes real 3D data and pseudo-3D data lifted from images by a large depth estimation model. Another challenge lies in efficiency. Previous methods such as Point-BERT and Point-MAE employ k nearest neighbors to embed 3D tokens, which requires quadratic time complexity. To pre-train efficiently on such a large amount of data, we propose a linear-time-complexity token embedding strategy and a training-efficient 2D reconstruction target. Our method achieves state-of-the-art performance in 3D classification and few-shot learning while maintaining high pre-training and downstream fine-tuning efficiency.
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of clean 3D data for pre-training.
Improves efficiency of 3D token embedding with linear complexity.
Enhances 3D perception tasks using pseudo-3D data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pre-training with pseudo-3D data
Linear time complexity token embedding strategy
Efficient 2D reconstruction target for training
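
To make the linear-time idea concrete, here is a minimal sketch of one way to group points into tokens in a single pass. It is an assumption for illustration, not the paper's embedding: points are hashed into a voxel grid and each occupied voxel's points are averaged, whereas the paper describes a learnable embedding. The name `voxel_tokens` and the `voxel_size` parameter are hypothetical.

```python
import numpy as np

def voxel_tokens(points, voxel_size):
    """Group (N, 3) points into voxels with one hash-map pass.

    Returns (voxel_coords, token_features), where each feature is the mean
    of the points in that voxel. The mean is a stand-in for a learnable
    embedding and is used here only for illustration.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    buckets = {}  # voxel coordinate -> [running sum, point count]
    for key, p in zip(map(tuple, coords.tolist()), points):
        if key in buckets:
            buckets[key][0] += p
            buckets[key][1] += 1
        else:
            # astype returns a copy, so the input array is never mutated.
            buckets[key] = [p.astype(np.float64), 1]
    keys = np.array(list(buckets.keys()), dtype=np.int64)
    feats = np.stack([s / c for s, c in buckets.values()])
    return keys, feats

# Toy usage: two points share a voxel, a third falls in its neighbor.
pts = np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3], [1.2, 0.0, 0.0]])
keys, feats = voxel_tokens(pts, voxel_size=1.0)
```

Every point is visited once and each dictionary lookup takes expected constant time, so the grouping costs expected O(n). A brute-force k-NN search, by contrast, compares every point against every other point.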