🤖 AI Summary
To address the challenge of poor representation generalization in 3D foundation model pretraining—stemming from strong data heterogeneity and diverse downstream tasks—this paper proposes the first differentiable neural rendering–driven pretraining paradigm for generic 3D foundation models. Methodologically, the authors design a unified 3D encoder and use differentiable voxel-based neural rendering as a self-supervised signal: rendered images are compared against real images, enabling joint optimization across hierarchical levels, scenes, and tasks (e.g., detection, segmentation, reconstruction, and image synthesis). The framework further supports joint indoor–outdoor training. Experiments show state-of-the-art performance on 11 mainstream 3D benchmarks, significantly outperforming conventional 2D- and 3D-based pretraining approaches, and the method also transfers effectively to 2D backbone pretraining. Code and pretrained models are publicly available.
📝 Abstract
In contrast to the many NLP and 2D vision foundation models, learning a 3D foundation model poses considerably greater challenges, primarily due to the inherent variability of 3D data and the diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations, thereby establishing a pathway to 3D foundation models. Observing that informative 3D features should encode rich geometry and appearance cues sufficient to render realistic images, we propose to learn 3D representations via differentiable neural rendering. We train a 3D backbone with a devised volumetric neural renderer by comparing rendered images against real ones. Notably, our approach seamlessly integrates the learned 3D encoder into various downstream tasks, encompassing not only high-level challenges such as 3D detection and segmentation but also low-level objectives like 3D reconstruction and image synthesis, spanning both indoor and outdoor scenarios. In addition, we show that the same methodology can pre-train a 2D backbone, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness. Code and models are available at https://github.com/OpenGVLab/PonderV2.
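The self-supervised signal described above—rendering images from 3D features and penalizing the difference from real images—boils down to a standard volume-rendering photometric loss. Below is a minimal NumPy sketch of that idea; the function names and shapes are illustrative assumptions, not the PonderV2 implementation, and the per-sample densities and colors would in practice come from the learned 3D encoder rather than be given directly.

```python
import numpy as np

def composite_rays(densities, colors, deltas):
    """Alpha-composite per-sample densities and colors along each ray.

    densities: (R, S) non-negative volume densities at S samples per ray
    colors:    (R, S, 3) RGB predicted at each sample
    deltas:    (R, S) distances between consecutive samples
    Returns rendered RGB (R, 3) and accumulated opacity (R,).
    """
    # Opacity contributed by each sample (standard volume rendering).
    alphas = 1.0 - np.exp(-densities * deltas)                      # (R, S)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(1.0 - alphas + 1e-10, axis=1)
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)
    weights = alphas * trans                                        # (R, S)
    rgb = (weights[..., None] * colors).sum(axis=1)                 # (R, 3)
    return rgb, weights.sum(axis=1)

def photometric_loss(rendered, target):
    """Mean squared error between rendered and real pixel colors."""
    return float(np.mean((rendered - target) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R, S = 4, 16  # 4 rays, 16 samples each (toy sizes)
    densities = rng.uniform(0.0, 2.0, (R, S))
    colors = rng.uniform(0.0, 1.0, (R, S, 3))
    deltas = np.full((R, S), 0.1)
    rgb, opacity = composite_rays(densities, colors, deltas)
    target = rng.uniform(0.0, 1.0, (R, 3))  # stand-in for real pixels
    print(photometric_loss(rgb, target))
```

In the actual framework this loss would be backpropagated through the renderer into the 3D backbone, so that features which render realistic images are rewarded.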