🤖 AI Summary
To address the scarcity of annotated 3D point clouds, the high cost of acquisition, and copyright restrictions on real-world scans, this paper proposes a self-supervised representation learning paradigm that requires neither manual annotations nor real-world data. Instead, it pretrains models exclusively on procedurally generated, semantics-free 3D shapes constructed from elementary geometric primitives and rigid transformations. The authors provide empirical evidence that purely procedural data suffices to learn geometric representations with strong generalization, matching methods trained on semantically recognizable 3D models (e.g., airplanes, chairs). Their approach integrates procedural modeling, contrastive learning, PointNet++-based point cloud encoders, and masked reconstruction pretraining, and performs on par with state-of-the-art methods on downstream tasks including shape classification, part segmentation, and masked point cloud completion. These findings suggest that current self-supervised 3D learning primarily captures low-level geometric structure rather than high-level semantics.
📝 Abstract
Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations. Remarkably, despite lacking semantic content, the 3D representations learned from this synthesized dataset perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, including shape classification, part segmentation, and masked point cloud completion. Our analysis further suggests that current self-supervised learning methods primarily capture geometric structures rather than high-level semantics.
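The paper does not spell out its generator here, but the core idea of a "procedural 3D program" can be illustrated with a minimal sketch: sample a few geometric primitives, scatter points on their surfaces, and compose them with random rigid transformations into a semantics-free point cloud. All function names, primitive choices, and parameters below are illustrative assumptions, not the authors' actual pipeline.

```python
import math
import random

def sample_sphere(n):
    """Uniformly sample n points on the surface of a unit sphere."""
    pts = []
    for _ in range(n):
        z = random.uniform(-1.0, 1.0)                 # uniform in z gives uniform area
        theta = random.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(1.0 - z * z)
        pts.append((r * math.cos(theta), r * math.sin(theta), z))
    return pts

def sample_box(n):
    """Sample n points on the surface of a unit cube centered at the origin."""
    pts = []
    for _ in range(n):
        axis = random.randrange(3)                    # which pair of faces
        side = random.choice((-0.5, 0.5))             # which face of that pair
        p = [random.uniform(-0.5, 0.5) for _ in range(3)]
        p[axis] = side
        pts.append(tuple(p))
    return pts

def random_rigid_transform(pts):
    """Apply a random rotation about the z-axis and a random translation."""
    a = random.uniform(0.0, 2.0 * math.pi)
    ca, sa = math.cos(a), math.sin(a)
    tx, ty, tz = (random.uniform(-1.0, 1.0) for _ in range(3))
    return [(ca * x - sa * y + tx, sa * x + ca * y + ty, z + tz)
            for (x, y, z) in pts]

def procedural_shape(n_points=1024, n_primitives=4):
    """Compose a semantics-free shape from randomly placed primitives."""
    per = n_points // n_primitives
    cloud = []
    for _ in range(n_primitives):
        prim = random.choice((sample_sphere, sample_box))
        cloud.extend(random_rigid_transform(prim(per)))
    return cloud

cloud = procedural_shape()
print(len(cloud))  # 1024 points, with no recognizable object category
```

Shapes produced this way carry no semantic label by construction, which is what lets the paper attribute any downstream transfer purely to learned geometric structure.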