CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of joint modeling of the semantic–geometric complementarity between images and point clouds, and the heavy reliance on large-scale annotated data in multimodal 3D perception, this paper proposes a joint unsupervised differentiable-rendering-based pre-training framework for images and point clouds. Methodologically, it introduces a curvature-guided sparse sampling strategy and a learnable prototype mechanism that align high-level image semantics with 3D point-cloud geometry in a common feature space. By combining EM-based optimization, a prototype-swapping prediction loss, and Gram-matrix regularization, the approach enables effective cross-modal joint representation learning. Evaluated on NuScenes and Waymo, the method achieves up to 100% more performance gain than previous state-of-the-art pre-training methods on downstream 3D detection and segmentation tasks, while substantially reducing annotation dependency.
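The EM-based optimization mentioned above associates embeddings from each modality with learnable prototypes. A minimal numpy caricature of such an E/M loop is below (soft k-means with farthest-point initialization; every detail here is an assumption for illustration, since CLAP's actual E and M steps operate on learned network features, not raw coordinates):

```python
import numpy as np

def em_prototypes(emb, k, iters=10, t=0.1):
    """Toy EM loop for learnable prototypes (soft k-means flavor).
    Illustrative stand-in for CLAP's EM scheme; temperature t and
    farthest-point initialization are assumptions, not the paper's."""
    # Farthest-point initialization keeps the initial prototypes spread out
    protos = [emb[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(emb - p, axis=1) for p in protos], axis=0)
        protos.append(emb[np.argmax(d)])
    protos = np.array(protos)
    for _ in range(iters):
        # E-step: soft-assign embeddings to prototypes (cosine / temperature)
        z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
        logits = (z @ p.T) / t
        q = np.exp(logits - logits.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: prototypes become assignment-weighted embedding means
        protos = (q.T @ emb) / q.sum(axis=0)[:, None]
    return protos, q
```

On two well-separated clusters this loop recovers one prototype per cluster; in the paper, the same alternation is what ties image and point-cloud embeddings to a shared prototype vocabulary.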

📝 Abstract
Unsupervised 3D representation learning reduces the burden of labeling multimodal 3D data for fusion perception tasks. Among different pre-training paradigms, differentiable-rendering-based methods have shown the most promise. However, existing works conduct pre-training separately for each modality due to the computational cost of processing large point clouds with images. As such, the mutual benefit of high-level semantics (from images) and 3D structure (from point clouds) has not been exploited. To address this gap, we propose a joint unsupervised differentiable-rendering-based pre-training method for images and point clouds, termed CLAP, short for Curvature sampLing and leArnable Prototype. Specifically, our method overcomes the computational hurdle with Curvature Sampling, which selects the most informative points/pixels for pre-training. To uncover the performance benefits brought by their complementarity, we propose to use learnable prototypes to represent parts of the 3D scenes in a common feature space, together with an Expectation-Maximization training scheme that associates the embeddings of each modality with prototypes. We further propose a swapped prediction loss that explores their interplay through prototypes, along with a Gram Matrix Regularization term to maintain training stability. Experiments on the NuScenes and Waymo datasets show that CLAP achieves up to 100% more performance gain compared to previous SOTA pre-training methods. Codes and models will be released.
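The Curvature Sampling idea — keeping only the most geometrically informative points — can be sketched with a standard PCA "surface variation" curvature proxy. This is an illustrative stand-in, not CLAP's actual estimator; the neighborhood size `k`, the score formula, and the top-m selection rule are all assumptions:

```python
import numpy as np

def curvature_scores(points, k=8):
    """Per-point curvature proxy: surface variation
    lambda_min / (lambda_1 + lambda_2 + lambda_3) of the local
    k-NN covariance (~0 on planes, larger on curved/edge regions)."""
    scores = np.empty(len(points))
    for i in range(len(points)):
        d = np.linalg.norm(points - points[i], axis=1)
        nbrs = points[np.argsort(d)[:k]]        # brute-force k-NN patch
        cov = np.cov(nbrs.T)                    # 3x3 local covariance
        eig = np.linalg.eigvalsh(cov)           # ascending eigenvalues
        scores[i] = eig[0] / max(eig.sum(), 1e-12)
    return scores

def curvature_sample(points, m, k=8):
    """Keep the m highest-curvature (most informative) points."""
    idx = np.argsort(-curvature_scores(points, k))[:m]
    return points[idx], idx
```

The point of such a filter in this context is that feeding only high-curvature points (and their pixels) to the renderer makes joint image/point-cloud pre-training affordable.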
Problem

Research questions and friction points this paper is trying to address.

Heavy reliance on large-scale annotated multimodal 3D data for fusion perception.
Existing differentiable-rendering pre-training handles each modality separately, owing to the computational cost of processing large point clouds together with images.
The complementarity between high-level image semantics and point-cloud geometry is left unexploited.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint unsupervised differentiable-rendering-based pre-training for images and point clouds
Curvature Sampling selects informative points/pixels to keep joint pre-training tractable
Learnable prototypes represent parts of 3D scenes in a common feature space, associated with each modality via EM training
Swapped prediction loss exploits cross-modal interplay; Gram Matrix Regularization stabilizes training
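The prototype machinery above can be caricatured with a SwAV-style swapped prediction loss plus a Gram-matrix regularizer. This is a hedged numpy sketch, not CLAP's implementation: the argmax targets stand in for assignments the paper obtains via its EM scheme, and the temperature and regularizer form are assumptions:

```python
import numpy as np

def softmax(x, t=0.1):
    z = x / t
    z -= z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def swapped_prediction_loss(img_emb, pc_emb, prototypes, t=0.1):
    """Each modality's prototype assignment supervises the other:
    the image branch must predict the point cloud's assignment
    and vice versa (cross-entropy over prototype similarities)."""
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    zi = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    zp = pc_emb / np.linalg.norm(pc_emb, axis=1, keepdims=True)
    pi, pp = softmax(zi @ P.T, t), softmax(zp @ P.T, t)
    qi, qp = pi.argmax(axis=1), pp.argmax(axis=1)  # EM stand-in targets
    n = np.arange(len(zi))
    return -(np.log(pi[n, qp] + 1e-12).mean()
             + np.log(pp[n, qi] + 1e-12).mean()) / 2

def gram_regularizer(prototypes):
    """Gram-matrix regularization: push normalized prototypes
    toward orthonormality so they stay distinct during training."""
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    G = P @ P.T
    return np.linalg.norm(G - np.eye(len(P))) ** 2
```

When image and point-cloud embeddings agree on prototype assignments the swapped loss is near zero; the regularizer vanishes only for mutually orthogonal prototypes, which is the stability role the abstract assigns to the Gram term.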