Generative-Contrastive Learning for Self-Supervised Latent Representations of 3D Shapes from Multi-Modal Euclidean Input

📅 2023-01-11
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Contrastive learning in self-supervised 3D voxel shape representation learning often suffers from latent representation collapse. Method: We propose a generative-contrastive joint learning framework featuring a dual-branch encoder—processing voxels and multi-view images separately—coupled with a shared decoder and a switching training mechanism. To mitigate representation degradation, we introduce randomized stop-gradient operations. The framework jointly optimizes cross-modal contrastive loss and voxel reconstruction loss to achieve implicit feature alignment. Results: Experiments demonstrate substantial improvements over pure contrastive baselines on downstream classification and reconstruction tasks. Our approach effectively alleviates representation collapse, enhancing both discriminability and geometric fidelity of multimodal representations. It establishes a scalable, bi-modal collaborative paradigm for 3D self-supervised learning.
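The joint objective described above (cross-modal contrastive loss plus voxel reconstruction loss) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the InfoNCE-style formulation, the `temperature` and `alpha` weights, and the toy dimensions are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a, b):
    # row-wise cosine similarity matrix between two latent batches
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def joint_loss(z_voxel, z_views, voxels, recon, temperature=0.1, alpha=1.0):
    """Cross-modal contrastive (InfoNCE-style) loss plus voxel reconstruction MSE."""
    logits = cosine_sim(z_voxel, z_views) / temperature
    n = logits.shape[0]
    # log-softmax over each row; positives sit on the diagonal
    # (voxel latent i should match the multi-view latent of shape i)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = -np.mean(log_probs[np.arange(n), np.arange(n)])
    reconstruction = np.mean((recon - voxels) ** 2)
    return contrastive + alpha * reconstruction

# toy batch: 4 shapes, 16-dim latents, 8-voxel flattened grids
z_vox = rng.normal(size=(4, 16))
z_img = rng.normal(size=(4, 16))
vox = rng.random(size=(4, 8))
rec = rng.random(size=(4, 8))
loss = joint_loss(z_vox, z_img, vox, rec)
print(float(loss))
```

The reconstruction term anchors the latent space to shape geometry, so collapsing all latents to a single point no longer minimizes the total objective.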
📝 Abstract
We propose a combined generative and contrastive neural architecture for learning latent representations of 3D volumetric shapes. The architecture uses two encoder branches for voxel grids and multi-view images from the same underlying shape. The main idea is to combine a contrastive loss between the resulting latent representations with an additional reconstruction loss. That helps to avoid collapsing the latent representations as a trivial solution for minimizing the contrastive loss. A novel switching scheme is used to cross-train two encoders with a shared decoder. The switching scheme also enables the stop gradient operation on a random branch. Further classification experiments show that the latent representations learned with our self-supervised method integrate more useful information from the additional input data implicitly, thus leading to better reconstruction and classification performance.
Problem

Research questions and friction points this paper is trying to address.

Learning latent representations of 3D volumetric shapes
Avoiding trivial solutions in contrastive loss minimization
Integrating multi-modal input for better reconstruction and classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines generative (reconstruction) and contrastive objectives in one neural architecture
Uses a switching scheme to cross-train the two encoder branches with a shared decoder
Applies the stop-gradient operation to a randomly chosen branch
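The switching idea in the bullets above can be sketched as follows: at each step one branch is chosen at random to act as a fixed (stop-gradient) contrastive target while the other keeps receiving gradients. This is a hypothetical sketch of the scheme, not the authors' code; `stop_grad` stands in for `Tensor.detach()` in an autograd framework.

```python
import random

def stop_grad(z):
    # placeholder: in an autograd framework this would be z.detach(),
    # i.e. the same values with gradient flow cut off
    return list(z)

def switching_step(z_voxel, z_views, rng):
    """Randomly freeze one branch's latent as the contrastive target;
    the other branch stays trainable for this step."""
    if rng.random() < 0.5:
        return stop_grad(z_voxel), z_views, "voxel_frozen"
    return stop_grad(z_views), z_voxel, "views_frozen"

rng = random.Random(42)
choices = [switching_step([0.1, 0.2], [0.3, 0.4], rng)[2] for _ in range(1000)]
frozen_voxel = choices.count("voxel_frozen")
print(frozen_voxel)  # roughly half of the 1000 steps
```

Randomizing which branch is detached keeps either encoder from degenerating into a passive target, in contrast to schemes that always stop gradients on the same side.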
Chengzhi Wu
Vision and Fusion Laboratory (IES), Karlsruhe Institute of Technology
computer vision
Julius Pfrommer
Head of Department, Fraunhofer IOSB
Automation · Optimization · Machine Learning · Industrie 4.0
Mingyuan Zhou
Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany
J. Beyerer
Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB, Germany