Generative-Contrastive Learning for Self-Supervised Latent Representations of 3D Shapes from Multi-Modal Euclidean Input

📅 2023-01-11
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Contrastive learning in self-supervised 3D voxel shape representation learning often suffers from latent representation collapse. Method: We propose a generative-contrastive joint learning framework featuring a dual-branch encoder—processing voxels and multi-view images separately—coupled with a shared decoder and a switching training mechanism. To mitigate representation degradation, we introduce randomized stop-gradient operations. The framework jointly optimizes cross-modal contrastive loss and voxel reconstruction loss to achieve implicit feature alignment. Results: Experiments demonstrate substantial improvements over pure contrastive baselines on downstream classification and reconstruction tasks. Our approach effectively alleviates representation collapse, enhancing both discriminability and geometric fidelity of multimodal representations. It establishes a scalable, bi-modal collaborative paradigm for 3D self-supervised learning.
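The joint objective described above (cross-modal contrastive loss plus voxel reconstruction loss) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the InfoNCE-style formulation, the `temperature` and `alpha` weights, and the toy dimensions are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a, b):
    # row-wise cosine similarity matrix between two latent batches
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def joint_loss(z_voxel, z_views, voxels, recon, temperature=0.1, alpha=1.0):
    """Cross-modal contrastive (InfoNCE-style) loss plus voxel reconstruction MSE."""
    logits = cosine_sim(z_voxel, z_views) / temperature
    n = logits.shape[0]
    # log-softmax over each row; positives sit on the diagonal
    # (voxel latent i should match the multi-view latent of shape i)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = -np.mean(log_probs[np.arange(n), np.arange(n)])
    reconstruction = np.mean((recon - voxels) ** 2)
    return contrastive + alpha * reconstruction

# toy batch: 4 shapes, 16-dim latents, 8-voxel flattened grids
z_vox = rng.normal(size=(4, 16))
z_img = rng.normal(size=(4, 16))
vox = rng.random(size=(4, 8))
rec = rng.random(size=(4, 8))
loss = joint_loss(z_vox, z_img, vox, rec)
print(float(loss))
```

The reconstruction term anchors the latent space to shape geometry, so collapsing all latents to a single point no longer minimizes the total objective.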
📝 Abstract
We propose a combined generative and contrastive neural architecture for learning latent representations of 3D volumetric shapes. The architecture uses two encoder branches for voxel grids and multi-view images from the same underlying shape. The main idea is to combine a contrastive loss between the resulting latent representations with an additional reconstruction loss. That helps to avoid collapsing the latent representations as a trivial solution for minimizing the contrastive loss. A novel switching scheme is used to cross-train two encoders with a shared decoder. The switching scheme also enables the stop gradient operation on a random branch. Further classification experiments show that the latent representations learned with our self-supervised method integrate more useful information from the additional input data implicitly, thus leading to better reconstruction and classification performance.
Problem

Research questions and friction points this paper is trying to address.

Learning latent representations of 3D volumetric shapes
Avoiding trivial solutions in contrastive loss minimization
Integrating multi-modal input for better reconstruction and classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines generative (reconstruction) and contrastive objectives in one neural architecture
Uses a switching scheme to cross-train the two encoder branches with a shared decoder
Applies the stop-gradient operation to a randomly chosen branch
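The switching idea in the bullets above can be sketched as follows: at each step one branch is chosen at random to act as a fixed (stop-gradient) contrastive target while the other keeps receiving gradients. This is a hypothetical sketch of the scheme, not the authors' code; `stop_grad` stands in for `Tensor.detach()` in an autograd framework.

```python
import random

def stop_grad(z):
    # placeholder: in an autograd framework this would be z.detach(),
    # i.e. the same values with gradient flow cut off
    return list(z)

def switching_step(z_voxel, z_views, rng):
    """Randomly freeze one branch's latent as the contrastive target;
    the other branch stays trainable for this step."""
    if rng.random() < 0.5:
        return stop_grad(z_voxel), z_views, "voxel_frozen"
    return stop_grad(z_views), z_voxel, "views_frozen"

rng = random.Random(42)
choices = [switching_step([0.1, 0.2], [0.3, 0.4], rng)[2] for _ in range(1000)]
frozen_voxel = choices.count("voxel_frozen")
print(frozen_voxel)  # roughly half of the 1000 steps
```

Randomizing which branch is detached keeps either encoder from degenerating into a passive target, in contrast to schemes that always stop gradients on the same side.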
Chengzhi Wu
Vision and Fusion Laboratory (IES), Karlsruhe Institute of Technology
computer vision
Julius Pfrommer
Head of Department, Fraunhofer IOSB
Automation · Optimization · Machine Learning · Industrie 4.0
Mingyuan Zhou
Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany
J. Beyerer
Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB, Germany