Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

📅 2025-10-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses weak semantic consistency and difficult cross-modal alignment in unsupervised 3D representation learning by proposing a joint self-supervised framework for 2D images and 3D point clouds. Methodologically, it integrates 2D-3D cross-modal joint embedding, intra-modal self-distillation on point clouds, a variant for video-lifted point clouds, and a linear projection into the CLIP language embedding space, mimicking human multisensory concept learning. Its core contribution is geometry-semantic joint representation learning without manual annotations, while supporting open-world semantic transfer. In linear probing for 3D scene perception, it outperforms the best standalone 2D and 3D self-supervised baselines by 14.2% and 4.8%, respectively; with full fine-tuning, it sets a new state of the art of 80.7% mIoU on ScanNet. Visualizations confirm strong alignment between learned geometric structures and semantic distributions.
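The 2D-3D joint embedding described above can be illustrated with a symmetric InfoNCE-style contrastive objective over paired pixel and point features. This is a minimal NumPy sketch of that general technique; the function names, temperature value, and loss form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def cross_modal_nce(feat_2d, feat_3d, temperature=0.07):
    """Symmetric InfoNCE between paired features (hypothetical sketch).

    feat_2d, feat_3d: (N, D) arrays; row i of each is an aligned
    2D-pixel / 3D-point pair. Diagonal entries are positives.
    """
    # L2-normalize so dot products are cosine similarities
    a = feat_2d / np.linalg.norm(feat_2d, axis=1, keepdims=True)
    b = feat_3d / np.linalg.norm(feat_3d, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # (N, N) similarities
    # Cross-entropy with the diagonal as targets, averaged over both
    # directions (2D->3D and 3D->2D)
    loss_2d = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_3d = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (loss_2d + loss_3d)
```

When the two modalities' features for a pair coincide and are well separated from other pairs, the loss approaches zero; mismatched random features give a loss near log N.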

๐Ÿ“ Abstract
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
Problem

Research questions and friction points this paper is trying to address.

Learning spatial representations through joint 2D-3D self-supervised learning
Improving 3D scene perception via multimodal feature coherence
Enabling open-world spatial understanding with geometric-semantic consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint 2D-3D cross-modal embedding for spatial learning
3D intra-modal self-distillation combined with joint embedding
Linear projection into CLIP's language space for open-world perception
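The third innovation, a linear translator into CLIP's language space, can be sketched as a least-squares fit of a linear map from learned spatial features to text embeddings, followed by nearest-text classification. All dimensions, data, and function names below are synthetic stand-ins for illustration; the paper's actual translator training may differ.

```python
import numpy as np

def fit_translator(feats, clip_targets):
    """Fit a linear map W minimizing ||feats @ W - clip_targets||^2.

    feats: (N, D_feat) spatial features; clip_targets: (N, D_clip)
    paired language-space embeddings (hypothetical training pairs).
    """
    W, *_ = np.linalg.lstsq(feats, clip_targets, rcond=None)
    return W

def open_world_labels(feats, W, text_embeds):
    """Assign each point the nearest text embedding by cosine similarity."""
    z = feats @ W
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (z @ t.T).argmax(axis=1)
```

Because the map is linear, open-vocabulary queries reduce to a matrix multiply plus a cosine-similarity lookup against arbitrary text prompts embedded by CLIP.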