Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing 3D foundation models rely on global vector alignment alone, which cannot establish fine-grained pixel-to-point correspondences. This work proposes Tango3D, a 3D foundation model that unifies dense local alignment with global semantic retrieval. A geometry-aware 2D vision backbone encodes images and a pretrained 3D variational autoencoder (VAE) encodes point clouds, and both modalities are mapped into a shared embedding space. A three-stage progressive training strategy stabilizes joint learning of the dense and global objectives, yielding object-level alignment between pixels and 3D points. While maintaining competitive global retrieval performance, Tango3D builds a fine-grained feature space intended to support a variety of dense 3D downstream tasks.
📝 Abstract
Existing 3D foundation models typically align point clouds to frozen vision-language spaces such as CLIP, achieving strong cross-modal retrieval by compressing a 3D shape into a single global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To address this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone to encode images into 2D patches and a pretrained 3D VAE to encode point clouds into 3D tokens. Both are mapped into a single shared space to achieve local pixel-to-point alignment and global semantic alignment together. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show that our model achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.
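To make the idea of one shared space serving both local and global alignment concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the projection heads, the pooling, or the loss, so everything here is an assumption chosen for illustration: the linear projections, mean pooling for the global vector, a symmetric InfoNCE loss, the temperature of 0.07, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def project(features, weight):
    """Hypothetical linear head: map encoder features into the shared space."""
    return l2_normalize(features @ weight)


def dense_correspondence(patch_emb, token_emb):
    """Local alignment: match each 2D patch to its most similar 3D token.

    patch_emb: (num_patches, d), token_emb: (num_tokens, d), both unit-norm.
    Returns the best token index per patch and the full similarity matrix.
    """
    sim = patch_emb @ token_emb.T
    return sim.argmax(axis=1), sim


def global_embedding(local_emb):
    """Assumed pooling: average the local embeddings into one global vector."""
    return l2_normalize(local_emb.mean(axis=0))


def info_nce(image_globals, shape_globals, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired global embeddings.

    Row i of image_globals is assumed to be paired with row i of shape_globals.
    """
    logits = image_globals @ shape_globals.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


# Toy usage with random stand-ins for encoder outputs (no real encoders here).
rng = np.random.default_rng(0)
w_img = rng.normal(size=(32, 8))   # 2D patch features (dim 32) -> shared dim 8
w_pts = rng.normal(size=(16, 8))   # 3D token features (dim 16) -> shared dim 8

patches = project(rng.normal(size=(49, 32)), w_img)   # one image, 49 patches
tokens = project(rng.normal(size=(20, 16)), w_pts)    # one shape, 20 tokens

matches, similarity = dense_correspondence(patches, tokens)
image_vec = global_embedding(patches)
shape_vec = global_embedding(tokens)
```

The point of the sketch is that the same projected embeddings feed both objectives: the per-patch similarity matrix drives dense matching, while pooled vectors drive retrieval. How Tango3D actually balances or stages these objectives is described only as a three-stage progressive strategy in the abstract, so no schedule is sketched here.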
Problem

Research questions and friction points this paper is trying to address.

2D-3D correspondence
dense alignment
3D foundation models
pixel-to-point correspondence
global-local alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

2D-3D correspondence
dense alignment
foundation model
progressive training
shared embedding space