Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing 3D foundation models rely solely on global vector alignment, which cannot establish fine-grained pixel-to-point correspondences. This work proposes Tango3D, a 3D foundation model that unifies dense local alignment with global semantic retrieval, a combination the authors report is not offered by existing 3D foundation models. A geometry-aware 2D vision backbone encodes images into 2D patches and a pretrained 3D variational autoencoder (VAE) encodes point clouds into 3D tokens; both modalities are then mapped into a shared embedding space. A three-stage progressive training strategy stabilizes joint learning of the dense and global objectives and yields object-level alignment between pixels and 3D points. While maintaining competitive global retrieval performance, Tango3D constructs a fine-grained feature space intended to support a range of dense 3D downstream tasks.

📝 Abstract

Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, achieving strong cross-modal retrieval by compressing each 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To address this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show that our model achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.
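To make the idea of a single shared space concrete, the following is a minimal sketch of how local and global alignment could both be read out of the same embeddings. It is not the authors' implementation: the function names, the use of cosine similarity, and mean pooling for the global vector are all assumptions chosen for illustration, and the embeddings are hand-written toy vectors rather than outputs of a 2D backbone or 3D VAE.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors in the shared space."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def match_patches_to_tokens(patch_emb, token_emb):
    """Local alignment (assumed readout): for each 2D patch embedding,
    return the index of the most similar 3D token embedding."""
    return [
        max(range(len(token_emb)), key=lambda j: cosine(p, token_emb[j]))
        for p in patch_emb
    ]

def pool(embs):
    """Global embedding (assumed): mean of the local embeddings."""
    dim = len(embs[0])
    return [sum(e[d] for e in embs) / len(embs) for d in range(dim)]

def retrieve(query_embs, gallery):
    """Global retrieval: index of the gallery shape whose pooled
    embedding is most similar to the pooled query embedding."""
    q = pool(query_embs)
    return max(range(len(gallery)), key=lambda i: cosine(q, pool(gallery[i])))

# Toy data: three 3D tokens and two image patches in a 3-dimensional space.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
patches = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]]
print(match_patches_to_tokens(patches, tokens))
```

The point of the sketch is that one set of per-patch and per-token embeddings serves both tasks: nearest-neighbor search over local embeddings gives dense correspondence, while a pooled vector gives a retrieval key. How Tango3D actually derives its global representation and trains the two objectives is described in the paper, not here.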

Problem

Research questions and friction points this paper is trying to address.

2D-3D correspondence

dense alignment

3D foundation models

pixel-to-point correspondence

global-local alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

2D-3D correspondence

dense alignment

foundation model