Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

πŸ“… 2024-11-29
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Vision Transformers (ViTs) exhibit limited capability in understanding 3D spatial relationships, hindering their performance on geometrically grounded downstream tasks. Method: We propose a lightweight, architecture-agnostic multi-view equivariance enhancement method that (1) enforces geometric consistency via 3D correspondence-based constraints; (2) employs a contrastive fine-tuning strategy that achieves significant improvement in cross-view equivariance of 3D semantic embeddings using only a single object and a single training iteration; and (3) introduces equivariance regularization in feature space. Contribution/Results: We are the first to systematically demonstrate a strong correlation between multi-view semantic embedding equivariance and performance on pose estimation, tracking, and cross-view semantic transfer. Experiments show substantial gains across diverse 3D perception tasksβ€”without requiring large-scale annotations or architectural modifications.
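The correspondence-based contrastive objective described above can be illustrated with a minimal sketch: features of pixels that project from the same 3D surface point in two views are pulled together, while all other features in the second view act as negatives. This is a hypothetical InfoNCE-style formulation for illustration, not the authors' exact loss; the function name and signature are assumptions.

```python
import numpy as np

def correspondence_infonce(feat_a, feat_b, pairs, temperature=0.07):
    """InfoNCE-style loss over 3D-corresponding features from two views.

    feat_a, feat_b: (N, D) feature vectors sampled from two rendered
        views of the same object.
    pairs: list of (i, j) index pairs such that feat_a[i] and feat_b[j]
        project from the same 3D surface point.
    (Hypothetical sketch; not the paper's exact objective.)
    """
    # L2-normalize so dot products become cosine similarities
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N) cross-view similarity matrix
    losses = []
    for i, j in pairs:
        # Softmax cross-entropy: the true correspondence j is the
        # positive; every other view-B feature is a negative.
        log_prob = logits[i, j] - np.log(np.sum(np.exp(logits[i])))
        losses.append(-log_prob)
    return float(np.mean(losses))
```

Minimizing this loss drives corresponding features toward agreement across viewpoints, which is one way to encourage the cross-view equivariance the paper correlates with downstream 3D task performance.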

πŸ“ Abstract
Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships remains unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, finetuning on a single object for one iteration results in substantial gains. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D awareness in ViT-based models
Improving the learning of 3D equivariant features
Achieving better 3D correspondence understanding with minimal finetuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhancing the 3D awareness of ViT features
A lightweight finetuning strategy based on 3D correspondences
Improved 3D equivariance yielding gains on downstream tasks
Yang You
Postdoc, Stanford University
3D vision, computer graphics, computational geometry
Yixin Li
Stony Brook University
PET instrumentation, medical imaging, X-ray imaging
Congyue Deng
PhD student, Stanford University
Yue Wang
Department of Computer Science, University of Southern California, U.S.A.
Leonidas J. Guibas
Department of Computer Science, Stanford University, U.S.A.