Weakly-Supervised Learning of Dense Functional Correspondences

📅 2025-09-04
🤖 AI Summary
This work addresses the problem of dense functional correspondence modeling across object categories. We propose a weakly supervised learning framework that (1) leverages vision-language models (VLMs) for the first time to automatically generate part-level functional pseudo-labels from multi-view images, and constructs a joint synthetic–real benchmark for evaluation; (2) employs pixel-wise dense contrastive learning to jointly encode functional semantics and spatial-geometric knowledge; and (3) incorporates knowledge distillation to enhance generalization. Experiments demonstrate that our method significantly outperforms existing self-supervised image representations and grounded vision-language models on cross-category functional matching tasks. The results validate both the effectiveness and generalizability of function-guided correspondence learning.

📝 Abstract
Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision-language models.
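
To make "dense contrastive learning from pixel correspondences" concrete, here is a minimal NumPy sketch of a pixel-wise InfoNCE-style loss. It is an illustration under stated assumptions, not the paper's actual objective: the function name `dense_info_nce`, the temperature of 0.07, and the choice of InfoNCE itself are all assumptions, since the abstract does not specify the loss. Each known pixel match is treated as a positive pair, and the other matched pixels in the batch serve as negatives.

```python
import numpy as np

def dense_info_nce(feat_a, feat_b, matches, temperature=0.07):
    """Pixel-wise InfoNCE loss (illustrative sketch, not the paper's loss).

    feat_a, feat_b: (num_pixels, dim) per-pixel feature arrays for two images.
    matches: (n, 2) integer array; row k pairs pixel matches[k, 0] in image A
             with pixel matches[k, 1] in image B.
    temperature: softmax temperature (0.07 is an assumed, commonly used value).
    """
    a = feat_a[matches[:, 0]]
    b = feat_b[matches[:, 1]]
    # L2-normalize so that the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # logits[i, j]: similarity of matched pixel i in A to matched pixel j in B.
    logits = a @ b.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

The loss is close to zero when each pixel's feature is most similar to its matched partner and grows when features favor the wrong pixel. In the paper's setting the matches would presumably come from multi-view pixel correspondences and the VLM pseudo-labels; how the two signals are combined is not stated in the abstract.
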
Problem

Research questions and friction points this paper is trying to address.

Learning dense functional correspondences across image pairs
Leveraging vision-language models for pseudo-labeling functional parts
Integrating contrastive learning for functional and spatial knowledge
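
As a rough illustration of what predicting a dense functional correspondence could look like once per-pixel features exist, the sketch below matches every pixel of a source functional part to its most similar pixel within a target functional part. Everything here is an assumption for illustration: the function name `match_functional_pixels`, the boolean part masks, and nearest-neighbor matching by cosine similarity are not confirmed as the paper's inference procedure.

```python
import numpy as np

def match_functional_pixels(feat_src, feat_tgt, mask_src, mask_tgt):
    """Nearest-neighbor matching restricted to functional parts (sketch).

    feat_src, feat_tgt: (num_pixels, dim) per-pixel feature arrays.
    mask_src, mask_tgt: (num_pixels,) boolean arrays marking the functional
                        part in each image (hypothetical inputs).
    Returns an (n, 2) array of (source pixel index, target pixel index).
    """
    src_idx = np.flatnonzero(mask_src)
    tgt_idx = np.flatnonzero(mask_tgt)
    s = feat_src[src_idx]
    t = feat_tgt[tgt_idx]
    # Cosine similarity between every masked source and target pixel.
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    best = (s @ t.T).argmax(axis=1)
    return np.stack([src_idx, tgt_idx[best]], axis=1)
```

Restricting the search to the masked pixels reflects the idea that correspondences are guided by the parts that enable a given function, rather than by the whole object.
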
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging vision-language models for pseudo-labeling
Integrating dense contrastive learning for knowledge distillation
Curating synthetic and real datasets for evaluation
Authors

Stefan Stojanov, Postdoc at Stanford Vision Lab and Neuro AI Lab (Computer Vision, Machine Learning)
Linan Zhao, Stanford University
Yunzhi Zhang, Stanford University (Computer Vision, Reinforcement Learning)
Daniel L. K. Yamins, Stanford University
Jiajun Wu, Stanford University