Weakly-Supervised Learning of Dense Functional Correspondences

📅 2025-09-04

🤖 AI Summary

This work addresses dense functional correspondence across object categories. It proposes a weakly-supervised learning framework that (1) uses vision-language models (VLMs) to automatically pseudo-label functional parts in multi-view images; (2) applies pixel-wise dense contrastive learning on pixel correspondences, distilling both functional and spatial knowledge into a single model; and (3) curates synthetic and real evaluation datasets as benchmarks for the task. Experiments show advantages over baselines built from off-the-shelf self-supervised image representations and grounded vision-language models on cross-category functional matching.

📝 Abstract

Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.
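The dense contrastive step mentioned in the abstract can be illustrated with a pixel-wise InfoNCE loss. This is a minimal NumPy sketch, not the authors' implementation: the function name `dense_info_nce`, the `(ya, xa, yb, xb)` match layout, the temperature value, and the choice of InfoNCE itself are all assumptions made for illustration. Each matched pixel pair is treated as a positive, and the other sampled pixels in the second image act as negatives.

```python
import numpy as np


def dense_info_nce(feat_a, feat_b, matches, temperature=0.07):
    """Pixel-wise InfoNCE loss over matched pixel pairs (illustrative sketch).

    feat_a, feat_b: (H, W, C) dense feature maps for two images.
    matches: (N, 4) integer rows (ya, xa, yb, xb); pixel (ya, xa) in image A
             is assumed to correspond to pixel (yb, xb) in image B.
    temperature: softmax temperature (hypothetical default, not from the paper).
    """
    matches = np.asarray(matches)
    # Gather the feature vectors at the matched locations.
    q = feat_a[matches[:, 0], matches[:, 1]]  # (N, C)
    k = feat_b[matches[:, 2], matches[:, 3]]  # (N, C)

    # L2-normalize so that the dot product is cosine similarity.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)

    # logits[i, j]: similarity of query i to key j; the diagonal holds positives.
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the matching index as the target class.
    return float(-np.mean(np.diag(log_prob)))
```

In a weakly-supervised setting like the one described, the `matches` array could come from multi-view pixel correspondences, optionally restricted to pixels that share a pseudo-labeled functional part; how the pairs are actually sampled is not specified in this summary.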

Problem

Research questions and friction points this paper is trying to address.

Learning dense functional correspondences across image pairs

Leveraging vision-language models for pseudo-labeling functional parts

Integrating dense contrastive learning to capture both functional and spatial knowledge
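Once dense features are available, one common way to read out correspondences is nearest-neighbor matching in feature space. The sketch below assumes that readout; the summary does not say how the paper predicts matches. The function name `match_pixels` and the optional `tgt_mask` argument (a boolean map of a functional part in the target image) are hypothetical.

```python
import numpy as np


def match_pixels(feat_src, feat_tgt, query_yx, tgt_mask=None):
    """Match source pixels to target pixels by cosine similarity (sketch).

    feat_src, feat_tgt: (H, W, C) dense feature maps.
    query_yx: (N, 2) integer (y, x) coordinates in the source image.
    tgt_mask: optional (H, W) boolean array; when given, matches are
              restricted to target pixels where the mask is True.
    Returns an (N, 2) integer array of (y, x) coordinates in the target image.
    """
    h, w, c = feat_tgt.shape
    tgt = feat_tgt.reshape(-1, c)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)

    query_yx = np.asarray(query_yx)
    q = feat_src[query_yx[:, 0], query_yx[:, 1]]
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)

    sim = q @ tgt.T  # (N, H*W) cosine similarities
    if tgt_mask is not None:
        # Exclude target pixels outside the functional part.
        sim = np.where(tgt_mask.reshape(1, -1), sim, -np.inf)

    best = sim.argmax(axis=1)
    # Convert flat indices back to (y, x) coordinates.
    return np.stack(np.unravel_index(best, (h, w)), axis=1)
```

Restricting candidates with a part mask is one simple way to keep matches inside the region that carries the function of interest, for example matching a point on a mug handle only to pixels on another object's handle-like part.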

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging vision-language models for pseudo-labeling

Distilling functional and spatial knowledge into a single model via dense contrastive learning

Curating synthetic and real datasets for evaluation
