AI Summary
This work addresses the problem of dense functional correspondence modeling across object categories. We propose a weakly supervised learning framework that (1) leverages vision-language models (VLMs) for the first time to automatically generate part-level functional pseudo-labels from multi-view images, and constructs a joint synthetic-real benchmark for evaluation; (2) employs pixel-wise dense contrastive learning to jointly encode functional semantics and spatial-geometric knowledge; and (3) incorporates knowledge distillation to enhance generalization. Experiments demonstrate that our method significantly outperforms existing self-supervised image representation and grounded vision-language models on cross-category functional matching tasks. The results validate both the effectiveness and generalizability of function-guided correspondence learning.
Abstract
Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision-language models.
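To make the idea of dense contrastive learning from pixel correspondences concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the loss, so an InfoNCE-style objective is assumed here: each matched pixel pair is treated as a positive, and all other sampled pixels in the second image act as negatives. The function name `dense_info_nce`, the `temperature` value, and the array layout are all hypothetical choices made for illustration.

```python
import numpy as np


def dense_info_nce(feat_a, feat_b, pix_a, pix_b, temperature=0.07):
    """InfoNCE-style loss over matched pixels (illustrative assumption).

    feat_a, feat_b: (H, W, C) per-pixel feature maps of two images.
    pix_a, pix_b:   (N, 2) integer (row, col) coordinates; row i of pix_a
                    corresponds to row i of pix_b.
    """
    # Gather the descriptors at the matched pixel locations: (N, C) each.
    za = feat_a[pix_a[:, 0], pix_a[:, 1]]
    zb = feat_b[pix_b[:, 0], pix_b[:, 1]]

    # L2-normalize so the dot product is a cosine similarity.
    eps = 1e-8
    za = za / (np.linalg.norm(za, axis=1, keepdims=True) + eps)
    zb = zb / (np.linalg.norm(zb, axis=1, keepdims=True) + eps)

    # (N, N) similarity matrix; entry (i, i) is the positive pair.
    logits = za @ zb.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-likelihood of the true correspondences.
    return float(-np.mean(np.diag(log_prob)))


# Usage: correctly aligned correspondences should score a lower loss
# than deliberately misaligned ones.
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 16))
pix = np.array([[0, 0], [1, 3], [2, 5], [4, 4], [7, 1]])
aligned = dense_info_nce(feat, feat, pix, pix)
misaligned = dense_info_nce(feat, feat, pix, np.roll(pix, 1, axis=0))
```

In a setup like the one the abstract describes, the correspondences would presumably come from multi-view geometry, with the functional pseudo-labels restricting which pixels are sampled or paired; that combination is not shown here because the abstract does not give its details.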