AI Summary
This paper addresses unsupervised cross-modal pixel-level spatiotemporal correspondence matching, i.e., establishing precise pixel-wise alignments between scene points across heterogeneous visual modalities (e.g., RGB/depth, RGB/thermal, photo/sketch) without registration labels or photometric-consistency assumptions. We propose the first contrastive-random-walk-based joint learning framework that unifies cross-modal and intra-modal cycle-consistency constraints, enabling end-to-end self-supervised feature learning and embedding-space alignment. Our key contribution is extending contrastive random walks to multimodal cycle-consistent representation learning, eliminating reliance on pre-aligned image pairs. The method achieves state-of-the-art performance on multiple geometric and semantic matching benchmarks, significantly improving unsupervised cross-modal correspondence accuracy.
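To make the core mechanism concrete, below is a minimal sketch of the row-stochastic cross-modal transition that a contrastive random walk steps through. This is an illustration under stated assumptions, not the paper's released code: PyTorch is assumed, and the function name `stochastic_affinity`, the temperature value, and all tensor shapes are hypothetical.

```python
# Sketch of the row-stochastic cross-modal transition used by a contrastive
# random walk. Names, shapes, and the temperature are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def stochastic_affinity(feats_a: torch.Tensor, feats_b: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Transition matrix from pixels of modality A to pixels of modality B.

    feats_a: (N, C) per-pixel embeddings from modality A (e.g., RGB).
    feats_b: (M, C) per-pixel embeddings from modality B (e.g., depth).
    Returns an (N, M) matrix whose rows are soft matching distributions.
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    sim = a @ b.t() / temperature      # scaled cosine similarities
    return F.softmax(sim, dim=-1)      # each A-pixel "walks" to B-pixels

# Example: 64 RGB pixels vs. 64 depth pixels with 128-D embeddings.
T = stochastic_affinity(torch.randn(64, 128), torch.randn(64, 128))
print(T.shape, T.sum(dim=-1)[:3])      # (64, 64); every row sums to 1
```

Because the rows are softmax distributions rather than hard assignments, gradients flow through every candidate match, which is what lets a random-walk objective train the features end to end.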
Abstract
We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and makes no explicit photo-consistency assumptions. It can be trained entirely on unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and cross-style image alignment. Our method achieves strong performance across all benchmarks.
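The joint objective the abstract describes can be illustrated with a hedged sketch in the spirit of the contrastive random walk: walk from one modality to the other and back, train each pixel to return to itself, and sum a cross-modal cycle term with an intra-modal one. PyTorch is again assumed, and `transition`, `cycle_loss`, and the frame/modality variable names are hypothetical placeholders rather than the authors' API.

```python
# Hedged sketch of a palindrome cycle-consistency objective in the spirit
# of the contrastive random walk: walk A -> B -> A and require each pixel
# to return to itself. Function and variable names are hypothetical.
import torch
import torch.nn.functional as F

def transition(src: torch.Tensor, dst: torch.Tensor,
               temperature: float = 0.07) -> torch.Tensor:
    # Row-stochastic transition from src pixels to dst pixels.
    src = F.normalize(src, dim=-1)
    dst = F.normalize(dst, dim=-1)
    return F.softmax(src @ dst.t() / temperature, dim=-1)

def cycle_loss(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    # Round-trip walk A -> B -> A; the target is the identity mapping,
    # i.e., pixel i should land back on pixel i.
    round_trip = transition(feats_a, feats_b) @ transition(feats_b, feats_a)
    targets = torch.arange(feats_a.size(0))
    return F.nll_loss(torch.log(round_trip + 1e-8), targets)

# Joint objective: a cross-modal cycle (RGB <-> depth) plus an intra-modal
# cycle (RGB at time t <-> RGB at time t+1), echoing the simultaneous
# cross-modal and intra-modal matching the abstract describes.
rgb_t   = torch.randn(64, 128, requires_grad=True)
rgb_t1  = torch.randn(64, 128, requires_grad=True)
depth_t = torch.randn(64, 128, requires_grad=True)
loss = cycle_loss(rgb_t, depth_t) + cycle_loss(rgb_t, rgb_t1)
loss.backward()
```

Note how this objective needs no pixel-aligned ground truth or photo-consistency term: the only supervisory signal is that round-trip walks should close on themselves, which matches the unlabeled-data training setup claimed above.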