Canonicalizing Multimodal Contrastive Representation Learning

📅 2026-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a key limitation of independently trained multimodal contrastive models such as CLIP, SigLIP, and FLAVA: their representation spaces are not explicitly aligned, in particular with respect to image–text coupling consistency. The authors show theoretically that the embedding spaces of the image and text encoders, despite being trained with different architectures and on different data distributions, can be aligned simultaneously by a single orthogonal map. Building on this insight, they propose a unified alignment framework combining orthogonal-map estimation, multimodal kernel consistency analysis, and anchor-set validation. Notably, the method requires no re-embedding, making it directly compatible with pre-trained models. Extensive experiments across several established architectures validate its effectiveness, and the results also offer a new perspective on the privacy of learned representations.
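The kernel consistency analysis mentioned above can be illustrated with a small sketch (not the authors' code; the function name and toy data are assumptions): two models whose embeddings differ only by an orthogonal map produce identical image–text Gram matrices on an anchor set, since inner products are invariant under orthogonal transforms.

```python
import numpy as np

def kernel_agreement(F, G, F_tilde, G_tilde):
    """Compare the multimodal kernels <f(x), g(y)> and <f~(x), g~(y)>
    of two models on a shared anchor set of paired images/texts.
    Returns the max absolute deviation between the two Gram matrices."""
    K = F @ G.T                    # (n_img, n_txt) kernel of model 1
    K_tilde = F_tilde @ G_tilde.T  # kernel of model 2
    return np.abs(K - K_tilde).max()

# Toy anchors: model 2's embeddings are an orthogonal transform of
# model 1's, so the two kernels agree up to floating-point error.
rng = np.random.default_rng(1)
d = 8
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal map
F, G = rng.normal(size=(32, d)), rng.normal(size=(32, d))
dev = kernel_agreement(F, G, F @ Q.T, G @ Q.T)
print(dev < 1e-10)
```

In the paper's setting the agreement is approximate rather than exact, so one would threshold this deviation on the anchor set rather than expect machine precision.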

📝 Abstract
As models and data scale, independently trained networks often induce analogous notions of similarity. But matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality, but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models (with encoders $(f, g)$ and $(\widetilde{f},\widetilde{g})$) -- trained on different distributions and with different architectures -- does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global mean shift), i.e., there exists an orthogonal map $Q$ (with $Q^\top Q = I$) such that $\widetilde{f}(x)\approx Q f(x)$ for paired images $x$. Strikingly, the same $Q$ simultaneously aligns the text encoders, i.e., $\widetilde{g}(y)\approx Q g(y)$ for texts $y$. Theoretically, we prove that if the multimodal kernel agrees across models on a small anchor set, i.e., $\langle f(x), g(y)\rangle \approx \langle \widetilde{f}(x), \widetilde{g}(y)\rangle$, then the two models must be related by a single orthogonal map $Q$, and the same $Q$ maps images and text across models. More broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations. Our project page: https://canonical-multimodal.github.io/
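A natural way to estimate such a $Q$ from paired anchor embeddings is the classical orthogonal Procrustes solution (an SVD of the cross-covariance); the sketch below is a minimal illustration of that idea, not the paper's released code, and the function name and toy data are assumptions. Centering both embedding sets absorbs the global mean shift noted in the abstract.

```python
import numpy as np

def fit_orthogonal_map(F, F_tilde):
    """Estimate an orthogonal Q minimizing ||F_tilde - F @ Q.T||_F,
    i.e. f~(x) ~= Q f(x) row-wise, via orthogonal Procrustes.
    F, F_tilde: (n, d) paired anchor embeddings."""
    # Center both sets to absorb a global mean shift.
    F_c = F - F.mean(axis=0)
    Ft_c = F_tilde - F_tilde.mean(axis=0)
    # SVD of the cross-covariance gives the optimal orthogonal map.
    U, _, Vt = np.linalg.svd(Ft_c.T @ F_c)
    return U @ Vt  # Q with Q.T @ Q = I

# Toy check: embeddings related by a known orthogonal map are recovered.
rng = np.random.default_rng(0)
d, n = 8, 200
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
F = rng.normal(size=(n, d))
F_tilde = F @ Q_true.T
Q_hat = fit_orthogonal_map(F, F_tilde)
print(np.allclose(Q_hat @ Q_hat.T, np.eye(d)))       # orthogonality holds
print(np.allclose(F @ Q_hat.T, F_tilde, atol=1e-6))  # anchors are aligned
```

Per the abstract's main finding, the same estimated $Q$ would then be applied unchanged to the text embeddings $g(y)$ to align the other modality.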
Problem

Research questions and friction points this paper is trying to address.

multimodal representation
contrastive learning
embedding alignment
orthogonal transformation
canonicalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal representation alignment
orthogonal transformation
contrastive learning
canonicalization
model compatibility