🤖 AI Summary
This work investigates unsupervised alignment of vision and language embeddings without parallel data, providing the first systematic evidence that cross-modal representations from large foundation models admit feasible unsupervised matching. Methodologically, it formulates embedding alignment as a quadratic assignment problem and introduces a novel heuristic solver grounded in distance-distribution consistency and graph-structural constraints; it further designs a strategy for constructing favorable matching instances to improve robustness under fully blind conditions. Experiments with multiple foundation models, including CLIP and SigLIP, on four standard benchmarks demonstrate significant matchability among diverse vision and language encoders. The resulting purely unsupervised classifier achieves non-trivial zero-shot classification accuracy (up to 42.3%) without any image-text annotations, establishing the first verifiable benchmark and methodological framework for unsupervised cross-modal alignment.
📝 Abstract
The Platonic Representation Hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first feasibility study, and investigate the conformity of existing vision and language foundation models in the context of unsupervised, or "blind", matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can indeed be matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotations.
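To make the quadratic-assignment view of blind matching concrete, here is a minimal sketch on synthetic data. It is not the paper's heuristic: the embeddings, the random isometries standing in for two modalities, and the brute-force solver are all illustrative assumptions. The key idea it demonstrates is that when intra-modal pairwise distances agree across modalities, the correspondence can be recovered from distance matrices alone, with no parallel data.

```python
# Sketch of "blind" matching via quadratic assignment (illustrative only).
# Synthetic stand-ins for vision/text embeddings: both modalities are
# isometric views of the same latent points, so their intra-modal
# pairwise-distance matrices agree up to an unknown permutation.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8  # tiny instance so exhaustive search is feasible

latent = rng.normal(size=(n, d))                 # shared semantics
Q_v, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal map ("vision")
Q_t, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal map ("text")
vision = latent @ Q_v
perm_true = rng.permutation(n)                   # unknown correspondence
text = (latent @ Q_t)[perm_true]

def dist_matrix(X):
    """Intra-modal Euclidean pairwise distances."""
    return np.linalg.norm(X[:, None] - X[None, :], axis=-1)

D_v, D_t = dist_matrix(vision), dist_matrix(text)

# QAP objective: find permutation p minimizing ||D_v - D_t[p][:, p]||_F.
# Brute force works only for tiny n; scalable solvers use heuristics.
best_perm, best_cost = None, np.inf
for p in itertools.permutations(range(n)):
    p = np.array(p)
    cost = np.linalg.norm(D_v - D_t[np.ix_(p, p)])
    if cost < best_cost:
        best_perm, best_cost = p, cost

# best_perm now maps vision index i -> text index best_perm[i],
# i.e., it is the inverse of the hidden permutation perm_true.
```

Because the two views here are exact isometries, the true correspondence drives the Frobenius discrepancy to zero; real embeddings only approximately satisfy this, which is why the paper needs a robust heuristic solver and a way to select matching problems where a non-trivial match is likely.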