SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SOTAlign, a two-stage semi-supervised cross-modal alignment framework designed for settings with only a limited number of image–text paired samples and abundant unpaired data. The approach first constructs a coarse shared embedding space using the scarce paired data via a linear teacher model, then refines alignment on the unpaired data by leveraging optimal transport divergence to transfer relational structures without imposing rigid constraints on the target embedding space. SOTAlign is the first method to effectively exploit large-scale unpaired image–text data in a semi-supervised setting, thereby reducing reliance on extensive labeled datasets. It consistently outperforms existing supervised and semi-supervised baselines across multiple datasets and encoder architectures, demonstrating superior generalization and robustness.

📝 Abstract
The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.
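The two-stage recipe described in the abstract can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: the least-squares linear teacher and the entropic (Sinkhorn) approximation of the optimal-transport divergence are assumptions standing in for whatever SOTAlign actually uses, and all function names and hyperparameters here are made up for the example.

```python
import numpy as np

def fit_linear_teacher(img_emb, txt_emb):
    # Stage 1 (sketch): least-squares linear map from image embeddings
    # to text embeddings, fit on the few available paired samples.
    W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)
    return W

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    # Stage 2 (sketch): entropic-regularized OT plan between two
    # uniform empirical distributions via Sinkhorn iterations.
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-cost / reg)
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy data: 8 paired samples, 32 unpaired embeddings per modality.
rng = np.random.default_rng(0)
d = 16
pairs_img = rng.normal(size=(8, d))
pairs_txt = pairs_img @ rng.normal(size=(d, d)) * 0.1 + rng.normal(size=(8, d)) * 0.01

W = fit_linear_teacher(pairs_img, pairs_txt)          # coarse shared geometry

unpaired_img = rng.normal(size=(32, d)) @ W           # mapped by the teacher
unpaired_txt = rng.normal(size=(32, d))
cost = ((unpaired_img[:, None, :] - unpaired_txt[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()                              # normalize for stability

plan = sinkhorn_plan(cost)
ot_divergence = (plan * cost).sum()  # transport cost as an alignment loss
```

In this sketch the OT divergence only matches the two unpaired clouds as distributions, without forcing a fixed point-to-point correspondence, which is the "relational structure without overconstraining" intuition from the abstract.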
Problem

Research questions and friction points this paper is trying to address.

semi-supervised alignment
vision-language models
optimal transport
unpaired data
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised alignment
Optimal transport
Unimodal encoders
Joint embedding
Platonic Representation Hypothesis
Simon Roschmann
Helmholtz Munich, Technical University of Munich, Munich Center for Machine Learning, Munich Data Science Institute
Paul Krzakala
Télécom Paris, École Polytechnique
Sonia Mazelet
École Polytechnique
Quentin Bouniot
Postdoc at TUM and Helmholtz Munich
deep learning, representation learning, explainability, uncertainty, learning with limited labels
Zeynep Akata
Professor at Technical University of Munich and Director at Helmholtz Munich
Machine Learning, Vision and Language, Zero-Shot Learning