🤖 AI Summary
To address the misalignment between independently trained vision and language model representation spaces, this paper proposes the Joint Autoencoder Modulator (JAM) framework, which achieves cross-modal representation alignment without updating the original model parameters. JAM formalizes the Platonic representation hypothesis as a multi-objective optimization problem, integrating modality-specific autoencoders, contrastive loss (Con), a hard-negative variant (NegCon), a newly introduced Spread loss, and joint cross-modal reconstruction. Experiments demonstrate that JAM consistently achieves efficient, Pareto-optimal alignment across diverse model scales, architectures, and pretraining objectives. It significantly improves performance on cross-modal retrieval and understanding tasks. By enabling lightweight, general-purpose, and task-specializable multimodal systems without modifying pretrained models, JAM establishes a novel paradigm for modular, parameter-efficient multimodal alignment.
📝 Abstract
Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. Yet an emerging hypothesis - the Platonic Representation Hypothesis - suggests that such models may nonetheless converge toward a shared statistical model of reality. This compatibility, if it exists, raises a fundamental question: can we move beyond post-hoc statistical detection of alignment and explicitly optimize for it between such disjoint representations? We cast this Platonic alignment problem as a multi-objective optimization task - preserve each modality's native structure while aligning for mutual coherence. We introduce the Joint Autoencoder Modulator (JAM) framework, which jointly trains modality-specific autoencoders on the latent representations of pre-trained single-modality models, encouraging alignment through both reconstruction and cross-modal objectives. By analogy, this framework serves as a method to escape Plato's Cave, enabling the emergence of shared structure from disjoint inputs. We evaluate this framework across three critical design axes: (i) the alignment objective - comparing contrastive loss (Con), its hard-negative variant (NegCon), and our Spread loss; (ii) the layer depth at which alignment is most effective; and (iii) the impact of foundation model scale on representational convergence. Our findings show that our lightweight, Pareto-efficient framework reliably induces alignment, even across frozen, independently trained representations, offering both theoretical insight and practical pathways for transforming generalist unimodal foundations into specialist multimodal models.
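To make the training objective concrete, below is a minimal NumPy sketch of the kind of loss the abstract describes: frozen unimodal embeddings are passed through lightweight modality-specific autoencoders, and the only trained parameters are the adapter weights, optimized for within-modality reconstruction, cross-modal reconstruction, and a standard symmetric InfoNCE contrastive term. All names, dimensions, and the equal loss weighting are illustrative assumptions, not the paper's actual implementation; the NegCon and Spread variants discussed in the paper are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: frozen vision/text embeddings (d_v, d_t) map into a shared latent d_z.
d_v, d_t, d_z, batch = 768, 512, 128, 8

# Modality-specific linear autoencoders (in JAM, only these adapter weights would be trained).
W_enc_v = rng.normal(scale=0.02, size=(d_v, d_z))
W_dec_v = rng.normal(scale=0.02, size=(d_z, d_v))
W_enc_t = rng.normal(scale=0.02, size=(d_t, d_z))
W_dec_t = rng.normal(scale=0.02, size=(d_z, d_t))

def info_nce(z_a, z_b, tau=0.07):
    """Symmetric-in-the-batch InfoNCE: matched pairs sit on the diagonal."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                       # (batch, batch) cosine similarities
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                  # cross-entropy toward the diagonal

# Frozen unimodal features for a batch of paired (image, caption) examples.
x_v = rng.normal(size=(batch, d_v))
x_t = rng.normal(size=(batch, d_t))

# Encode both modalities into the shared latent space.
z_v, z_t = x_v @ W_enc_v, x_t @ W_enc_t

# Within-modality reconstruction preserves each modality's native structure ...
recon = np.mean((z_v @ W_dec_v - x_v) ** 2) + np.mean((z_t @ W_dec_t - x_t) ** 2)
# ... cross-modal reconstruction and the contrastive term pull the latents together.
cross = np.mean((z_v @ W_dec_t - x_t) ** 2) + np.mean((z_t @ W_dec_v - x_v) ** 2)
loss = recon + cross + info_nce(z_v, z_t)
```

In a real training loop the adapter weights would be updated by gradient descent on `loss` while the backbone encoders stay frozen, which is what makes the framework parameter-efficient.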