🤖 AI Summary
This survey investigates the representation potentials of foundation models for cross-modal alignment: specifically, whether their unimodal representations inherently capture task-specific semantics and transfer across modalities.
Method: We formalize "representation potential" and systematically analyze structural regularities and semantic consistency across vision, language, and speech foundation models. Leveraging cross-modal similarity metrics, representation visualization, and neuroscience-inspired evaluation protocols, we assess generalizability and unification capacity across diverse architectures.
Contribution/Results: Empirical results demonstrate that pretrained foundation models implicitly acquire the semantic invariances required for cross-modal alignment, even when trained on unimodal data, and thereby exhibit strong potential as unified multimodal representation backbones. Our work establishes a theoretical framework for cross-modal alignment and introduces a reproducible, multi-faceted evaluation paradigm grounded in representational analysis.
📝 Abstract
Foundation models learn highly transferable representations through large-scale pretraining on diverse data. An increasing body of research indicates that these representations exhibit a remarkable degree of similarity across architectures and modalities. In this survey, we investigate the representation potentials of foundation models, defined as the latent capacity of their learned representations to capture task-specific information within a single modality while also providing a transferable basis for alignment and unification across modalities. We begin by reviewing representative foundation models and the key metrics that make alignment measurable. We then synthesize empirical evidence of representation potentials from studies in vision, language, speech, multimodality, and neuroscience. The evidence suggests that foundation models often exhibit structural regularities and semantic consistencies in their representation spaces, positioning them as strong candidates for cross-modal transfer and alignment. We further analyze the key factors that foster representation potentials, discuss open questions, and highlight potential challenges.
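As a concrete illustration of the similarity metrics mentioned above, linear centered kernel alignment (CKA) is one widely used way to compare two representation spaces over the same set of stimuli. The sketch below is illustrative only: the array shapes and variable names (`X`, `Y`, `Z`) are assumptions for the demo, not taken from the survey, and real analyses would use actual model embeddings rather than random data.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between representation matrices
    X (n, d1) and Y (n, d2), where each row is one stimulus's embedding.
    Returns a similarity score in [0, 1]."""
    X = X - X.mean(axis=0)                      # center each feature dimension
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2   # ||Y^T X||_F^2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                   # hypothetical embeddings of 200 stimuli
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # a random orthogonal matrix
Y = X @ Q                                       # same geometry, rotated basis
Z = rng.normal(size=(200, 8))                   # unrelated embeddings
print(linear_cka(X, Y))                         # ~1.0: CKA is invariant to rotations
print(linear_cka(X, Z))                         # near 0: no shared structure
```

The key property for alignment studies is that CKA ignores superficial differences (rotation, isotropic scaling) between two spaces while remaining sensitive to shared geometric structure, which is why it appears frequently alongside visualization and probing in cross-model and cross-modal comparisons.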