The Indra Representation Hypothesis for Multimodal Alignment

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
While existing unimodal foundation models can learn convergent representations, they struggle to explicitly capture the relational structures shared across modalities, limiting alignment and generalization. Inspired by the philosophical metaphor of Indra's Net, this work proposes the Indra Representation Hypothesis: unimodal representations implicitly encode the global relational structure among samples. Leveraging the V-enriched Yoneda embedding from category theory, the authors formalize this structure and construct a training-free cross-modal alignment framework. The resulting Indra representations are unique, complete, and structure-preserving; combined with an angular distance metric, they enable, for the first time, training-free alignment across vision, language, audio, and other modalities. Extensive experiments demonstrate significantly improved robustness and consistency across diverse scenarios, validating the approach's effectiveness and universality.
πŸ“ Abstract
Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions that characterize samples independently, leading to limited expressiveness. In this paper, we propose the Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.
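The abstract describes instantiating each sample's "relational profile" with angular distance and using it for training-free cross-modal matching. The paper's actual implementation is in the linked repository; the sketch below is an illustration only, under the assumption that a profile is the vector of angular distances from a sample to a shared set of anchor samples, and that alignment is nearest-neighbor search in profile space. The function names (`angular_distance`, `relational_profile`, `align`) and the anchor-based scheme are this sketch's assumptions, not the authors' API.

```python
# Illustrative sketch of an angular-distance relational profile (NOT the
# authors' implementation). Assumption: cross-space comparison is done by
# describing each sample through its angular distances to shared anchors.
import numpy as np

def angular_distance(u, v):
    """Angular distance in [0, 1]: arccos of cosine similarity, divided by pi."""
    cos = np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)), -1.0, 1.0)
    return np.arccos(cos) / np.pi

def relational_profile(x, anchors):
    """Profile of one embedding: its angular distance to every anchor."""
    return np.array([angular_distance(x, a) for a in anchors])

def align(src_embs, tgt_embs, src_anchors, tgt_anchors):
    """Match each source sample to the target sample with the closest profile."""
    src_p = np.stack([relational_profile(x, src_anchors) for x in src_embs])
    tgt_p = np.stack([relational_profile(y, tgt_anchors) for y in tgt_embs])
    # Nearest neighbor in profile space (Euclidean over profiles).
    d = np.linalg.norm(src_p[:, None, :] - tgt_p[None, :, :], axis=-1)
    return d.argmin(axis=1)
```

Because angular distance depends only on inner products and norms, profiles are invariant under any orthogonal transform of the embedding space, which is what lets two independently trained encoders be compared without training a mapping between them.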
Problem

Research questions and friction points this paper is trying to address.

multimodal alignment
unimodal foundation models
representation expressiveness
relational structure
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Indra representation
relational structure
Yoneda embedding
training-free alignment
multimodal alignment
Authors

Jianglin Lu — Northeastern University (machine learning)
Hailing Wang — Department of Electrical and Computer Engineering, Northeastern University
Kuo Yang — Department of Electrical and Computer Engineering, Northeastern University
Yitian Zhang — Northeastern University (computer vision)
Simon Jenni — Adobe Research (computer vision, machine learning, deep learning, unsupervised learning, self-supervised learning)
Yun Fu — Department of Electrical and Computer Engineering, Northeastern University; Khoury College of Computer Science, Northeastern University