Multimodal LLMs under Pairwise Modalities

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the reliance of conventional multimodal large language models on costly and hard-to-scale fully aligned multimodal data. The authors propose a two-stage framework that trains such models using only pairwise modality data. In the first stage, a shared latent space is constructed through within-modality reconstruction and pairwise contrastive learning. In the second stage, new modality encoders are integrated with a pretrained decoder to enable cross-modal transfer and generation. Theoretical analysis establishes conditions under which aligned representations can be achieved using only pairwise data, introducing inductive biases based on partial alignment and minimal latent norms to eliminate the need for complete joint multimodal observations. The approach successfully incorporates 3D point clouds and tactile modalities into a pretrained model, achieving strong cross-modal performance across three pairs of modalities.
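The first stage described above combines within-modality reconstruction with pairwise contrastive learning. A minimal NumPy sketch of such a combined objective follows. It uses a standard symmetric InfoNCE loss as a stand-in, since the summary does not state the paper's exact loss. The function names, the `temperature` value, and the `contrastive_weight` parameter are all illustrative assumptions, and the partial-alignment and minimal-latent inductive biases are not modeled here.

```python
import numpy as np

def l2_normalize(z, eps=1e-8):
    # Scale each latent vector to unit length so dot products are cosine similarities.
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def pairwise_contrastive_loss(z_a, z_b, temperature=0.1):
    # Symmetric InfoNCE (assumed stand-in): row i of z_a and row i of z_b
    # are a positive pair; every other row in the batch is a negative.
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    idx = np.arange(len(z_a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return float(0.5 * (loss_ab + loss_ba))

def reconstruction_loss(x, x_hat):
    # Mean squared error between an input and its within-modality reconstruction.
    return float(np.mean((x - x_hat) ** 2))

def stage_one_loss(x_a, x_a_hat, x_b, x_b_hat, z_a, z_b, contrastive_weight=1.0):
    # Hypothetical combined objective for one modality pair (a, b).
    recon = reconstruction_loss(x_a, x_a_hat) + reconstruction_loss(x_b, x_b_hat)
    return recon + contrastive_weight * pairwise_contrastive_loss(z_a, z_b)
```

Under this sketch, each available modality pair would contribute its own `stage_one_loss` term, so no sample ever needs to be observed in all modalities at once.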
📝 Abstract
Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, which requires substantial human effort to construct multi-way aligned datasets and thereby limits scalability across domains. In this work, we explore training MLLMs using only multiple paired modalities as a surrogate for the full joint multimodal distribution. We first provide a theoretical analysis of the conditions under which the representations are identifiable when only pairwise modalities are observed. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. In the first stage, we learn the shared latent space across modalities through both self-modal reconstruction and pairwise contrastive learning. We also incorporate an inductive bias into the contrastive learning process through partial alignment and minimal latent specification. In the second stage, we integrate the encoders of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by adding 3D point clouds and tactile modalities to pre-trained MLLMs using three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.
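The second stage, cross-modal recomposition, pairs the encoder of a new modality with the decoder of an existing one. The toy linear sketch below illustrates only that composition. Everything in it is an assumption for illustration: the synthetic data, the linear encoders and decoders, and the least-squares fit, which stands in for the alignment stage and is not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, dim_a, dim_b, n = 4, 6, 5, 64

# Toy paired data: both modalities are linear views of one shared latent.
z = rng.normal(size=(n, latent_dim))
mix_a = rng.normal(size=(latent_dim, dim_a))
mix_b = rng.normal(size=(latent_dim, dim_b))
x_a, x_b = z @ mix_a, z @ mix_b

# Modality A plays the "pre-trained" role: a linear encoder/decoder pair.
enc_a = np.linalg.pinv(mix_a)  # maps A observations to latents
dec_a = mix_a                  # maps latents back to A observations

# Modality B is new: fit its encoder so that its latents match A's latents
# on the paired samples (a least-squares stand-in for latent alignment).
target_latents = x_a @ enc_a
enc_b, *_ = np.linalg.lstsq(x_b, target_latents, rcond=None)

def transfer_b_to_a(x_b_batch):
    # Recomposition: encoder of the new modality, decoder of the old one.
    return (x_b_batch @ enc_b) @ dec_a

x_a_pred = transfer_b_to_a(x_b)
transfer_error = float(np.mean((x_a_pred - x_a) ** 2))
```

In this idealized linear setting the transfer is essentially exact because both modalities share the same latent; real encoders and decoders are nonlinear and the alignment is only approximate.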
Problem

Research questions and friction points this paper is trying to address.

multimodal LLMs
pairwise modalities
scalability
aligned datasets
representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

pairwise modalities
latent representation alignment
multimodal LLMs
contrastive learning
cross-modal recomposition