Scaling Language-Centric Omnimodal Representation Learning

📅 2025-10-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Multimodal large language models (MLLMs) fine-tuned with contrastive learning achieve strong multimodal embedding performance, but the reasons for their advantage remain poorly understood. Method: This paper proposes LCO-Emb, a language-centric omnimodal embedding framework that exploits the implicit cross-modal alignment that emerges during MLLM generative pretraining and refines it with lightweight contrastive fine-tuning. Contribution/Results: (1) The paper identifies the Generation-Representation Scaling Law (GRSL) and gives a theoretical explanation that links an MLLM's generative quality to an upper bound on its representation performance; (2) it argues that improving generative capability is an effective way to improve representation quality; (3) it analyzes the structure of MLLM representations through anisotropy and kernel similarity, confirming that latent cross-modal alignment is present before contrastive learning. LCO-Emb achieves state-of-the-art performance across diverse backbones and benchmarks, and continual generative pretraining before contrastive learning further improves results on a challenging, low-resource visual-document retrieval task.

📝 Abstract

Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
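The abstract names anisotropy and kernel similarity as its two diagnostics without defining them. As a rough sketch only, the snippet below uses two common definitions that are assumptions here, not the paper's confirmed metrics: anisotropy as the mean pairwise cosine similarity of a set of embeddings, and kernel similarity as linear centered kernel alignment (CKA) between two embedding matrices computed for the same items. The function names `anisotropy` and `linear_cka` are hypothetical.

```python
import numpy as np


def anisotropy(emb):
    """Mean cosine similarity over all distinct pairs of rows in an (n, d) array.

    Values near 1 mean the embeddings crowd into a narrow cone;
    values near 0 mean they are spread out in direction.
    (Assumed definition, chosen for illustration.)
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = sim.shape[0]
    # Exclude the diagonal, which is always 1 (each row against itself).
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))


def linear_cka(x, y):
    """Linear CKA between two (n, d_x) and (n, d_y) embedding matrices.

    Rows of x and y must describe the same n items, for example the same
    examples embedded from two different modalities.
    (Assumed stand-in for the "kernel similarity" mentioned in the abstract.)
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))
```

Under these definitions, a lower anisotropy and a higher cross-modal CKA would both be read as signs of a better-structured shared space; linear CKA is unchanged by rotating or uniformly rescaling either set of embeddings, which makes it usable across encoders with different coordinate systems.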

Problem

Research questions and friction points this paper is trying to address.

Investigating implicit cross-modal alignment in MLLMs

Proposing a Language-Centric Omnimodal Embedding framework

Establishing Generation-Representation Scaling Law for embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-centric framework for multimodal embedding learning

Contrastive learning refines generative pretraining alignment

Scaling law links generative and representation capabilities
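The contrastive refinement step in the list above is not specified on this page. A minimal sketch, assuming an InfoNCE-style objective with in-batch negatives (a common choice for embedding fine-tuning, not confirmed as the paper's loss), might look like the following. The function name `info_nce` and the temperature value are hypothetical.

```python
import numpy as np


def info_nce(queries, targets, temperature=0.05):
    """InfoNCE loss with in-batch negatives.

    queries, targets: (n, d) arrays where row i of each forms a positive pair
    and every other row in the batch acts as a negative.
    temperature: illustrative value, not taken from the paper.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = q @ t.T / temperature
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for query i sits on the diagonal.
    return -np.mean(np.diag(log_probs))
```

If the pretrained MLLM already places matching items close together, this loss starts low and only small updates are needed, which is one way to read the claim that contrastive learning acts as a lightweight refinement.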
