🤖 AI Summary
This work investigates whether modality extension can yield truly universal omni-modal language models (OLMs), addressing three core questions: (1) Does modality extension degrade foundational language capabilities? (2) Can model merging effectively integrate unimodal expert models? (3) Is joint modality extension superior to sequential extension? Through modality-extension fine-tuning, multimodal model merging, cross-modal zero-shot transfer, and capability-disentanglement analysis, the study systematically identifies three fundamental limitations of modality extension, empirically demonstrating that (1) modality extension significantly impairs linguistic competence; (2) model merging improves cross-modal generalization but falls short of true omni-modality; and (3) sequential extension outperforms joint extension in knowledge sharing and generalization. The work positions model merging as a candidate paradigm for achieving omni-modality and provides empirical evidence and methodological insights for building robust, scalable OLMs.
📝 Abstract
Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities, such as text, images, video, and audio, while maintaining strong language capabilities. Despite recent advances, existing models, especially open-source ones, remain far from true omni-modality: they struggle to generalize beyond the specific modality pairs they are trained on and to achieve strong performance on multi-modal inputs. We study the effect of modality extension, the dominant technique for training multimodal models, in which an off-the-shelf language model is fine-tuned on target-modality data alongside language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality with current approaches.
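To make the "model merging" question concrete, a minimal sketch of what weight-space merging typically means in this setting: element-wise interpolation between two modality-specific fine-tunes of the same base model. This is a common baseline for merging, not the paper's specific method; the checkpoint names and toy values below are illustrative assumptions.

```python
# Minimal sketch of weight-space model merging via linear interpolation,
# a common baseline for combining modality-specific fine-tunes of one base LM.
# Checkpoints are plain dicts of parameter lists for clarity; real models
# would hold framework tensors with matching names and shapes.

def merge_checkpoints(ckpt_a, ckpt_b, alpha=0.5):
    """Element-wise interpolation: alpha * A + (1 - alpha) * B."""
    merged = {}
    for name in ckpt_a:
        merged[name] = [alpha * a + (1 - alpha) * b
                        for a, b in zip(ckpt_a[name], ckpt_b[name])]
    return merged

# Hypothetical toy checkpoints (e.g., a vision fine-tune and an audio
# fine-tune of the same base model, so parameters align one-to-one).
vision_ft = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0, 0.0]}
audio_ft = {"layer0.weight": [3.0, 4.0], "layer0.bias": [2.0, 2.0]}

omni = merge_checkpoints(vision_ft, audio_ft, alpha=0.5)
# omni["layer0.weight"] → [2.0, 3.0]
```

Such merging requires no joint training data, which is exactly why the paper asks whether it can substitute for training a single model on all modalities at once.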