Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates modality preference, a long-overlooked phenomenon in native omni-modal large language models (OLLMs), and challenges the assumption that the text dominance seen in conventional vision-language models carries over to them. Using a conflict-based multimodal benchmark, a modality selection rate metric, and layer-wise representation probing, the work finds that most of the ten OLLMs evaluated show a pronounced visual preference, which emerges progressively in the mid-to-late network layers. Leveraging these internal representational signals, the authors propose a way to diagnose cross-modal hallucinations that requires no task-specific data. The method achieves competitive performance across three downstream multimodal benchmarks, offering both mechanistic insight and a practical tool for improving the reliability and interpretability of OLLMs.
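The summary does not spell out how the modality selection rate is computed. One plausible reading, offered here only as a minimal sketch, is the fraction of conflict samples on which the model's answer agrees with the answer implied by a given modality. The record layout (`prediction`, `answers`), the exact-match comparison, and the `other` bucket are assumptions made for illustration, not the authors' definition.

```python
from collections import Counter


def modality_selection_rates(records):
    """Return, per modality, the share of conflict samples whose prediction
    matches that modality's answer.

    Hypothetical record format (an assumption, not taken from the paper):
        {"prediction": str, "answers": {"text": str, "vision": str, ...}}
    Predictions matching no modality are counted under "other".
    """
    counts = Counter()
    total = 0
    for rec in records:
        total += 1
        pred = rec["prediction"].strip().lower()
        chosen = "other"
        # Exact string match after normalisation; a real benchmark would
        # likely need a more forgiving answer matcher.
        for modality, answer in rec["answers"].items():
            if pred == answer.strip().lower():
                chosen = modality
                break
        counts[chosen] += 1
    if total == 0:
        return {}
    return {modality: n / total for modality, n in counts.items()}
```

Under this reading, a rate well above the others for one modality would indicate that the model tends to follow that modality when the inputs disagree.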

📝 Abstract

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify the modality preference of OLLMs using a newly curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that this modality preference is not static but emerges progressively in the mid-to-late layers. Building on these insights, we leverage the internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
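To make the idea of layer-wise probing concrete, the sketch below fits one small classifier per layer and reports its held-out accuracy, so a rise in accuracy at deeper layers would correspond to a preference that "emerges progressively". It deliberately swaps in a nearest-centroid classifier to stay dependency-free; the abstract does not say which probe the authors use, and the input format (per-layer lists of `(vector, label)` pairs, with the label naming the modality the model ended up following) is a hypothetical choice for illustration.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train, test):
    """Fit a nearest-centroid probe on `train`, return accuracy on `test`.

    Both arguments are lists of (vector, label) pairs (assumed format).
    """
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec)
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}

    correct = 0
    for vec, label in test:
        # Predict the label whose centroid is closest to this vector.
        predicted = min(centroids, key=lambda lab: squared_distance(vec, centroids[lab]))
        correct += predicted == label
    return correct / len(test)


def layerwise_probe(layers_train, layers_test):
    """Return one probe accuracy per layer, in layer order."""
    return [probe_accuracy(tr, te) for tr, te in zip(layers_train, layers_test)]
```

In practice the vectors would be hidden states extracted from each layer of an OLLM on conflict samples, and a linear or logistic-regression probe would be a more common choice than the centroid rule used here.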

Problem

Research questions and friction points this paper is trying to address.

modality preference

omni-modal large language models

visual preference

cross-modal hallucinations

unified representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

modality preference

omni-modal large language models

cross-modal hallucination