Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

Existing Omni-MLLMs suffer from perceptual fragility due to their static fusion architectures, often underperforming single-modality baselines in multimodal joint reasoning. This work proposes the Chain of Modality (CoM) framework, which for the first time enables dynamic switching of multimodal fusion topologies, adaptively selecting among parallel, sequential, or interleaved input structures based on task demands. CoM incorporates dual cognitive pathways—intuitive and deliberative decision-making—to better align model behavior with task requirements. Requiring either no training or only data-efficient supervised fine-tuning, the method leverages dynamic routing and attention topology modulation to consistently and significantly outperform existing static fusion approaches across multiple benchmarks.

Technology Category

Application Category

📝 Abstract

Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined ``Direct-Decide'' path for direct perception and a structured ``Reason-Decide'' path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.

Problem

Research questions and friction points this paper is trying to address.

Omni-modal Large Language Models

multimodal fusion

static fusion

perceptual fragility

attention distortion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain of Modality

Dynamic Orchestration

Multimodal Fusion