Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant performance volatility in cross-modal reasoning, with conflicting conclusions in prior work regarding the impact of additional modalities—primarily due to the absence of controlled evaluation frameworks and mechanistic analysis. Method: We propose the first logic-driven multimodal reasoning evaluation framework, systematically categorizing six modal interaction patterns. We conduct attention pattern analysis, modality identity recoverability assessment, and soft-attention ablation to dissect underlying mechanisms. Contribution/Results: We identify “task composition” and “fusion mechanism” as dual bottlenecks, demonstrating that integration design—not perceptual capability—is the fundamental performance limiter. Performance improves only when auxiliary modalities provide independent and sufficient reasoning paths. We characterize three systematic degradation patterns and validate the efficacy of composition-aware training. Our findings provide both theoretical foundations and practical guidelines for MLLM architecture design and training paradigms.

Technology Category

Application Category

📝 Abstract
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating inconsistent cross-modal reasoning performance in multimodal models
Identifying task-composition and fusion bottlenecks in multimodal integration
Analyzing when additional modalities help or harm logical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Logic-grounded framework categorizes multimodal reasoning patterns
Two-step prompting restores performance via recognize-then-reason
Softening early fusion attention improves reasoning integration
🔎 Similar Papers
No similar papers found.