🤖 AI Summary
While multimodal large language models (MLLMs) are advancing toward a unified “omni-modal” paradigm integrating vision, audio, and language, the relationship between unimodal capability and omni-modal performance remains poorly understood.
Method: We introduce MMAO-Bench, a high-quality, unified multimodal evaluation benchmark comprising 1,880 samples across 44 diverse tasks, featuring a novel multi-step, open-ended question design that rigorously assesses complex cross-modal reasoning.
Contribution/Results: Empirical analysis reveals a bottleneck effect in weak models, where omni-modal performance saturates despite unimodal improvements, in contrast to cross-modal synergistic gains in strong models. This provides the first quantitative validation of a nonlinear composition law governing the transfer from unimodal to omni-modal capability. MMAO-Bench and its analytical framework establish a new standard for quantifiable, fine-grained evaluation and the principled advancement of multimodal intelligence.
📝 Abstract
Multimodal Large Language Models (MLLMs) have been progressing from uni-modal understanding toward unifying the visual, audio, and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, and comprehensive evaluation is required to drive the evolution of omni models' intelligence. In this work, we propose a novel, high-quality, and diverse omni-model benchmark, MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1,880 human-curated samples across 44 task types, together with an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results reveal a compositional law between cross-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect in weak models, while exhibiting synergistic promotion in strong models.
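To make the bottleneck-versus-synergy contrast concrete, below is a minimal toy sketch of what such a composition law could look like. It is not the paper's actual formulation: the `min` and product-bonus functional forms, the function names, and the example scores are all illustrative assumptions, chosen only to show how an omni-modal score can be capped by the weakest modality in one regime and amplified by uniformly strong modalities in another.

```python
import numpy as np

def bottleneck_composition(uni_scores: np.ndarray) -> float:
    """Hypothetical 'weak model' regime: the omni-modal score is
    limited by the weakest uni-modal capability (min composition)."""
    return float(np.min(uni_scores))

def synergistic_composition(uni_scores: np.ndarray, gamma: float = 0.2) -> float:
    """Hypothetical 'strong model' regime: the mean uni-modal score
    plus a synergy bonus that is large only when all modalities are strong."""
    mean = float(np.mean(uni_scores))
    synergy = gamma * float(np.prod(uni_scores))  # product is near zero if any modality is weak
    return min(1.0, mean + synergy)

# Uni-modal scores (vision, audio, language), normalized to [0, 1].
weak_model = np.array([0.70, 0.30, 0.65])    # one weak modality
strong_model = np.array([0.85, 0.80, 0.90])  # uniformly strong

print(bottleneck_composition(weak_model))     # 0.30: capped by the weak audio score
print(synergistic_composition(strong_model))  # ~0.97: above the 0.85 mean, cross-modal synergy
```

In this toy form, improving the weak model's vision or language score leaves its omni-modal score unchanged (the bottleneck effect), whereas for the strong model every uni-modal gain also enlarges the synergy term, mirroring the nonlinear behavior the benchmark is designed to measure.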