🤖 AI Summary
Existing benchmarks predominantly evaluate single-modal visual understanding, lacking systematic assessment of complex multimodal reasoning tasks that require joint visual–textual contextual integration. To address this gap, we propose MM-InstructEval, the first zero-shot evaluation framework specifically designed for multimodal reasoning. It encompasses 16 diverse datasets, 6 task categories, and 10 instruction templates, enabling cross-model evaluation of 45 models: 36 multimodal large language models (MLLMs) and 9 unimodal large language models (LLMs). We introduce four novel metrics, *Best Performance*, *Mean Relative Gain*, *Stability*, and *Adaptability*, which collectively uncover empirical patterns regarding how model architecture, instruction formatting, and their interaction influence multimodal reasoning capabilities. Furthermore, we open-source a standardized evaluation toolkit and an interactive, real-time leaderboard to foster benchmark standardization and reproducible progress in multimodal evaluation.
📝 Abstract
The emergence of multimodal large language models (MLLMs) has triggered extensive research in model evaluation. While existing evaluation studies primarily focus on unimodal (vision-only) comprehension and reasoning capabilities, they overlook critical assessments of complex multimodal reasoning tasks that require integrated understanding of both visual and textual contexts. Such multimodal tasks present unique challenges, demanding sophisticated reasoning across multiple modalities and deep comprehension of multimodal contexts. In this paper, we present MM-InstructEval, a comprehensive evaluation framework that incorporates diverse metrics to assess model performance across various multimodal reasoning tasks with vision-text contexts. We conduct extensive zero-shot evaluations on 45 models (including 36 MLLMs) across 16 multimodal datasets, encompassing 6 distinct tasks using 10 different instructions. Our framework introduces multiple innovative metrics, including the 'Best Performance' metric to benchmark peak model capabilities, the 'Mean Relative Gain' metric to assess overall efficacy across models and instructions, the 'Stability' metric to measure robustness, and the 'Adaptability' metric to quantify the compatibility between models and instructions. Through comprehensive evaluation and analysis, we uncover several significant insights about model architectures, instruction formats, and their interactions in multimodal reasoning tasks. Our findings establish new benchmarks for assessing the reasoning capabilities of MLLMs and provide strategic guidance for future developments. To facilitate continued research and evaluation in this field, we release our framework and resources at https://github.com/declare-lab/MM-InstructEval, with an interactive leaderboard available at MM-InstructEval Leaderboard (https://declare-lab.github.io/MM-InstructEval/).
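The abstract names the four metrics but does not spell out their formulas. The sketch below is a minimal, illustrative Python/NumPy example of how such metrics could be computed from a models × instructions score matrix; the score values, model names, and the exact formulas (particularly for Mean Relative Gain, Stability, and Adaptability) are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

# Hypothetical results matrix: rows = models, columns = instruction templates.
# Entries are accuracy scores (%) on a single dataset; values are illustrative.
scores = np.array([
    [62.1, 58.4, 60.7],   # model A
    [55.0, 57.2, 53.9],   # model B
    [48.3, 51.6, 49.8],   # model C
])

# Best Performance: a model's peak score over all instruction templates.
best_performance = scores.max(axis=1)

# Mean Relative Gain (one plausible reading): a model's average percentage
# improvement over the per-instruction mean across all models.
per_instruction_mean = scores.mean(axis=0)
mean_relative_gain = ((scores - per_instruction_mean) / per_instruction_mean).mean(axis=1) * 100

# Stability (one plausible reading): spread of a model's scores across
# instructions; a smaller value means less sensitivity to instruction format.
stability = scores.std(axis=1)

# Adaptability (one plausible reading): the fraction of instructions under
# which a model beats the cross-model average for that instruction.
adaptability = (scores > per_instruction_mean).mean(axis=1)

for i, name in enumerate(["model A", "model B", "model C"]):
    print(f"{name}: best={best_performance[i]:.1f}, "
          f"MRG={mean_relative_gain[i]:+.1f}%, "
          f"stability(std)={stability[i]:.2f}, "
          f"adaptability={adaptability[i]:.2f}")
```

For the authoritative metric definitions and the full evaluation pipeline, refer to the released toolkit at https://github.com/declare-lab/MM-InstructEval.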