MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

📅 2024-05-12
🏛️ Information Fusion
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks predominantly evaluate single-modal visual understanding, lacking systematic assessment of complex multimodal reasoning tasks that require joint visual–textual contextual integration. To address this gap, we propose MM-InstructEval, a zero-shot evaluation framework designed for multimodal reasoning over joint vision–text contexts. It encompasses 16 diverse datasets, six task categories, and ten instruction templates, enabling cross-model evaluation of 45 models, including 36 multimodal large language models (MLLMs) and 9 text-only large language models (LLMs). We introduce four novel metrics: *Best Performance*, *Mean Relative Gain*, *Stability*, and *Adaptability*, which collectively uncover empirical patterns in how model architecture, instruction format, and their interaction influence multimodal reasoning capabilities. Furthermore, we open-source a standardized evaluation toolkit and an interactive, real-time leaderboard to foster benchmark standardization and reproducible progress in multimodal evaluation.

📝 Abstract
The emergence of multimodal large language models (MLLMs) has triggered extensive research in model evaluation. While existing evaluation studies primarily focus on unimodal (vision-only) comprehension and reasoning capabilities, they overlook critical assessments of complex multimodal reasoning tasks that require integrated understanding of both visual and textual contexts. Such multimodal tasks present unique challenges, demanding sophisticated reasoning across multiple modalities and deep comprehension of multimodal contexts. In this paper, we present MM-InstructEval, a comprehensive evaluation framework that incorporates diverse metrics to assess model performance across various multimodal reasoning tasks with vision-text contexts. We conduct extensive zero-shot evaluations on 45 models (including 36 MLLMs) across 16 multimodal datasets, encompassing 6 distinct tasks using 10 different instructions. Our framework introduces multiple innovative metrics, including the 'Best Performance' metric to benchmark peak model capabilities, the 'Mean Relative Gain' metric to assess overall efficacy across models and instructions, the 'Stability' metric to measure robustness, and the 'Adaptability' metric to quantify the compatibility between models and instructions. Through comprehensive evaluation and analysis, we uncover several significant insights about model architectures, instruction formats, and their interactions in multimodal reasoning tasks. Our findings establish new benchmarks for assessing the reasoning capabilities of MLLMs and provide strategic guidance for future developments. To facilitate continued research and evaluation in this field, we release our framework and resources at https://github.com/declare-lab/MM-InstructEval, with an interactive leaderboard at https://declare-lab.github.io/MM-InstructEval/.
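
The paper's exact metric definitions live in the full text; the sketch below only illustrates how the four metric families might be computed from a per-dataset accuracy matrix of shape (models × instructions). All function names and the specific formulas here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Assumption: `acc` is a (num_models x num_instructions) accuracy matrix
# for one dataset. The formulas below are plausible readings of the four
# metrics named in the abstract, not the paper's official definitions.

def best_performance(acc):
    """Peak accuracy each model reaches under its best instruction."""
    return acc.max(axis=1)

def mean_relative_gain(acc):
    """Per-model gain relative to the across-model mean for each
    instruction, averaged over instructions (assumed formulation)."""
    mean_per_instruction = acc.mean(axis=0, keepdims=True)
    return ((acc - mean_per_instruction) / mean_per_instruction).mean(axis=1)

def stability(acc):
    """Robustness to instruction wording: a lower standard deviation
    across instructions means a more stable model."""
    return -acc.std(axis=1)  # higher (closer to 0) = more stable

def adaptability(acc):
    """Rough model-instruction compatibility: fraction of instructions
    under which a model beats its own mean accuracy."""
    return (acc > acc.mean(axis=1, keepdims=True)).mean(axis=1)

# Example: 3 models x 4 instructions
acc = np.array([[0.62, 0.58, 0.65, 0.60],
                [0.70, 0.40, 0.55, 0.50],
                [0.66, 0.64, 0.67, 0.65]])
print(best_performance(acc))
print(mean_relative_gain(acc))
print(stability(acc))
print(adaptability(acc))
```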
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks evaluate vision-only understanding and overlook complex multimodal reasoning that requires jointly interpreting visual and textual contexts
No systematic zero-shot comparison of MLLMs and text-only LLMs across diverse tasks and instruction formats
Missing metrics for robustness to instruction variation and for model-instruction compatibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

MM-InstructEval, a comprehensive evaluation framework for multimodal reasoning with vision-text contexts
Four novel metrics: Best Performance, Mean Relative Gain, Stability, and Adaptability
Large-scale zero-shot evaluation: 45 models (36 MLLMs) on 16 datasets covering 6 tasks and 10 instructions
👥 Authors
Xiaocui Yang
Lecturer, Northeastern University (China)
Multimodal Sentiment Analysis, Data Mining, Multimodal Large Language Models
Wenfang Wu
Northeastern University
Sentiment Analysis, Knowledge Graph
Shi Feng
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Ming Wang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Daling Wang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yang Li
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Qi Sun
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
Yifei Zhang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Xiaoming Fu
Professor of Computer Science, University of Goettingen
Networked Systems, Cloud Computing, Mobile and Edge Computing, Social Computing, Big Data
Soujanya Poria
Singapore University of Technology and Design, Singapore