MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

📅 2024-06-29
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing multimodal benchmarks suffer from systematic vision-irrelevant biases: vision-unaware large language models (LLMs) can achieve spuriously high scores, undermining the validity of the evaluation. To address this, the authors propose MMEvalPro, a benchmark built on a "perception question–knowledge anchor question–original question" triplet evaluation paradigm. It combines human-in-the-loop annotation, a three-stage (trilogy) evaluation pipeline, and original questions drawn from existing benchmarks (MMMU, ScienceQA, MathVista) to eliminate non-visual cues in multiple-choice questions (MCQs). MMEvalPro comprises 2,138 triplets (6,414 questions), two-thirds of which are annotated by human experts. Experiments show that the best large multimodal model (LMM) underperforms humans by 31.73% (vs. 8.03% on prior benchmarks), and the best LLM lags the best LMM by 23.09% (vs. 14.64% previously), substantially widening performance gaps, reducing Type-I errors, and improving both challenge level and assessment reliability.

📝 Abstract
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
Problem

Research questions and friction points this paper is trying to address.

Addressing systematic biases in multimodal benchmark evaluations
Ensuring trustworthy assessment of Large Multimodal Models' capabilities
Preventing Type-I errors through enhanced evaluation pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-annotated perception and knowledge anchor questions
Trilogy evaluation pipeline to prevent Type-I errors
Rigorous metrics for trustworthy multimodal benchmark evaluation
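The trilogy pipeline credits a model only when it answers all three questions in a triplet (original, perception, and knowledge anchor) correctly, so that lucky guesses on the original MCQ no longer count. A minimal sketch of such a triplet-level metric, using hypothetical field and function names, might look like:

```python
from dataclasses import dataclass

@dataclass
class TripletResult:
    """Model predictions for one question triplet (hypothetical field names)."""
    original_correct: bool    # original MCQ from the source benchmark
    perception_correct: bool  # human-annotated perception question
    knowledge_correct: bool   # human-annotated knowledge anchor question

def genuine_accuracy(results: list[TripletResult]) -> float:
    """Fraction of triplets where all three questions are answered correctly.

    A model gets credit only when it demonstrates perception and knowledge
    alongside the final answer, filtering out the vision-irrelevant
    shortcuts (Type-I errors) that inflate standard MCQ accuracy.
    """
    if not results:
        return 0.0
    passed = sum(
        r.original_correct and r.perception_correct and r.knowledge_correct
        for r in results
    )
    return passed / len(results)
```

Under this scoring, a model that is right on the original question but wrong on its perception check earns nothing for that triplet, which is what makes the metric stricter than per-question accuracy.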
Jinsheng Huang
Peking University
Multimodal Learning, Fintech

Liang Chen
National Key Laboratory for Multimedia Information Processing, Peking University

Taian Guo
Peking University
LLM for Finance, Time Series Forecasting, Quantitative Trading

Fu Zeng
Chinese Academy of Medical Sciences

Yusheng Zhao
National Key Laboratory for Multimedia Information Processing, Peking University

Bohan Wu
National Key Laboratory for Multimedia Information Processing, Peking University

Ye Yuan
National Key Laboratory for Multimedia Information Processing, Peking University

Haozhe Zhao
National Key Laboratory for Multimedia Information Processing, Peking University

Zhihui Guo
CUHK

Yichi Zhang
National Key Laboratory for Multimedia Information Processing, Peking University

Jingyang Yuan
Peking University
LLM, AI for Science

Wei Ju
National Key Laboratory for Multimedia Information Processing, Peking University

Luchen Liu
National Key Laboratory for Multimedia Information Processing, Peking University

Tianyu Liu
Alibaba Group

Baobao Chang
National Key Laboratory for Multimedia Information Processing, Peking University

Ming Zhang
National Key Laboratory for Multimedia Information Processing, Peking University