🤖 AI Summary
Existing MLLM evaluation methods suffer from high redundancy and low efficiency. To address this, we propose a novel multiple-interviewers-to-one-model, interview-style evaluation paradigm inspired by human recruitment interviews. The method comprises two stages, a pre-interview and a formal interview, and integrates dynamic interviewer-weight adjustment with adaptive question-difficulty selection, establishing an efficient and fair evaluation framework. Crucially, evaluation is modeled as a structured interactive process rather than static question sampling. Extensive experiments across multiple benchmarks show that the approach achieves strong correlation with full-coverage evaluation using only ~30% of the questions, improving the Pearson (PLCC) and Spearman (SRCC) correlation coefficients over random sampling by up to 17.6% and 16.7%, respectively. This work introduces a principled, interaction-driven paradigm for efficient and reliable MLLM assessment.
📝 Abstract
The rapid progress of Multi-Modal Large Language Models (MLLMs) has spurred the creation of numerous benchmarks. However, conventional full-coverage Question-Answering evaluations suffer from high redundancy and low efficiency. Inspired by human interview processes, we propose a multi-to-one interview paradigm for efficient MLLM evaluation. Our framework consists of (i) a two-stage interview strategy with pre-interview and formal interview phases, (ii) dynamic adjustment of interviewer weights to ensure fairness, and (iii) an adaptive mechanism for choosing question difficulty levels. Experiments on different benchmarks show that the proposed paradigm achieves significantly higher correlation with full-coverage results than random sampling, with improvements of up to 17.6% in PLCC and 16.7% in SRCC, while reducing the number of required questions. These findings demonstrate that the proposed paradigm provides a reliable and efficient alternative for large-scale MLLM benchmarking.
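To make the two-stage idea concrete, here is a very rough sketch of a pre-interview followed by an adaptive formal interview. Everything in it (`interview_evaluate`, the [0, 1] difficulty/ability scale, the fixed `step` update) is an illustrative assumption, not the paper's actual algorithm; interviewer-weight adjustment is omitted for brevity.

```python
import random

def interview_evaluate(model, questions, pre_n=5, formal_n=10, seed=0):
    """Toy two-stage interview evaluation (illustrative only).

    `questions` is a list of (difficulty, ask_fn) pairs, with difficulty
    in [0, 1]; ask_fn(model) returns 1 (correct) or 0 (wrong).
    Returns an ability estimate in [0, 1].
    """
    rng = random.Random(seed)
    pool = sorted(questions, key=lambda q: q[0])

    # Stage 1: pre-interview — a small random sample yields a rough
    # ability estimate (fraction of correct answers).
    pre = rng.sample(pool, pre_n)
    ability = sum(ask(model) for _, ask in pre) / pre_n

    # Stage 2: formal interview — repeatedly pick the unused question whose
    # difficulty is closest to the current ability estimate, then nudge the
    # estimate up on a correct answer and down on a wrong one.
    remaining = [q for q in pool if q not in pre]
    step = 0.1
    for _ in range(formal_n):
        if not remaining:
            break
        d, ask = min(remaining, key=lambda q: abs(q[0] - ability))
        remaining.remove((d, ask))
        if ask(model):
            ability = min(1.0, ability + step * (1 - d))
        else:
            ability = max(0.0, ability - step * d)
    return ability
```

In this toy setup, probing near the running ability estimate is what lets the interview converge with far fewer questions than full coverage, which is the intuition behind the reported correlation gains over random sampling.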