A Multi-To-One Interview Paradigm for Efficient MLLM Evaluation

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MLLM evaluation methods suffer from high redundancy and low efficiency. To address this, we propose a novel “one-interviewer–multiple-models” interview-style evaluation paradigm, inspired by human recruitment interviews. Our method comprises two stages—pre-interview and formal interview—and integrates dynamic judge-weight adjustment with adaptive question difficulty selection, thereby establishing an efficient and fair evaluation framework. Crucially, we model evaluation as a structured interactive process rather than static question sampling. Extensive experiments across multiple benchmarks demonstrate that our approach achieves strong correlation with full-scale evaluation using only ~30% of the questions: Pearson and Spearman correlation coefficients improve by up to 17.6% and 16.7% over random sampling. This work introduces a principled, interaction-driven paradigm for efficient and reliable MLLM assessment.
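The reliability claim above rests on two standard agreement metrics: PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation) between subset-based scores and full-coverage scores. A minimal pure-Python illustration of how these are computed, using hypothetical model scores (the numbers are made up for illustration, not from the paper):

```python
# PLCC (Pearson) and SRCC (Spearman) between two score lists,
# e.g. full-coverage accuracy vs. an estimate from a ~30% subset.
# The scores below are hypothetical, for illustration only.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman = Pearson computed on the ranks (no ties in this toy data).
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]
    return pearson(rank(x), rank(y))

full_scores   = [0.82, 0.74, 0.61, 0.55, 0.48, 0.39]  # full-coverage accuracy
subset_scores = [0.80, 0.76, 0.60, 0.52, 0.50, 0.37]  # subset-based estimate

print(round(pearson(full_scores, subset_scores), 3))
print(round(spearman(full_scores, subset_scores), 3))
```

Here the subset preserves the models' ranking exactly, so SRCC is 1.0 while PLCC stays just below 1.0; the paper's contribution is selecting the subset so that both stay high.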

📝 Abstract
The rapid progress of Multi-Modal Large Language Models (MLLMs) has spurred the creation of numerous benchmarks. However, conventional full-coverage Question-Answering evaluations suffer from high redundancy and low efficiency. Inspired by human interview processes, we propose a multi-to-one interview paradigm for efficient MLLM evaluation. Our framework consists of (i) a two-stage interview strategy with pre-interview and formal interview phases, (ii) dynamic adjustment of interviewer weights to ensure fairness, and (iii) an adaptive mechanism for question difficulty selection. Experiments on different benchmarks show that the proposed paradigm achieves significantly higher correlation with full-coverage results than random sampling, with improvements of up to 17.6% in PLCC and 16.7% in SRCC, while reducing the number of required questions. These findings demonstrate that the proposed paradigm provides a reliable and efficient alternative for large-scale MLLM benchmarking.
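The abstract's three components (two-stage interview, dynamic interviewer weights, adaptive difficulty) can be wired together as a simulation sketch. All names, constants, and update rules below are illustrative assumptions, since the paper's exact formulas are not given here; the model and judges are simulated stochastically:

```python
import random

random.seed(0)

# Hypothetical setup: a question bank with difficulty in [0, 1], a candidate
# model whose latent ability decides how often it answers correctly, and
# judges whose weighted votes grade each answer. Every rule here is an
# illustrative assumption, not the paper's actual method.
QUESTIONS = [{"id": i, "difficulty": random.random()} for i in range(100)]

def answer(ability, question):
    """Simulated model: more likely correct when able and the question is easy."""
    return random.random() < ability * (1.0 - 0.5 * question["difficulty"])

def judge_vote(correct, reliability):
    """Simulated judge: agrees with the true outcome with prob. `reliability`."""
    return correct if random.random() < reliability else not correct

def interview(ability, judges, pre_n=5, formal_n=25):
    weights = {name: 1.0 for name in judges}  # dynamic interviewer weights
    # Stage 1: pre-interview on a small random sample to estimate ability.
    pre = random.sample(QUESTIONS, pre_n)
    est = sum(answer(ability, q) for q in pre) / pre_n

    score = 0.0
    for _ in range(formal_n):
        # Stage 2: adaptive difficulty -- pick a question near the estimate.
        q = min(random.sample(QUESTIONS, 10),
                key=lambda q: abs(q["difficulty"] - est))
        correct = answer(ability, q)
        votes = {name: judge_vote(correct, r) for name, r in judges.items()}
        consensus = sum(votes.values()) > len(votes) / 2
        # Fairness: judges deviating from the consensus lose weight.
        for name, v in votes.items():
            weights[name] *= 1.05 if v == consensus else 0.9
        score += consensus
        est = 0.7 * est + 0.3 * consensus  # refine the ability estimate
    return score / formal_n, weights

acc, w = interview(ability=0.8,
                   judges={"judge_a": 0.9, "judge_b": 0.85, "judge_c": 0.55})
print(acc, w)
```

The key design idea this sketch captures is that question selection and judge trust both evolve during the interview, rather than being fixed upfront as in static sampling.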
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficiency in MLLM evaluation benchmarks
Reduces redundancy in multimodal model assessment
Proposes interview paradigm for scalable performance testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage interview strategy for efficiency
Dynamic interviewer weight adjustment for fairness
Adaptive question difficulty selection mechanism