🤖 AI Summary
Evaluations of visual perception in multimodal large language models (MLLMs) are currently incomparable and incomplete because existing benchmarks are fragmented, each with its own question formats, domains, and metrics. This work introduces AbilityLens, the first capability-driven, multidimensional, and stability-aware benchmark for systematically assessing both accuracy and stability across six core visual perception abilities. Methodologically, we propose capability-decoupled perception evaluation and an online dynamic assessment mechanism, which together uncover capability conflicts and early-convergence phenomena during training. Based on these findings, we design a lightweight optimization strategy, capability-specific checkpoint fusion, that merges the strongest per-ability checkpoints. Experiments reveal a substantial gap in perception stability between open- and closed-source MLLMs and show that the fusion strategy effectively mitigates the performance degradation caused by capability conflicts. The AbilityLens benchmark and its real-time leaderboard will be publicly released.
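The summary does not spell out how per-ability accuracy and stability are computed from heterogeneous sub-benchmarks. The sketch below is an illustration only: it assumes accuracy is the mean normalized score across an ability's sub-benchmarks and stability penalizes the spread across them; both formulas, along with the sub-benchmark names, are assumptions rather than the paper's actual definitions.

```python
# Illustrative only: one plausible way to aggregate a per-ability
# (accuracy, stability) pair from several sub-benchmarks. The paper's
# actual formulas are not given in this summary; these are assumptions.
from statistics import mean, pstdev

def ability_report(sub_scores: dict[str, float]) -> tuple[float, float]:
    """sub_scores maps sub-benchmark name -> normalized score in [0, 1]."""
    scores = list(sub_scores.values())
    accuracy = mean(scores)           # central tendency across formats/domains
    stability = 1.0 - pstdev(scores)  # hypothetical: higher = more consistent
    return accuracy, stability

# Hypothetical sub-benchmarks for a single ability, e.g. "counting",
# evaluated under three different question formats.
acc, stab = ability_report({"mcq": 0.71, "open_ended": 0.64, "grounded_count": 0.69})
print(f"accuracy={acc:.3f}, stability={stab:.3f}")
```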
📝 Abstract
As multimodal large language models (MLLMs) advance rapidly, rigorous evaluation has become essential to guide their development. In this work, we focus on a unified and robust evaluation of vision perception abilities, the foundational skill of MLLMs. We find that existing perception benchmarks, each focusing on different question types, domains, and evaluation metrics, introduce significant evaluation variance, complicating comprehensive assessment of perception abilities through any single benchmark. To address this, we introduce AbilityLens, a unified benchmark designed to evaluate MLLMs across six key perception abilities, focusing on both accuracy and stability, with each ability encompassing diverse question types, domains, and metrics. With the assistance of AbilityLens, we: (1) identify the strengths and weaknesses of current models, highlighting stability patterns and revealing a notable performance gap between open-source and closed-source models; (2) introduce an online evaluation mode, which uncovers ability-conflict and early-convergence phenomena during MLLM training; and (3) design a simple ability-specific model merging method that combines the best checkpoint for each ability from early training stages, effectively mitigating the performance decline caused by ability conflict. The benchmark and online leaderboard will be released soon.
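Point (3) lends itself to a short sketch. The following is a minimal, hypothetical implementation in PyTorch, assuming the merging reduces to parameter-averaging the checkpoint that scores best on each ability; the ability names and the score bookkeeping are placeholders, not the authors' actual code or data.

```python
# A minimal sketch of ability-specific checkpoint merging, assuming it
# reduces to averaging the parameters of the per-ability best checkpoints.
# ABILITIES and the scores structure are hypothetical placeholders.
from collections import OrderedDict
import torch

ABILITIES = ["counting", "ocr", "grounding", "relation", "attribute", "scene"]

def merge_best_ability_checkpoints(checkpoints, scores):
    """checkpoints: list of model state_dicts saved during training.
    scores: {ability: [score_at_ckpt_0, score_at_ckpt_1, ...]}.
    Returns a state_dict averaging, per parameter, each ability's best
    checkpoint; a checkpoint picked by several abilities gets more weight."""
    picked = [checkpoints[max(range(len(checkpoints)), key=lambda i: scores[a][i])]
              for a in ABILITIES]
    merged = OrderedDict()
    for name, ref in picked[0].items():
        if ref.is_floating_point():
            merged[name] = sum(sd[name] for sd in picked) / len(picked)
        else:  # integer buffers (e.g., step counters) are copied, not averaged
            merged[name] = ref.clone()
    return merged
```

Averaging over the per-ability picks, rather than over unique checkpoints, means a checkpoint selected by several abilities contributes proportionally more weight; whether the paper weights checkpoints this way, or uses a more elaborate fusion rule, is not stated in the abstract.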