🤖 AI Summary
Current LMM evaluation faces a trilemma: achieving broad task coverage, low cost, and zero contamination simultaneously remains infeasible. To address this, we propose LMMS-EVAL, the first unified, standardized large-scale multimodal model evaluation framework, covering 50+ diverse tasks and 10+ models. We further introduce LMMS-EVAL LITE, a lightweight variant for accelerated evaluation, and Multimodal LIVEBENCH, the first dynamically updated benchmark, which leverages real-time news and community data streams to enable zero-human-annotation, weekly-updated assessment on real-world scenarios. The framework features a modular architecture, automated task orchestration, contamination detection, and cross-model inference adapters. Its open-source implementation and live leaderboard have already been adopted by 20+ institutions. LIVEBENCH reduces evaluation cost by 76% and drives contamination rates toward zero, advancing a reproducible, sustainable, and application-aligned LMM evaluation paradigm.
📝 Abstract
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH, which utilizes continuously updating news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase and maintain the LIVEBENCH leaderboard at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.