🤖 AI Summary
Current multimodal large language model (MLLM) evaluations rely heavily on static benchmarks, which are vulnerable to data contamination and performance saturation, leading to distorted and unreliable assessments. To address these limitations, we propose Knowledge-Enhanced Benchmark Evolution (KBE), a dynamic evaluation framework that introduces graph-structured knowledge modeling for the first time in multimodal assessment, explicitly capturing visual–textual semantic associations. KBE supports dynamic question generation via vision-based re-sampling and external knowledge injection, enabling controllable difficulty adjustment and continuous benchmark evolution. By transcending static benchmark constraints, KBE significantly mitigates data contamination risks, delays benchmark saturation, and enhances fine-grained capability discrimination across MLLMs. Extensive experiments demonstrate that KBE establishes a more reliable, comprehensive, and sustainably evolving evaluation paradigm for multimodal foundation models.
📝 Abstract
The rapid progress of multimodal large language models (MLLMs) calls for more reliable evaluation protocols. Existing static benchmarks suffer from the risks of data contamination and saturation, leading to inflated or misleading performance estimates. To address these issues, we first introduce a graph formulation that represents both static and dynamic VQA samples. Building on this formulation, we propose Knowledge-Enhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework. KBE first analyzes the original static benchmark and then expands it by integrating multimodal knowledge, transforming the static benchmark into a controllable, dynamically evolving version. Crucially, KBE can both reconstruct questions by re-selecting visual information from the original image and expand existing questions with external textual knowledge. It enables difficulty-controllable evaluation by adjusting the degree of question exploration. Extensive experiments demonstrate that KBE alleviates the risks of data contamination and data saturation, and provides a more comprehensive assessment of MLLM capabilities.
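To make the graph formulation more concrete, below is a minimal, hypothetical Python sketch of how a VQA sample could be modeled as a graph whose nodes are visual entities or external textual knowledge, and where the walk depth stands in for the "degree of question exploration." The class names and the `expand` routine are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a graph-formulated VQA sample (not the paper's code).
# Nodes are visual entities or textual knowledge concepts; edges carry semantic
# relations. "Evolving" a question means walking further from the seed node,
# so walk depth acts as a rough difficulty knob.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    modality: str  # "visual" (grounded in the image) or "textual" (external knowledge)

@dataclass
class VQAGraph:
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: dict = field(default_factory=dict)  # name -> list of (relation, neighbor name)

    def add(self, src: Node, relation: str, dst: Node) -> None:
        for n in (src, dst):
            self.nodes.setdefault(n.name, n)
        self.edges.setdefault(src.name, []).append((relation, dst.name))

    def expand(self, seed: str, depth: int) -> list:
        """Collect relation triples reachable from `seed` within `depth` hops.

        A larger depth pulls in more external knowledge, yielding a harder,
        more compositional question; depth=0 reproduces only the seed node.
        """
        frontier, triples, seen = [seed], [], {seed}
        for _ in range(depth):
            next_frontier = []
            for name in frontier:
                for relation, neighbor in self.edges.get(name, []):
                    triples.append((name, relation, neighbor))
                    if neighbor not in seen:
                        seen.add(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return triples

# Toy example: an image of a red double-decker bus, enriched with external knowledge.
g = VQAGraph()
bus = Node("bus", "visual")
g.add(bus, "has_color", Node("red", "visual"))
g.add(bus, "located_in", Node("London", "textual"))
g.add(Node("London", "textual"), "capital_of", Node("United Kingdom", "textual"))

print(g.expand("bus", depth=1))  # shallow: question stays on directly visible attributes
print(g.expand("bus", depth=2))  # deeper: question now requires external textual knowledge
```

In this toy setting, generating a question from the depth-1 subgraph only needs visual re-selection (e.g., the bus's color), while the depth-2 subgraph forces the question to combine the image with injected knowledge, which is the intuition behind difficulty-controllable evolution.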