MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams

📅 2025-08-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MLLM evaluation benchmarks suffer from limited scale, narrow disciplinary coverage, and fragmented knowledge structures, hindering fine-grained and dynamic assessment. To address this, we propose MDK12-Bench, the first large-scale, multidisciplinary benchmark constructed from authentic K-12 examinations, encompassing six subjects, 141K questions, and 6,225 granular knowledge points. We design a dynamic evaluation framework that introduces three novel perturbation types (visual, textual, and question-format variations) and establish the first six-layer systematic knowledge taxonomy. Additionally, we propose KP-RAG, a knowledge-point reference-augmented generation method for interpretable reasoning analysis. Experiments expose critical bottlenecks in current MLLMs, including cross-year generalization, contextual robustness, and knowledge-driven reasoning, providing empirical foundations and methodological support for enhancing model interpretability, improving robustness, and advancing AI-enabled educational applications.

📝 Abstract
Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines, with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation of MLLM performance across four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel dynamic evaluation framework that introduces unfamiliar visual, textual, and question-format shifts to challenge model generalization while improving benchmark objectivity and longevity by mitigating data contamination. We further evaluate knowledge-point reference-augmented generation (KP-RAG) to examine the role of knowledge in problem-solving. Key findings reveal limitations of current MLLMs in multiple aspects and provide guidance for enhancing model robustness, interpretability, and AI-assisted education.
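As a rough illustration of the dynamic evaluation idea described in the abstract, each exam item can be perturbed along one of three axes (visual, textual, or question format) before being shown to the model, so that memorized answers no longer transfer. The helper names and the trivial transformations below are hypothetical placeholders, not the paper's actual implementation:

```python
# Sketch of perturbation-based dynamic evaluation (illustrative only).
# Each item is a dict with a question stem, answer choices, and an image path.

def textual_shift(question: str) -> str:
    """Paraphrase the question stem (placeholder rewrite)."""
    return "Consider the following problem: " + question

def format_shift(question: str) -> str:
    """Convert a multiple-choice item into an open-ended item."""
    return question + " Answer directly without choosing from options."

def visual_shift(image_path: str) -> str:
    """Stand-in for an image transformation (e.g., a restyled figure)."""
    return image_path.replace(".png", "_restyled.png")

def perturb(item: dict, mode: str) -> dict:
    out = dict(item)
    if mode == "textual":
        out["question"] = textual_shift(item["question"])
    elif mode == "format":
        out["question"] = format_shift(item["question"])
        out["choices"] = []  # open-ended: options removed
    elif mode == "visual":
        out["image"] = visual_shift(item["image"])
    return out

item = {"question": "Which force keeps planets in orbit?",
        "choices": ["A. Friction", "B. Gravity"],
        "image": "orbit_diagram.png"}
print(perturb(item, "format")["choices"])  # -> []
```

Scoring the same items before and after such shifts would separate genuine generalization from contamination-driven recall, which is the motivation the abstract gives for the dynamic framework.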
Problem

Research questions and friction points this paper is trying to address.

Evaluates MLLMs comprehensively on multidisciplinary K-12 exams
Addresses the limited scale, coverage, and knowledge structure of current MLLM benchmarks
Proposes a dynamic framework to test model generalization and knowledge-driven reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic evaluation framework with unfamiliar visual, textual, and question-format shifts
Large-scale multidisciplinary benchmark MDK12-Bench
Knowledge-point reference-augmented generation (KP-RAG)
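The last contribution, KP-RAG, can be sketched as retrieving taxonomy knowledge points relevant to a question and prepending them to the model prompt. The retrieval strategy (simple word overlap) and all names below are illustrative assumptions, not the paper's method:

```python
# Minimal KP-RAG-style sketch (illustrative): rank knowledge points by
# word overlap with the question, then build an augmented prompt.

def retrieve_knowledge_points(question, kp_index, top_k=2):
    """Return the top_k knowledge points sharing the most words with the question."""
    q_words = set(question.lower().split())
    return sorted(kp_index,
                  key=lambda kp: len(q_words & set(kp.lower().split())),
                  reverse=True)[:top_k]

def build_prompt(question, kp_index):
    """Prepend retrieved knowledge points to the question as reasoning context."""
    kps = retrieve_knowledge_points(question, kp_index)
    context = "\n".join(f"- {kp}" for kp in kps)
    return (f"Relevant knowledge points:\n{context}\n\n"
            f"Question: {question}\nAnswer step by step.")

kp_index = [
    "Newton's law of universal gravitation",
    "Photosynthesis converts light energy to chemical energy",
    "Quadratic equations and the discriminant",
]
prompt = build_prompt("Why does gravitation keep the Moon in orbit?", kp_index)
print(prompt.splitlines()[1])  # -> - Newton's law of universal gravitation
```

In the paper's setting, comparing model answers with and without the retrieved knowledge points is what makes the reasoning analysis interpretable; a real system would use embedding-based retrieval over the 6,225-point taxonomy rather than word overlap.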