MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

📅 2025-04-08
🤖 AI Summary
Existing multimodal large language model (MLLM) evaluation benchmarks suffer from limited data scale, narrow disciplinary coverage, and coarse-grained knowledge structuring. Method: We introduce the first K–12 examination–grounded, multidisciplinary multimodal reasoning benchmark, covering mathematics, physics, chemistry, biology, geography, and information science, with 140K real-world exam items. It features fine-grained knowledge-point annotations, hierarchical difficulty labels, and cross-grade/cross-year splits. We propose an education-informed structured evaluation framework, a knowledge-graph–driven annotation schema, a dynamic templated assessment pipeline, and a prompt-guided bootstrapping strategy—leveraging question types and image styles—to mitigate data contamination. Contribution/Results: Extensive experiments expose critical weaknesses of current MLLMs in cross-disciplinary vision–language reasoning. The benchmark enables reproducible, attributable evaluation and provides concrete, actionable directions for improvement.

📝 Abstract
Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited data size, narrow domain coverage, and unstructured knowledge distribution. To close these gaps, we introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Spanning six disciplines (math, physics, chemistry, biology, geography, and information science), our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions, providing a robust platform for comprehensive evaluation. Additionally, we present a novel dynamic evaluation framework that mitigates data contamination by bootstrapping question forms, question types, and image styles during evaluation. Extensive experiments on MDK12-Bench reveal significant limitations of current MLLMs in multimodal reasoning. The findings on our benchmark provide insights into the development of next-generation models. Our data and code are available at https://github.com/LanceZPF/MDK12.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal reasoning in MLLMs lacks adequate benchmarks
Existing benchmarks have limited size, coverage, and knowledge structure
MDK12-Bench addresses gaps with K-12 exam-based multidisciplinary evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-disciplinary K-12 benchmark for MLLMs
Dynamic evaluation framework prevents data contamination
140K reasoning instances with structured annotations
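The dynamic evaluation idea above can be illustrated with a small sketch. The function below is a hypothetical, minimal example of option-level bootstrapping for a multiple-choice item: it shuffles the answer options and remaps the gold label, so a model that memorized the original item's answer letter gains nothing. It is not the paper's pipeline, which additionally rewrites question forms and restyles images.

```python
import random

def bootstrap_item(question, options, answer_key, seed=None):
    """Produce a surface-level variant of a multiple-choice item.

    A minimal illustration of contamination-resistant bootstrapping:
    shuffle the options, then recompute which letter now marks the
    correct answer. `answer_key` is the gold letter ("A", "B", ...).
    """
    rng = random.Random(seed)
    letters = [chr(ord("A") + i) for i in range(len(options))]
    gold_text = options[ord(answer_key) - ord("A")]  # correct option text
    shuffled = options[:]
    rng.shuffle(shuffled)
    new_key = letters[shuffled.index(gold_text)]     # remapped gold letter
    stem = question + "\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip(letters, shuffled)
    )
    return stem, new_key

stem, key = bootstrap_item("2 + 2 = ?", ["3", "4", "5", "6"], "B", seed=0)
```

Because the correct letter changes from run to run, scoring must compare against the remapped key rather than the original one; the same principle extends to templated rephrasings of the question stem.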
👥 Authors

Pengfei Zhou — Shanghai AI Laboratory
Fanrui Zhang — Shanghai Innovation Institute, USTC
Xiaopeng Peng — RIT
Zhaopan Xu — HIT, Shanghai AI Laboratory
Jiaxin Ai — WHU, Shanghai Innovation Institute
Yansheng Qiu — Wuhan University (Missing Data Analysis)
Chuanhao Li — Shanghai AI Laboratory
Zhen Li — Shanghai AI Laboratory
Ming Li — Shanghai AI Laboratory
Yukang Feng — Shanghai Innovation Institute
Jianwen Sun — Software Engineering Application Technology Lab, Huawei, China (Software Engineering, Deep Reinforcement Learning)
Haoquan Zhang — SphereLab, CUHK (MLLM)
Zizhen Li — Shanghai Innovation Institute
Xiaofeng Mao — Alibaba Group (Computer Vision, Adversarial Machine Learning)
Wangbo Zhao — National University of Singapore (Efficient Deep Learning, Dynamic Neural Networks, Multimodal Models)
Kai Wang — NUS
Xiaojun Chang — USTC, MBZUAI
Wenqi Shao — Shanghai AI Laboratory (Foundation Model Evaluation, LLM Compression, Efficient Adaptation, Multimodal Learning)
Yang You — Postdoc, Stanford University (3D Vision, Computer Graphics, Computational Geometry)
Kaipeng Zhang — Shanghai AI Laboratory (LLM, Multimodal LLMs, AIGC)