LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for high school subject evaluation suffer from static datasets, unimodal inputs, and data contamination, which limits their ability to assess the reasoning capabilities of large multimodal models under authentic conditions. To address these limitations, this work introduces LiveK12Bench, a dynamic, multimodal benchmark comprising over 2,000 recent, authentic exam questions spanning mathematics, physics, chemistry, and biology. It proposes a "Mock Exam" evaluation paradigm in which a model completes an exam end to end and is measured on both the accuracy and the efficiency of its reasoning. The benchmark combines an automated pipeline for collecting and parsing exam papers, a multimodal model evaluation framework, and a structured scoring mechanism, which together mitigate data leakage and support assessment across disciplines. Experiments on 12 models show that even advanced models degrade substantially under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. The results also expose weaknesses such as sensitivity to complex visual layouts.

📝 Abstract

Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features two core innovations: 1) an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) a novel "Mock Exam" evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.
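
The abstract does not describe how the ingestion pipeline decides which questions are safe to use. One common way a dynamic benchmark can limit data leakage is to evaluate a model only on questions from exams held after that model's training cutoff. The sketch below illustrates that general idea only. The `ExamQuestion` record, its fields, the `select_uncontaminated` helper, and all dates are hypothetical and are not taken from LiveK12Bench.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExamQuestion:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    question_id: str
    subject: str
    exam_date: date


def select_uncontaminated(questions, training_cutoff):
    """Keep only questions from exams held strictly after the training cutoff."""
    return [q for q in questions if q.exam_date > training_cutoff]


# Made-up sample data for illustration.
questions = [
    ExamQuestion("math-001", "Mathematics", date(2024, 6, 7)),
    ExamQuestion("phys-014", "Physics", date(2025, 6, 8)),
    ExamQuestion("chem-032", "Chemistry", date(2025, 11, 2)),
]

# Assumed cutoff for some model under evaluation.
fresh = select_uncontaminated(questions, training_cutoff=date(2025, 1, 1))
print([q.question_id for q in fresh])  # ['phys-014', 'chem-032']
```

A strict comparison is used so that an exam held on the cutoff date itself is treated as potentially seen. A real pipeline would also need a reliable exam date for every paper, which this sketch simply assumes.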

Problem

Research questions and friction points this paper is trying to address.

Large Multimodal Models

K-12 Examinations

Benchmarking

Realistic Evaluation

Educational Readiness

Innovation

Methods, ideas, or system contributions that make the work stand out.

LiveK12Bench

Large Multimodal Models

dynamic benchmark