Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

📅 2026-05-12

🤖 AI Summary
This work addresses the scarcity of evaluation benchmarks for multimodal large language models (MLLMs) grounded in authentic Japanese K-12 assessments. The authors introduce a multimodal dataset derived from Japan's National Assessment of Academic Ability, covering officially released middle-school items in Science, Mathematics, and Japanese Language. The dataset preserves original test layouts, diagrams, and educational texts, and includes aggregated response distributions from approximately 900,000 students, enabling direct human-AI performance comparison. Open-ended responses are scored with exact-match accuracy and character-level F1, and the reliability of this automatic scoring is assessed through human evaluation and LLM-as-judge analyses. The benchmark reveals substantial performance variation among MLLMs across subjects and strong sensitivity to visual reasoning demands, providing a reproducible, human-grounded foundation for evaluating AI systems in educational contexts.
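The human-AI comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Item` record, its field names, and the unweighted per-item averaging are all assumptions made for the example, and the dataset's actual schema may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Item:
    """Hypothetical record for one exam item (field names are assumptions)."""
    item_id: str
    student_correct_rate: float  # nationwide share of correct answers, in [0, 1]
    model_correct: bool          # whether the model answered this item correctly


def human_model_gap(items: Iterable[Item]) -> Tuple[float, float, float]:
    """Return (mean human correct rate, model accuracy, model minus human).

    Both averages are unweighted over items, which is an assumption.
    """
    items = list(items)
    if not items:
        raise ValueError("at least one item is required")
    n = len(items)
    human = sum(i.student_correct_rate for i in items) / n
    model = sum(1 for i in items if i.model_correct) / n
    return human, model, model - human
```

A negative gap on a subset of items (for example, diagram-heavy questions) would indicate that the model underperforms the student population on that subset.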
๐Ÿ“ Abstract
Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N ≈ 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: https://github.com/KyosukeTakami/gakucho-benchmark
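The two automatic metrics named in the abstract can be illustrated with a short sketch. The abstract does not spell out the exact definitions, so this assumes a common convention: exact match after Unicode NFKC normalization, and character-level F1 computed from the multiset (bag-of-characters) overlap between prediction and reference. The function names and the normalization step are assumptions, not the authors' implementation.

```python
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    # NFKC folds full-width and half-width variants that are common in
    # Japanese text (an assumed preprocessing step, not from the paper).
    return unicodedata.normalize("NFKC", text).strip()


def exact_match(pred: str, gold: str) -> bool:
    """True if the normalized prediction equals the normalized reference."""
    return normalize(pred) == normalize(gold)


def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 based on multiset overlap of characters."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        # Two empty strings count as a perfect match; one empty string scores 0.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Character-level overlap avoids the need for a word tokenizer, which is convenient for Japanese because the text has no whitespace word boundaries. Since this variant ignores character order, it may credit answers that reuse the right characters in the wrong arrangement.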
Problem

Research questions and friction points this paper is trying to address.

multimodal benchmark
educational assessment
human-grounded evaluation
large language models
student response distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

human-grounded benchmark
multimodal LLM evaluation
national assessment data
student response distribution
educational AI