Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the absence of evaluation benchmarks for multimodal large language models (MLLMs) grounded in authentic K–12 assessments. The authors introduce the first multimodal dataset derived from Japan’s nationwide academic achievement survey, encompassing middle school science, mathematics, and Japanese language subjects. The dataset preserves original test layouts, diagrams, and educational texts, and incorporates response distributions from approximately 900,000 students, enabling direct human–AI performance comparison. Open-ended responses are evaluated using exact-match accuracy and character-level F1 scores, with automatic scoring reliability validated through both human raters and LLM-as-judge approaches. This benchmark reveals performance disparities among MLLMs across disciplines and visual reasoning tasks, establishing a reproducible and fine-grained foundation for evaluating AI systems in educational contexts.

📝 Abstract

Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N ≈ 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: https://github.com/KyosukeTakami/gakucho-benchmark
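
The abstract names exact-match accuracy and character-level F1 as the metrics for open-ended responses but does not spell out how they are computed. The sketch below shows one common way to define them, assuming a bag-of-characters overlap between the model answer and the reference answer. The function names (`normalize`, `exact_match`, `char_f1`) and the normalization step (Unicode NFKC plus whitespace removal) are illustrative assumptions, not the authors' released implementation.

```python
from collections import Counter
import unicodedata


def normalize(text: str) -> str:
    # Assumption: fold full-width/half-width variants with NFKC and drop
    # all whitespace. The paper's actual normalization may differ.
    return "".join(unicodedata.normalize("NFKC", text).split())


def exact_match(prediction: str, reference: str) -> bool:
    # True when the normalized strings are identical.
    return normalize(prediction) == normalize(reference)


def char_f1(prediction: str, reference: str) -> float:
    # Character-level F1 computed on multisets (bags) of characters.
    pred_chars = Counter(normalize(prediction))
    ref_chars = Counter(normalize(reference))
    if not pred_chars or not ref_chars:
        # Both empty counts as a perfect match; one empty scores zero.
        return float(pred_chars == ref_chars)
    overlap = sum((pred_chars & ref_chars).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_chars.values())
    recall = overlap / sum(ref_chars.values())
    return 2 * precision * recall / (precision + recall)
```

Character-level rather than word-level overlap is a natural fit for Japanese, which is written without spaces between words, so no tokenizer is needed. Note that a bag-of-characters score ignores character order, so it gives partial credit to answers that share characters with the reference even when the wording differs.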

Problem

Research questions and friction points this paper is trying to address.

multimodal benchmark

educational assessment

human-grounded evaluation

large language models

student response distributions

Innovation

Methods, ideas, or system contributions that make the work stand out.

human-grounded benchmark

multimodal LLM evaluation

national assessment data

student response distribution

educational AI

🔎 Similar Papers

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

2024-05-24 · arXiv.org · Citations: 6

PATCH - Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency

2024-04-02 · arXiv.org · Citations: 3