🤖 AI Summary
Current evaluations of large language models for education focus predominantly on factual question answering and overlook curriculum cognition: understanding knowledge structures, prerequisite relationships, and pedagogical sequencing. This work presents the first systematic construction of a curriculum-aligned knowledge graph spanning mathematics, physics, chemistry, and biology across K-12 education, with seven node types and nine relation types. Building on this graph, the authors define a family of curriculum cognition tasks and release both a benchmark, K12-Bench, and a training corpus, K12-Train. Experiments show that mainstream models perform poorly on K12-Bench (best exact-match score: 57%). In contrast, small models (Qwen3-4B-Base and Llama-3.1-8B-Base) fine-tuned on K12-Train consistently outperform the same models fine-tuned on equally sized subsets of eight mainstream instruction-tuning corpora on GaokaoBench and EduEval, indicating that knowledge graph-driven supervision is sample-efficient in educational settings.
📝 Abstract
Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.
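To make the graph-derived task setup concrete, the following is a minimal Python sketch of how a Prereq-style multi-select item might be derived from a typed graph and scored by exact match. It is an illustration under stated assumptions, not the released pipeline: the `CurriculumGraph` class, the `prerequisite_of` relation label, the helper names, and the toy concepts are all hypothetical. Only the seven node type names and the exact-match metric come from the abstract.

```python
from dataclasses import dataclass, field

# The seven node types named in the abstract.
NODE_TYPES = {"Concept", "Skill", "Experiment", "Exercise", "Section", "Chapter", "Book"}


@dataclass
class CurriculumGraph:
    """Toy typed graph; edges are (source_id, relation, target_id) triples."""
    node_types: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.node_types[node_id] = node_type

    def add_edge(self, source_id, relation, target_id):
        if source_id not in self.node_types or target_id not in self.node_types:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source_id, relation, target_id))

    def direct_prerequisites(self, target_id):
        # "prerequisite_of" is a hypothetical label, not the paper's relation name.
        return {s for s, r, t in self.edges if r == "prerequisite_of" and t == target_id}


def build_prereq_question(graph, target_id, candidate_ids):
    """Label the candidates A, B, C, ... and return (options, gold_letters)."""
    gold_ids = graph.direct_prerequisites(target_id)
    options = {chr(ord("A") + i): cid for i, cid in enumerate(sorted(candidate_ids))}
    gold = {letter for letter, cid in options.items() if cid in gold_ids}
    return options, gold


def exact_match(predicted, gold):
    """A multi-select answer counts only if the chosen set equals the gold set."""
    return set(predicted) == set(gold)


def exact_match_accuracy(predictions, golds):
    """Fraction of questions answered with exactly the gold option set."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)


# Toy data (invented for illustration, not taken from K12-KGraph).
graph = CurriculumGraph()
for name in ["fractions", "ratios", "proportions", "photosynthesis"]:
    graph.add_node(name, "Concept")
graph.add_edge("fractions", "prerequisite_of", "proportions")
graph.add_edge("ratios", "prerequisite_of", "proportions")

options, gold = build_prereq_question(
    graph, "proportions", ["fractions", "ratios", "photosynthesis"]
)
```

Under exact match, a partially correct selection earns no credit, which is one reason multi-select scores can sit well below single-answer accuracy on comparable material.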