🤖 AI Summary
This work addresses the lack of learning science grounding in existing benchmarks for educational large language models, which limits their ability to comprehensively evaluate model performance in real-world settings. The authors propose the first unified evaluation framework rooted in educational assessment theory, systematically measuring model capabilities across three dimensions: knowledge, skills, and attitudes. Skill evaluation is structured through a four-level hierarchy—center, role, scenario, and sub-scenario—and calibrated using Bloom’s taxonomy for difficulty. Authoritative knowledge items are reused, while novel attitude metrics such as deception resistance and behavioral consistency are introduced. Evaluation of seven state-of-the-art models on a diverse dataset of over 124K multi-disciplinary, multi-role, and multi-difficulty samples reveals no single model excels across all dimensions, underscoring the necessity of multi-axis coordinated assessment and validating the effectiveness of the proposed framework.
📝 Abstract
Large Language Models are increasingly deployed as educational tools, yet existing benchmarks focus on narrow skills and lack grounding in learning sciences. We introduce OpenLearnLM Benchmark, a theory-grounded framework evaluating LLMs across three dimensions derived from educational assessment theory: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels based on Bloom's taxonomy. The Knowledge domain prioritizes authentic assessment items from established benchmarks, while the Attitude domain adapts Anthropic's Alignment Faking methodology to detect behavioral inconsistency under varying monitoring conditions. Evaluation of seven frontier models reveals distinct capability profiles: Claude-Opus-4.5 excels in practical skills despite lower content knowledge, while Grok-4.1-fast leads in knowledge but shows alignment concerns. Notably, no single model dominates all dimensions, validating the necessity of multi-axis evaluation. OpenLearnLM provides an open, comprehensive framework for advancing LLM readiness in authentic educational contexts.