OpenLearnLM Benchmark: A Unified Framework for Evaluating Knowledge, Skill, and Attitude in Educational Large Language Models

📅 2026-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of learning science grounding in existing benchmarks for educational large language models, which limits their ability to comprehensively evaluate model performance in real-world settings. The authors propose the first unified evaluation framework rooted in educational assessment theory, systematically measuring model capabilities across three dimensions: knowledge, skills, and attitudes. Skill evaluation is structured through a four-level hierarchy—center, role, scenario, and sub-scenario—and calibrated using Bloom’s taxonomy for difficulty. Authoritative knowledge items are reused, while novel attitude metrics such as deception resistance and behavioral consistency are introduced. Evaluation of seven state-of-the-art models on a diverse dataset of over 124K multi-disciplinary, multi-role, and multi-difficulty samples reveals no single model excels across all dimensions, underscoring the necessity of multi-axis coordinated assessment and validating the effectiveness of the proposed framework.

Technology Category

Application Category

📝 Abstract
Large Language Models are increasingly deployed as educational tools, yet existing benchmarks focus on narrow skills and lack grounding in learning sciences. We introduce OpenLearnLM Benchmark, a theory-grounded framework evaluating LLMs across three dimensions derived from educational assessment theory: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels based on Bloom's taxonomy. The Knowledge domain prioritizes authentic assessment items from established benchmarks, while the Attitude domain adapts Anthropic's Alignment Faking methodology to detect behavioral inconsistency under varying monitoring conditions. Evaluation of seven frontier models reveals distinct capability profiles: Claude-Opus-4.5 excels in practical skills despite lower content knowledge, while Grok-4.1-fast leads in knowledge but shows alignment concerns. Notably, no single model dominates all dimensions, validating the necessity of multi-axis evaluation. OpenLearnLM provides an open, comprehensive framework for advancing LLM readiness in authentic educational contexts.
Problem

Research questions and friction points this paper is trying to address.

educational large language models
benchmark
knowledge
skills
attitude
Innovation

Methods, ideas, or system contributions that make the work stand out.

educational LLM evaluation
multi-dimensional benchmark
alignment faking
Bloom's taxonomy
scenario-based assessment
🔎 Similar Papers
No similar papers found.
U
Unggi Lee
Chosun University
S
Sookbun Lee
Independent Researcher
H
Heungsoo Choi
Korea University
J
Jinseo Lee
Ewha Womans University
H
Haeun Park
Korea Institute for Curriculum and Evaluation
Y
Younghoon Jeon
Upstage
Sungmin Cho
Sungmin Cho
Delvine Inc.
M
Minju Kang
Seoul National University
Junbo Koh
Junbo Koh
Educational Technology, Seoul National University
ISDAIEDLearning SciencesLLM(LMM)
Jiyeong Bae
Jiyeong Bae
Korea University
Machine Learning
M
Minwoo Nam
Korea University
J
Juyeon Eun
Korea University
Yeonji Jung
Yeonji Jung
Texas A&M University
Learning AnalyticsLearning Experience DesignLearning Sciences
Yeil Jeong
Yeil Jeong
Indiana University
AI in EducationHuman-AI InteractionDomain-specific LLMs