EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
A lack of hierarchical, cognitively grounded evaluation benchmarks tailored to Chinese K–12 education hinders the safe and compliant deployment of large language models (LLMs) in educational contexts. Method: We propose the EduAbility Taxonomy—a framework aligned with Bloom's Taxonomy and Webb's Depth of Knowledge—and use it to construct EduEval, the first high-fidelity Chinese educational benchmark covering elementary, middle, and high school. The benchmark assesses six cognitive dimensions: memorization, understanding, application, reasoning, creativity, and ethics. Drawing on authentic examination items, classroom dialogues, and student essays, we design 24 task types and conduct systematic zero-shot and few-shot evaluations of 14 mainstream LLMs. Contribution/Results: The results reveal robust performance on factual tasks but significant limitations in classroom dialogue classification and creative generation; notably, several open-weight models outperform closed-source counterparts on complex reasoning. EduEval establishes a scalable, interpretable methodological foundation for education-oriented LLM evaluation.

📝 Abstract
Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom's Taxonomy and Webb's Depth of Knowledge to organize tasks across six cognitive dimensions: Memorization, Understanding, Application, Reasoning, Creativity, and Ethics; (2) Authenticity: Our benchmark integrates real exam questions, classroom conversations, student essays, and expert-designed prompts to reflect genuine educational challenges; (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open-source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' performance in Chinese K-12 education
Assessing models across cognitive dimensions like reasoning and creativity
Addressing risks from unscrutinized deployment in educational settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical cognitive benchmark for Chinese education evaluation
Integrates real exam questions and expert-designed prompts
Comprises 24 task types with over 11,000 questions
👥 Authors
Guoqing Ma — Zhejiang Normal University, Zhejiang, China
Jia Zhu — Zhejiang Normal University (Artificial Intelligence, Knowledge Graph, Data Quality, Computational Pedagogy)
Hanghui Guo — Zhejiang Normal University (Large Language Model)
Weijie Shi — Hong Kong University of Science and Technology
Yue Cui — Hong Kong University of Science and Technology, Hong Kong, China
Jiawei Shen — Washington University in St. Louis (Machine Learning)
Zilong Li — Zhejiang Normal University, Zhejiang, China
Yidan Liang — Zhejiang Normal University, Zhejiang, China