EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluation of large language models (LLMs) in education currently lacks education-specific benchmarks and frameworks tailored to real-world pedagogical contexts. Method: This paper introduces EduBench, the first comprehensive, education-oriented benchmark dataset, encompassing nine authentic teaching scenarios and over 4,000 educationally grounded prompts, and supporting dual-perspective (teacher and student) evaluation. The authors propose a 12-dimensional educational evaluation framework integrating synthetic data generation, a multi-dimensional automated assessment pipeline, human validation, and supervised fine-tuning. Contribution/Results: Experimental results demonstrate that compact, educationally specialized models trained on EduBench achieve performance on par with state-of-the-art large models (e.g., DeepSeek-V3, Qwen-Max). To foster reproducibility and community advancement, the dataset, evaluation code, and framework are fully open-sourced.

📝 Abstract
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data covering 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.
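The abstract describes scoring model responses along 12 teacher- and student-relevant dimensions. As a minimal sketch of how such multi-dimensional scores might be aggregated into a single comparison number, the snippet below averages per-dimension ratings within each perspective and then across perspectives. The dimension names, the 1-to-5 scale, and the unweighted averaging are all illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch of multi-dimensional score aggregation.
# Dimension names, scale, and equal weighting are illustrative only.
from statistics import mean


def aggregate_scores(dimension_scores: dict) -> float:
    """Average per-dimension ratings (e.g., on a 1-5 scale) into one score."""
    return mean(dimension_scores.values())


# Illustrative teacher- and student-perspective ratings for one response.
teacher_view = {"clarity": 4.0, "pedagogical_soundness": 4.5, "accuracy": 5.0}
student_view = {"helpfulness": 4.0, "engagement": 3.5, "accuracy": 4.5}

overall = mean([aggregate_scores(teacher_view), aggregate_scores(student_view)])
```

In practice a benchmark might weight dimensions differently per scenario; equal weighting is just the simplest defensible starting point.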
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse benchmarks for evaluating LLMs in education
Need for multi-dimensional metrics covering teacher- and student-relevant aspects
Absence of optimized small-scale, education-oriented language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a diverse educational benchmark built from synthetic data
Proposes multi-dimensional metrics for comprehensive assessment
Trains a small model that matches state-of-the-art performance
👥 Authors
Bin Xu (School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China)
Yu Bai (School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China)
Huashan Sun (Beijing Institute of Technology; AI, NLP)
Yiguan Lin (School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China)
Siming Liu (Associate Professor, Computer Science, Missouri State University; AI, Genetic Algorithms, Machine Learning, Reinforcement Learning, Multi-Agent Systems)
Xinyue Liang (PhD student, KTH Royal Institute of Technology; Machine Learning, Distributed Learning, Neural Networks)
Yaolin Li (School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China)
Yang Gao (School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China)
Heyan Huang (School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China)