OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing educational LLM evaluation benchmarks overemphasize knowledge acquisition while neglecting assessment of pedagogical competence, and they suffer from narrow subject and question-type coverage, especially in Chinese contexts. To address this, we propose OmniEduBench, a comprehensive Chinese dual-dimensional educational benchmark that jointly assesses *knowledge mastery* and *capability cultivation*. It comprises 24.6K high-quality QA pairs spanning 61 disciplines and 11 question types. A meticulously designed, human-annotated multidimensional labeling schema enables fine-grained subject categorization and diverse question-type evaluation. Extensive experiments across 11 state-of-the-art LLMs reveal that only Gemini-2.5 Pro exceeds 60% accuracy on the knowledge dimension, while the best-performing model on the cultivation dimension (QWQ) trails human performance by nearly 30 percentage points, highlighting a critical gap in current models' pedagogical capabilities and underscoring substantial room for improvement.

📝 Abstract
With the rapid development of large language models (LLMs), various LLM-based works have been widely applied in educational fields. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension, largely neglecting the evaluation of cultivation capabilities that are essential for real-world educational scenarios. Additionally, current benchmarks are often limited to a single subject or question type, lacking sufficient diversity. This issue is particularly prominent within the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension, which contain 18.121K and 6.481K entries, respectively. Each dimension is further subdivided into 6 fine-grained categories, covering a total of 61 different subjects (41 in the knowledge dimension and 20 in the cultivation dimension). Furthermore, the dataset features a rich variety of question formats, including 11 common exam question types, providing a solid foundation for comprehensively evaluating LLMs' capabilities in education. Extensive experiments on 11 mainstream open-source and closed-source LLMs reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro surpassed 60% accuracy, while in the cultivation dimension, the best-performing model, QWQ, still trailed human performance by nearly 30%. These results highlight the substantial room for improvement and underscore the challenges of applying LLMs in education.
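The two-dimensional scoring the abstract describes (accuracy computed separately over labeled knowledge and cultivation QA pairs) can be sketched as follows. This is a minimal illustration, not the paper's released evaluation code; the function and field names are hypothetical.

```python
# Hypothetical sketch: per-dimension accuracy over QA results labeled
# with a dimension ("knowledge" / "cultivation") and a subject.
from collections import defaultdict

def dimension_accuracy(results):
    """results: iterable of dicts with keys 'dimension', 'subject', 'correct'."""
    totals = defaultdict(int)  # items seen per dimension
    hits = defaultdict(int)    # correct answers per dimension
    for r in results:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += int(r["correct"])
    return {dim: hits[dim] / totals[dim] for dim in totals}

# Toy run: 3 knowledge items (2 correct), 2 cultivation items (1 correct)
results = [
    {"dimension": "knowledge", "subject": "math", "correct": True},
    {"dimension": "knowledge", "subject": "physics", "correct": True},
    {"dimension": "knowledge", "subject": "history", "correct": False},
    {"dimension": "cultivation", "subject": "pedagogy", "correct": True},
    {"dimension": "cultivation", "subject": "ethics", "correct": False},
]
print(dimension_accuracy(results))
# → {'knowledge': 0.6666666666666666, 'cultivation': 0.5}
```

Reporting the two dimensions separately, rather than a single pooled accuracy, is what exposes the gap the paper highlights: a model can clear the knowledge dimension while trailing badly on cultivation.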
Problem

Research questions and friction points this paper is trying to address.

Evaluating cultivation capabilities in Chinese educational LLMs
Addressing limited subject diversity in current benchmarks
Assessing multi-dimensional performance beyond knowledge metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for Chinese educational LLM evaluation
Divides data into knowledge and cultivation dimensions
Includes diverse subjects and multiple question formats
Authors
Min Zhang (East China Normal University)
Hao Chen (East China Normal University)
Wenqi Zhang (Zhejiang University)
Didi Zhu (Imperial College London)
Xin Lin (East China Normal University)
Bo Jiang (East China Normal University)
Aimin Zhou (East China Normal University)
Fei Wu (Zhejiang University)
Kun Kuang (Zhejiang University)