TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models

📅 2026-01-29
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of large language models’ (LLMs’) teaching capabilities, particularly in syllabus-grounded, multi-turn instructional settings centered on structured knowledge. We propose the first syllabus-grounded evaluation framework that treats teaching as an independent, measurable dimension of LLM performance. By constraining teacher agents with structured knowledge points and exemplar questions, our approach simulates multi-turn pedagogical interactions and quantifies teaching effectiveness through student pre- and post-test performance gains. This paradigm avoids information leakage and leverages existing evaluation resources. Experiments on a multi-subject Chinese college entrance exam dataset reveal that LLMs perform relatively well in mathematics instruction but face significant challenges in physics and chemistry. Moreover, incorporating exemplar questions does not consistently improve teaching outcomes, as models often focus on correcting examples rather than conveying core concepts.
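The scoring the summary describes reduces to a simple pre-/post-test gain. The Python sketch below illustrates that computation under the assumption of accuracy-based scoring; the normalized (Hake-style) variant is an illustrative addition, not necessarily the paper's exact metric.

# Minimal sketch of the pre-/post-test gain metric. Accuracy-based scoring
# is assumed here; the paper's exact scoring rule is not given in the summary.

def teaching_gain(pre_correct: int, post_correct: int, total: int) -> float:
    """Absolute accuracy improvement of the student after instruction."""
    return (post_correct - pre_correct) / total

def normalized_gain(pre_correct: int, post_correct: int, total: int) -> float:
    """Hake-style gain normalized by remaining headroom, so a high pre-test
    score does not cap the measurable improvement. Illustrative only."""
    pre, post = pre_correct / total, post_correct / total
    if pre >= 1.0:
        return 0.0  # no headroom left to improve
    return (post - pre) / (1.0 - pre)

For example, a student moving from 12/20 to 16/20 correct yields an absolute gain of 0.2 and a normalized gain of 0.5.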

📝 Abstract
Large language models (LLMs) show promise as teaching assistants, yet their teaching capability remains insufficiently evaluated. Existing benchmarks mainly focus on problem-solving or problem-level guidance, leaving knowledge-centered teaching underexplored. We propose a syllabus-grounded evaluation framework that measures LLM teaching capability via student performance improvement after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and enables reuse of existing benchmarks. We instantiate the framework on Gaokao data across multiple subjects. Experiments reveal substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. We also find that incorporating example problems does not necessarily improve teaching, as models often shift toward example-specific error correction. Overall, our results highlight teaching ability as a distinct and measurable dimension of LLM behavior.
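Read operationally, the abstract implies an evaluation loop in which a teacher agent, shown only the structured knowledge points and example problems, instructs a student agent over several turns, after which teaching ability is scored as the student's pre- versus post-test gain on held-out items. The Python skeleton below is a hypothetical reconstruction of that loop; the Agent protocol, chat method, and grade helper are placeholder names, not the paper's actual interface.

from dataclasses import dataclass, field
from typing import Protocol

class Agent(Protocol):
    def chat(self, prompt: str, history: list[str]) -> str: ...

@dataclass
class KnowledgePoint:
    name: str
    description: str
    example_problems: list[str] = field(default_factory=list)

def grade(student: Agent, test_items: list[tuple[str, str]]) -> float:
    """Fraction of (question, answer) test items the student solves."""
    correct = sum(student.chat(q, []).strip() == a for q, a in test_items)
    return correct / len(test_items)

def run_teaching_session(teacher: Agent, student: Agent, kp: KnowledgePoint,
                         test_items: list[tuple[str, str]],
                         num_turns: int = 5) -> float:
    """Return the student's accuracy gain (post minus pre) on held-out items."""
    pre = grade(student, test_items)  # pre-test, before any instruction

    # The teacher sees only the structured knowledge point and its example
    # problems, never the held-out test items; this separation is how the
    # framework avoids information leakage from test to teacher.
    context = (f"Teach this knowledge point: {kp.name}\n{kp.description}\n"
               f"Example problems: {kp.example_problems}")
    history: list[str] = []
    for _ in range(num_turns):
        lesson = teacher.chat(context, history)  # teacher's instructional turn
        reply = student.chat(lesson, history)    # student's answer or question
        history += [lesson, reply]

    post = grade(student, test_items)  # post-test, after instruction
    return post - pre

Averaging this gain over knowledge points and subjects would yield the per-domain comparisons the abstract reports, such as stronger gains in mathematics than in physics or chemistry.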
Problem

Research questions and friction points this paper is trying to address.

teaching ability
large language models
evaluation benchmark
syllabus-grounded
knowledge-centered teaching
Innovation

Methods, ideas, or system contributions that make the work stand out.

syllabus-grounded evaluation
teaching ability
large language models
multi-turn instruction
student performance improvement
👥 Authors

Zheng Li
Peking University
Artificial Intelligence, Natural Language Processing

Siyao Song
ByteDance BandAI; Institute of Automation, Chinese Academy of Sciences

Jingyuan Ma
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; ByteDance BandAI

Rui Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; ByteDance BandAI

Ying Zeng
ByteDance BandAI

Minghao Li
Beihang University
Natural Language Processing

Zhifang Sui
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University