🤖 AI Summary
This work addresses the lack of systematic evaluation of large language models’ (LLMs’) teaching capabilities, particularly in syllabus-grounded, multi-turn instructional settings centered on structured knowledge. We propose the first syllabus-grounded evaluation framework that treats teaching as an independent, measurable dimension of LLM performance. By constraining teacher agents with structured knowledge points and exemplar questions, our approach simulates multi-turn pedagogical interactions and quantifies teaching effectiveness through student pre- and post-test performance gains. This paradigm avoids information leakage and leverages existing evaluation resources. Experiments on a multi-subject Chinese college entrance exam dataset reveal that LLMs perform relatively well in mathematics instruction but face significant challenges in physics and chemistry. Moreover, incorporating exemplar questions does not consistently improve teaching outcomes, as models often focus on correcting examples rather than conveying core concepts.
📝 Abstract
Large language models (LLMs) show promise as teaching assistants, yet their teaching capability remains insufficiently evaluated. Existing benchmarks mainly focus on problem-solving or problem-level guidance, leaving knowledge-centered teaching underexplored. We propose a syllabus-grounded evaluation framework that measures LLM teaching capability via student performance improvement after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and enables reuse of existing benchmarks. We instantiate the framework on Gaokao data across multiple subjects. Experiments reveal substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. We also find that incorporating example problems does not necessarily improve teaching, as models often shift toward example-specific error correction. Overall, our results highlight teaching ability as a distinct and measurable dimension of LLM behavior.
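The framework's central measurement is the student agent's performance gain from pre-test to post-test after multi-turn instruction. A minimal sketch of that gain computation is below; the function names and the simple absolute-gain formula are illustrative assumptions, not the paper's exact protocol:

```python
# Illustrative sketch of the pre-/post-test gain measurement described above.
# The answer format, scoring rule, and absolute-gain formula are assumptions
# for illustration; the paper's actual evaluation protocol may differ.

def score(answers, key):
    """Fraction of questions answered correctly against the answer key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def teaching_gain(pre_answers, post_answers, answer_key):
    """Teaching effectiveness as post-test accuracy minus pre-test accuracy."""
    return score(post_answers, answer_key) - score(pre_answers, answer_key)

# Hypothetical single-student example on a 4-question test.
key = ["A", "C", "B", "D"]
pre = ["A", "B", "B", "A"]   # before instruction: 2/4 correct
post = ["A", "C", "B", "A"]  # after multi-turn teaching: 3/4 correct
print(teaching_gain(pre, post, key))  # 0.25
```

In practice such a gain would be averaged over many students and questions per subject, which is what allows teaching effectiveness to be compared across models and domains (e.g., mathematics versus physics and chemistry) as the abstract describes.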