🤖 AI Summary
Current evaluations of instructional agents lack systematic assessment of higher-order capabilities essential in real-world teaching contexts, such as diagnosing learner states, dynamically intervening, and operating within instructional systems. This work proposes EduAgentBench, which the authors describe as the first theory-grounded benchmark spanning the full instructional workflow. It comprises 150 quality-controlled tasks that evaluate agent performance across three dimensions: professional pedagogical judgment, situated multi-turn tutoring, and completion of Canvas-style teaching workflows. Tasks are built through a pedagogically informed design pipeline and evaluated with complementary verification signals and human review. Experimental results show that frontier models are generally capable of bounded pedagogical judgment but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution, leaving a clear gap before practical deployment in real teaching scenarios.
📝 Abstract
Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. In a comprehensive evaluation of frontier models, we find that current models are generally capable of bounded pedagogical judgment but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.