🤖 AI Summary
This study addresses two limitations in evaluating large language models (LLMs) for automating instructional systems design (ISD): the absence of standardized benchmarks and bias in LLM-based evaluation. To this end, the authors propose ISD-Agent-Bench, a large-scale structured benchmark of 25,795 scenarios built from classical ISD frameworks such as ADDIE, which is decomposed into 33 sub-steps and combined with 51 contextual variables to generate diverse tasks. The work integrates the ReAct reasoning paradigm with ISD theory to construct an intelligent agent and introduces a multi-LLM judge consensus mechanism to mitigate evaluation bias. Experiments on 1,017 test scenarios show that this hybrid approach significantly outperforms baselines relying solely on theory or on technical heuristics, and that theoretical fidelity correlates strongly and positively with system performance.
📝 Abstract
Large Language Model (LLM) agents have shown promise in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM-as-judge bias. We present ISD-Agent-Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi-judge protocol using diverse LLMs from different providers, achieving high inter-judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick & Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance, outperforming both pure theory-based agents and technique-only approaches. Further analysis reveals that theoretical quality strongly correlates with benchmark performance, with theory-based agents showing significant advantages in problem-centered design and objective-assessment alignment. Our work provides a foundation for systematic LLM-based ISD research.
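To make the Context Matrix idea concrete, the sketch below shows one plausible way such a benchmark could cross contextual variables with ISD sub-steps to enumerate scenarios. The category names, variable values, and sub-steps here are hypothetical miniatures, not the paper's actual data (the real benchmark uses 51 variables across 5 categories and 33 ADDIE-derived sub-steps):

```python
from itertools import product

# Hypothetical, tiny stand-ins for the benchmark's contextual categories.
context_categories = {
    "learner": ["K-12 students", "corporate trainees"],
    "modality": ["online", "blended"],
    "domain": ["STEM", "language learning"],
}

# Hypothetical stand-ins for ADDIE-derived sub-steps.
sub_steps = ["conduct needs analysis", "write learning objectives"]

def generate_scenarios(categories, steps):
    """Cross every combination of context variables with every sub-step."""
    scenarios = []
    for combo in product(*categories.values()):
        context = dict(zip(categories.keys(), combo))
        for step in steps:
            scenarios.append({"context": context, "task": step})
    return scenarios

scenarios = generate_scenarios(context_categories, sub_steps)
print(len(scenarios))  # 2 * 2 * 2 contexts x 2 sub-steps = 16
```

The Cartesian product explains why a modest number of variables and sub-steps can yield tens of thousands of distinct, structured scenarios.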