🤖 AI Summary
Problem: Existing applications of large language models (LLMs) in education scale poorly to large, heterogeneous course content and lack systematic frameworks for assessing pedagogical quality. Method: We propose WikiHowAgent, a multi-agent LLM workflow for procedural knowledge instruction, comprising coordinated teacher, learner, interaction manager, and evaluator agents that are integrated via prompt engineering, role-based simulation, and workflow control. Contribution/Results: We construct a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics, and design an evaluation protocol that combines computational metrics, rubric-based metrics, and alignment with human judgment. Experiments demonstrate the workflow's effectiveness in diverse setups and offer insights into LLM capabilities across domains. All data and code are publicly released to advance AI4Education research.
📝 Abstract
Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large-scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi-agent workflow leveraging LLMs to simulate interactive teaching-learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric-based metrics with human judgment alignment. Results demonstrate the workflow's effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open-sourced.
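The four roles named in the abstract can be pictured with a minimal sketch. This is not the WikiHowAgent implementation: the function names (`run_session`, `evaluate`, `stub_teacher`, `stub_learner`), the "understood" stopping rule, the turn limit, and the coverage score are all hypothetical, and the LLM calls are replaced by plain Python stubs so the example runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, utterance)
# Hypothetical stand-in for an LLM-backed agent: (current step, history) -> reply
Agent = Callable[[str, List[Turn]], str]


@dataclass
class Conversation:
    steps: List[str]
    turns: List[Turn] = field(default_factory=list)


def run_session(steps: List[str], teacher: Agent, learner: Agent,
                max_turns_per_step: int = 2) -> Conversation:
    """Interaction manager (assumed logic): walk the tutorial step by step,
    alternating teacher and learner until the learner signals understanding
    or the per-step turn limit is reached."""
    convo = Conversation(steps=list(steps))
    for step in convo.steps:
        for _ in range(max_turns_per_step):
            convo.turns.append(("teacher", teacher(step, convo.turns)))
            answer = learner(step, convo.turns)
            convo.turns.append(("learner", answer))
            if "understood" in answer.lower():
                break
    return convo


def evaluate(convo: Conversation) -> float:
    """Toy evaluator (assumption): fraction of tutorial steps that appear
    verbatim in the teacher's utterances."""
    if not convo.steps:
        return 0.0
    teacher_text = " ".join(text for role, text in convo.turns if role == "teacher")
    covered = sum(1 for step in convo.steps if step in teacher_text)
    return covered / len(convo.steps)


def stub_teacher(step: str, history: List[Turn]) -> str:
    # A real teacher agent would prompt an LLM with the step and the history.
    return f"Next, {step}."


def stub_learner(step: str, history: List[Turn]) -> str:
    # A real learner agent would ask questions or confirm understanding.
    return "Understood, I can do that."


if __name__ == "__main__":
    convo = run_session(["boil water", "add tea leaves"], stub_teacher, stub_learner)
    for role, text in convo.turns:
        print(f"{role}: {text}")
    print("step coverage:", evaluate(convo))
```

Passing the agents in as plain callables keeps the manager loop independent of any particular model or prompting library, so the stubs could be swapped for real LLM-backed agents without changing `run_session`.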