🤖 AI Summary
This study addresses a critical gap in large language model (LLM)-based educational technologies: the lack of a systematic definition and evaluation of simulated students. The authors formalize the student simulation task for the first time and propose a multidimensional evaluation framework encompassing linguistic, behavioral, and cognitive dimensions. Using real-world math tutoring dialogues, they conduct a comprehensive assessment of various simulation approaches. Their experiments reveal that prevailing prompting strategies fall significantly short of generating authentic student behaviors. While supervised fine-tuning and preference optimization yield modest improvements, performance remains limited, underscoring the inherent difficulty of the task. This work establishes both a theoretical foundation and a standardized benchmark for future research on student simulation in educational AI.
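To make the "linguistic dimension" concrete, here is a minimal, purely illustrative metric sketch (not necessarily one of the paper's actual metrics): comparing the average utterance length of simulated student turns against real student turns from a tutoring dialogue. All data and function names below are hypothetical.

```python
# Illustrative linguistic-dimension metric: gap in mean utterance length
# between real and simulated student turns. This is an assumed example,
# not a metric taken from the paper.

def mean_utterance_length(turns):
    """Average word count per student utterance."""
    return sum(len(t.split()) for t in turns) / len(turns)

# Hypothetical real vs. simulated student utterances from a math tutoring dialogue.
real = ["I think it's 4/12?", "Oh, do I need a common denominator?"]
simulated = [
    "The answer is 7/8 because 3/4 equals 6/8 and 6/8 + 1/8 = 7/8.",
]

# A large gap suggests the simulated student is less linguistically
# faithful (here, far more verbose and polished than real students).
gap = abs(mean_utterance_length(real) - mean_utterance_length(simulated))
print(round(gap, 2))
```

A real evaluation framework would aggregate many such signals (vocabulary, error patterns, question-asking behavior) across dimensions; this sketch only shows the general shape of an automated comparison against real dialogue data.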
📝 Abstract
Advances in large language models (LLMs) enable many innovations in education. However, evaluating the effectiveness of new technology requires real students, which is time-consuming and hard to scale. Therefore, many recent works on LLM-powered tutoring solutions have used simulated students for both training and evaluation, often via simple prompting. Surprisingly, little work has been done to ensure or even measure the quality of simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics that span linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. We experiment on a real-world math tutoring dialogue dataset, where both automated and human evaluation results show that prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization yield much better but still limited performance, motivating future work on this challenging task.
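The "simple prompting" baselines the abstract refers to typically condition an LLM on the dialogue history plus a student persona and ask it to produce the next student turn. A minimal, hypothetical sketch of such prompt construction (the function name, persona fields, and dialogue content are all illustrative assumptions, not taken from the paper):

```python
# Hypothetical sketch of prompting-based student simulation: format a
# tutoring dialogue and a student persona as a prompt whose completion
# is the next student turn. All names here are illustrative.

def build_student_prompt(dialogue, persona):
    """Format a tutoring dialogue as a next-student-turn prompt."""
    lines = [
        f"You are a {persona['grade']} student working on {persona['topic']}.",
        "Respond as this student would, including realistic mistakes and hesitation.",
        "",
    ]
    for turn in dialogue:
        lines.append(f"{turn['role'].capitalize()}: {turn['text']}")
    lines.append("Student:")  # the LLM continues from here
    return "\n".join(lines)

dialogue = [
    {"role": "tutor", "text": "What is 3/4 + 1/8?"},
    {"role": "student", "text": "Is it 4/12?"},
    {"role": "tutor", "text": "Close! First find a common denominator."},
]
persona = {"grade": "6th-grade", "topic": "fraction addition"}
print(build_student_prompt(dialogue, persona))
```

Supervised fine-tuning and preference optimization instead train the model directly on real student turns (or on preferences between candidate turns), which is presumably why they outperform prompting alone in the paper's experiments.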