🤖 AI Summary
This study addresses the high cost of manually annotating student behaviors in educational dialogues, a persistent bottleneck for learning analytics research. The authors propose using large language models (LLMs) as coding agents that autonomously refine the prompts of LLM behavioral classifiers through theory-driven error analysis. Combining iterative prompt optimization, 4-fold cross-validation, and researcher review, the approach matches human-level inter-rater agreement on held-out data (κ = 0.78, SD = 0.08) at a cost of only $5–8 per agent. Development-set performance is notably higher (κ = 0.91–0.93), which the authors attribute to overfitting; iterating past the optimum degraded test performance. Beyond improving annotation efficiency, the method also surfaced an undocumented labeling convention: human coders consistently treated expressions of confusion as engagement rather than disengagement.
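The agreement metric used throughout is Cohen's κ, which discounts observed agreement by the agreement expected from the two raters' marginal label distributions. As a quick reference, a minimal stdlib-only computation (the behavior codes in the toy example are hypothetical, not the paper's coding scheme):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent marginal label distributions.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example with made-up behavior codes:
human = ["engaged", "engaged", "confused", "off-task", "engaged", "off-task"]
model = ["engaged", "engaged", "engaged", "off-task", "engaged", "off-task"]
print(round(cohens_kappa(human, model), 2))  # → 0.7
```

A κ of 0.78, the level reported here, is conventionally read as substantial agreement; the key point is that it equals the agreement between the human coders themselves.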
📝 Abstract
Behavioral analysis of tutoring dialogues is essential for understanding student learning, yet manual coding remains a bottleneck. We present a methodology where LLM coding agents autonomously improve the prompts used by LLM classifiers to label educational dialogues. In each iteration, a coding agent runs the classifier against human-labeled validation data, analyzes disagreements, and proposes theory-grounded prompt modifications for researcher review. Applying this approach to 659 AI tutoring sessions across four experiments with three agents and three classifiers, 4-fold cross-validation on held-out data confirmed genuine improvement: the best agent achieved test κ = 0.78 (SD = 0.08), matching human inter-rater reliability (κ = 0.78), at a cost of approximately $5–8 per agent. While development-set performance reached κ = 0.91–0.93, the cross-validated results represent our primary generalization claim. The iterative process also surfaced an undocumented labeling pattern: human coders consistently treated expressions of confusion as engagement rather than disengagement. Continued iteration beyond the optimum led to regression, underscoring the need for held-out validation. We release all prompts, iteration logs, and data.
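The iteration described above (classify, analyze disagreements, propose a revised prompt, stop when performance regresses) can be sketched as a simple hill-climbing loop. This is a hypothetical illustration, not the authors' implementation: `classify`, `revise`, and `agreement` stand in for the LLM classifier, the agent's researcher-reviewed prompt revision, and the κ computation, and the toy stubs below encode prompts as integers purely so the loop runs deterministically.

```python
def refine_prompt(prompt, validation, classify, revise, agreement, max_iters=5):
    """One agent loop: score the current prompt on human-labeled validation
    data, let the agent propose a revision from the disagreements, and keep
    the candidate only while the agreement score keeps improving."""
    best_prompt = prompt
    best_score = agreement(classify(best_prompt, validation))
    for _ in range(max_iters):
        candidate = revise(best_prompt, validation)      # agent's proposed prompt
        score = agreement(classify(candidate, validation))
        if score <= best_score:                          # regression: stop early
            break
        best_prompt, best_score = candidate, score
    return best_prompt, best_score

# Toy stubs: a "prompt" is an integer giving how many validation items the
# classifier labels correctly; each revision helps until it plateaus at 8.
val = list(range(10))
classify = lambda p, data: [i if i < p else -1 for i in data]
agreement = lambda preds: sum(p == i for i, p in enumerate(preds)) / len(preds)
revise = lambda p, data: min(p + 2, 8)

print(refine_prompt(3, val, classify, revise, agreement))  # → (8, 0.8)
```

The early-stop condition mirrors the paper's observation that iterating past the optimum causes regression, which is why the held-out cross-validated κ, not the development-set κ, is the headline number.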