From Tool to Teammate: LLM Coding Agents as Collaborative Partners for Behavioral Labeling in Educational Dialogue Analysis

📅 2026-03-28
🤖 AI Summary
This study addresses the high cost of manually annotating student behaviors in educational dialogues, a bottleneck for learning analytics research. The authors propose using large language model (LLM) coding agents that autonomously refine behavioral classification prompts through theory-driven error analysis and iterative prompt optimization. Combining automated prompt tuning, four-fold cross-validation, and researcher review, the approach matches human-level inter-rater agreement on a held-out test set (κ = 0.78, SD = 0.08) at a cost of only $5–8 per agent. Development-set performance is notably higher (κ = 0.91–0.93), which the authors attribute to overfitting and therefore do not treat as the primary result. Beyond improving annotation efficiency, the iteration process surfaced an undocumented pattern in the human coding itself: coders consistently labeled expressions of confusion as engagement rather than disengagement.
📝 Abstract
Behavioral analysis of tutoring dialogues is essential for understanding student learning, yet manual coding remains a bottleneck. We present a methodology where LLM coding agents autonomously improve the prompts used by LLM classifiers to label educational dialogues. In each iteration, a coding agent runs the classifier against human-labeled validation data, analyzes disagreements, and proposes theory-grounded prompt modifications for researcher review. Applying this approach to 659 AI tutoring sessions across four experiments with three agents and three classifiers, 4-fold cross-validation on held-out data confirmed genuine improvement: the best agent achieved test $κ=0.78$ (SD$=0.08$), matching human inter-rater reliability ($κ=0.78$), at a cost of approximately \$5--8 per agent. While development-set performance reached $κ=0.91$--$0.93$, the cross-validated results represent our primary generalization claim. The iterative process also surfaced an undocumented labeling pattern: human coders consistently treated expressions of confusion as engagement rather than disengagement. Continued iteration beyond the optimum led to regression, underscoring the need for held-out validation. We release all prompts, iteration logs, and data.
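The abstract describes an iterative loop: an agent runs the classifier against human-labeled validation data, scores agreement, and proposes prompt revisions, with iteration stopped once performance regresses. A minimal sketch of that loop is below; `classify` and `revise` are hypothetical callables standing in for the LLM classifier and coding agent, the kappa computation is standard Cohen's kappa, and the researcher-review step from the paper is elided.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def iterate_prompt(prompt, classify, human_labels, items, revise, max_iters=5):
    """One agent loop (hypothetical interface): classify the validation
    items, score against human labels, and keep a revised prompt only if
    agreement improves; stop on the first regression."""
    best_kappa = cohens_kappa(
        [classify(prompt, it) for it in items], human_labels)
    best_prompt = prompt
    for _ in range(max_iters):
        candidate = revise(best_prompt)                  # agent proposes a revision
        preds = [classify(candidate, it) for it in items]
        k = cohens_kappa(preds, human_labels)
        if k > best_kappa:
            best_prompt, best_kappa = candidate, k
        else:
            break                                        # regression: keep the best prompt
    return best_prompt, best_kappa
```

In the paper this loop additionally routes each proposed revision through researcher review and validates the final prompt with 4-fold cross-validation on held-out data rather than the development set used inside the loop.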
Problem

Research questions and friction points this paper is trying to address.

behavioral labeling
educational dialogue analysis
manual coding bottleneck
LLM coding agents
inter-rater reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM coding agents
collaborative prompting
behavioral labeling
educational dialogue analysis
prompt iteration
Eason Chen
Human-Computer Interaction Institute, Carnegie Mellon University
Learning Sciences · Education Technologies · Learning Analytics · Blockchain
Isabel Wang
Carnegie Mellon University
Nina Yuan
Carnegie Mellon University
Sophia Judicke
Carnegie Mellon University
Kayla Beigh
Carnegie Mellon University
Xinyi Tang
Carnegie Mellon University