🤖 AI Summary
This study addresses the high cost of manual annotation in educational dialogue research and the lack of systematic evaluation of large language models (LLMs) regarding accuracy and bias in automated labeling. For the first time, it systematically compares three prompting strategies—few-shot, single-agent, and multi-agent reflection—using GPT-5.2 and Gemini-3 to annotate student dialogues spanning K–12 to higher education across multiple disciplines along four dimensions: cognitive, affective, metacognitive, and behavioral. Results indicate that multi-agent prompting achieves the highest accuracy, though the advantage is not statistically significant, with the best performance on the affective dimension and the weakest on the cognitive dimension. Annotation quality is consistently better for K–12 than for higher-education data, and the study identifies four recurring patterns of systematic bias across dimensions and disciplines.
📝 Abstract
Educational dialogue is critical for decoding student learning processes, yet manual annotation remains time-consuming. This study evaluates the efficacy of GPT-5.2 and Gemini-3 using three prompting strategies (few-shot, single-agent, and multi-agent reflection) across diverse subjects, educational levels, and four coding dimensions. Results indicate that while multi-agent prompting achieved the highest accuracy, the advantage did not reach statistical significance. Accuracy proved highly context-dependent, with significantly higher performance on K–12 datasets than on university-level data, alongside disciplinary variations within the same educational level. Performance peaked in the affective dimension but remained lowest in the cognitive dimension. Furthermore, analysis revealed four bias patterns: (1) Gemini-3 exhibited a consistent optimistic bias in the affective dimension across all subjects; (2) the cognitive dimension displayed domain-specific directional bias, characterized by systematic underestimation in Mathematics versus overestimation in Psychology; (3) both models were more prone to overestimation than underestimation within the metacognitive dimension; and (4) behavioral categories such as questions, negotiations, and statements were frequently misclassified. These results underscore the need for context-sensitive deployment and targeted mitigation of directional biases in automated annotation.
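The abstract names the three prompting strategies but not their mechanics. The sketch below is a minimal illustration of how such an annotation pipeline is commonly structured; it is not taken from the paper. The `call_llm` function, the prompt wording, and the example labels are all hypothetical placeholders (here stubbed to a fixed label so the sketch runs without model access).

```python
# Hedged sketch of few-shot, single-agent, and multi-agent reflection
# annotation. All names and prompts are illustrative assumptions, not
# the study's actual implementation.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model API call (e.g., to GPT-5.2 or
    Gemini-3). Stubbed to return a fixed label for demonstration."""
    return "affective:positive"

# Hypothetical labeled examples for the few-shot condition.
FEW_SHOT_EXAMPLES = [
    ("I finally understand how fractions work!", "cognitive:understanding"),
    ("This problem is so frustrating.", "affective:negative"),
]

def few_shot_annotate(utterance: str) -> str:
    # Few-shot: prepend labeled examples, then ask for a single label.
    shots = "\n".join(f"Utterance: {u}\nLabel: {y}" for u, y in FEW_SHOT_EXAMPLES)
    return call_llm(f"{shots}\nUtterance: {utterance}\nLabel:").strip()

def single_agent_annotate(utterance: str) -> str:
    # Single-agent: one model labels the utterance in a single pass.
    return call_llm(f"Label this student utterance along the coding "
                    f"scheme: {utterance}").strip()

def multi_agent_reflect_annotate(utterance: str, rounds: int = 2) -> str:
    # Multi-agent reflection: an annotator proposes a label, a critic
    # reviews it, and the annotator revises over several rounds.
    label = single_agent_annotate(utterance)
    for _ in range(rounds):
        critique = call_llm(f"Critique label '{label}' for: {utterance}")
        label = call_llm(f"Revise label '{label}' given critique: "
                         f"{critique}").strip()
    return label
```

With the stub in place, all three strategies return the fixed placeholder label; swapping `call_llm` for a real API client would make the reflection loop meaningful.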