🤖 AI Summary
Sparse and imbalanced student-AI dialogue data in educational settings pose significant challenges for theory-driven deductive coding.
Method: We systematically evaluate large language models (LLMs), including prompt-engineered GPT-4 and Claude, against fine-tuned BERT classifiers, establishing a human-annotated coding benchmark and a bias-diagnostic framework to uncover systematic misinterpretations in semantic similarity judgment and theoretical concept mapping. We further propose a hierarchical human-AI collaborative coding paradigm.
Results/Contribution: While LLMs do not outperform supervised, fine-tuned BERT classifiers in standalone coding, the proposed collaborative workflow achieves a 3.2× improvement in coding efficiency and attains Krippendorff's α = 0.87, substantially surpassing the reliability ceiling of fully automated coding. Our core contribution is a theory-sensitive, human-AI co-coding paradigm that reconciles scalability with rigorous interpretability and intercoder reliability in educational discourse analysis.
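For readers unfamiliar with the metric, intercoder reliability figures like the α = 0.87 above are conventionally computed with Krippendorff's alpha over paired code assignments. The sketch below shows one common way to compute it in Python using NLTK's agreement metrics; the coder names, turn IDs, and codes are invented for illustration and are not taken from the paper's data.

```python
# Minimal sketch: Krippendorff's alpha for two coders on a handful of
# dialogue turns, via NLTK's agreement metrics. All labels below are
# hypothetical examples, not the paper's actual coding scheme.
from nltk.metrics.agreement import AnnotationTask

# Each triple is (coder_id, item_id, assigned_code).
ratings = [
    ("human", "turn_1", "question"), ("ai", "turn_1", "question"),
    ("human", "turn_2", "feedback"), ("ai", "turn_2", "feedback"),
    ("human", "turn_3", "off_task"), ("ai", "turn_3", "feedback"),  # one disagreement
    ("human", "turn_4", "question"), ("ai", "turn_4", "question"),
]

task = AnnotationTask(data=ratings)
print(f"Krippendorff's alpha: {task.alpha():.2f}")
```

Unlike simple percent agreement, alpha corrects for chance agreement, which matters for imbalanced code distributions where coders can agree often just by assigning the majority code.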
📝 Abstract
With generative artificial intelligence driving the growth of dialogic data in education, automated coding is a promising direction for learning analytics to improve efficiency. This surge highlights the need to understand the nuances of student-AI interactions, especially rare yet crucial ones. However, automated coding may struggle to capture these rare codes due to imbalanced data, while human coding remains time-consuming and labour-intensive. The current study examined the potential of large language models (LLMs) to approximate or replace humans in deductive, theory-driven coding, while also exploring how human-AI collaboration might support such coding tasks at scale. We compared the coding performance of small transformer classifiers (e.g., BERT) and LLMs on two datasets, with particular attention to imbalanced head-tail distributions in dialogue codes. Our results showed that LLMs did not outperform BERT-based models and exhibited systematic errors and biases in deductive coding tasks. We designed and evaluated a human-AI collaborative workflow that improved coding efficiency while maintaining coding reliability. Our findings reveal both the limitations of LLMs, especially their difficulties with semantic similarity and theoretical interpretation, and the indispensable role of human judgment, while demonstrating the practical promise of human-AI collaborative workflows for coding.
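To make the head-tail issue concrete: under an imbalanced code distribution, performance is usually reported per code rather than as overall accuracy, because rare (tail) codes can be missed entirely while frequent (head) codes keep aggregate scores high. The sketch below illustrates this with a toy three-code scheme; the codes and counts are hypothetical and do not reproduce the paper's data.

```python
# Minimal sketch: per-code F1 to surface head-vs-tail performance gaps
# under an imbalanced code distribution. Codes and counts are invented
# for illustration only.
from sklearn.metrics import f1_score

codes = ["question", "feedback", "off_task"]  # "off_task" is a rare tail code
y_true = ["question"] * 8 + ["feedback"] * 5 + ["off_task"] * 2
y_pred = ["question"] * 8 + ["feedback"] * 4 + ["question"] * 3  # tail code missed entirely

per_code_f1 = f1_score(y_true, y_pred, labels=codes, average=None)
for code, score in zip(codes, per_code_f1):
    print(f"{code:<10} F1 = {score:.2f}")

# Macro-F1 weights every code equally, so failures on tail codes drag
# it down even when accuracy stays high on the head codes alone.
print(f"macro-F1 = {f1_score(y_true, y_pred, labels=codes, average='macro'):.2f}")
```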