When LLMs Fall Short in Deductive Coding: Model Comparison and Human-AI Collaboration Workflow Design

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse and imbalanced student-AI dialogue data in educational settings pose significant challenges for theory-driven deductive coding. Method: We systematically evaluate large language models (LLMs), including prompt-engineered GPT-4 and Claude, against fine-tuned BERT classifiers, establishing a human-annotated coding benchmark and a bias diagnostic framework to uncover systematic misinterpretations in semantic similarity judgment and theoretical concept mapping. We further propose a hierarchical human-AI collaborative coding paradigm. Results/Contribution: While LLMs do not outperform supervised fine-tuned BERT in standalone coding, the proposed collaborative workflow achieves a 3.2× improvement in coding efficiency and attains Krippendorff's α = 0.87, substantially surpassing the reliability ceiling of fully automated coding. Our core contribution is a theory-sensitive human-AI co-coding paradigm that reconciles scalability with rigorous interpretability and intercoder reliability in educational discourse analysis.

📝 Abstract
With generative artificial intelligence driving the growth of dialogic data in education, automated coding is a promising direction for learning analytics to improve efficiency. This surge highlights the need to understand the nuances of student-AI interactions, especially those that are rare yet crucial. However, automated coding may struggle to capture these rare codes due to imbalanced data, while human coding remains time-consuming and labour-intensive. The current study examined the potential of large language models (LLMs) to approximate or replace humans in deductive, theory-driven coding, while also exploring how human-AI collaboration might support such coding tasks at scale. We compared the coding performance of small transformer classifiers (e.g., BERT) and LLMs on two datasets, with particular attention to imbalanced head-tail distributions in dialogue codes. Our results showed that LLMs did not outperform BERT-based models and exhibited systematic errors and biases in deductive coding tasks. We designed and evaluated a human-AI collaborative workflow that improved coding efficiency while maintaining coding reliability. Our findings reveal both the limitations of LLMs, especially their difficulties with semantic similarity and theoretical interpretation, and the indispensable role of human judgment, while demonstrating the practical promise of human-AI collaborative workflows for coding.
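The head-tail imbalance the abstract describes is the standard motivation for class weighting when fine-tuning a supervised classifier such as BERT. A minimal sketch of inverse-frequency weights (the same "balanced" heuristic used by scikit-learn); the code labels and the weighting scheme here are illustrative assumptions, not the paper's actual setup:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each code inversely to its frequency, so rare 'tail'
    codes contribute as much to the training loss as 'head' codes."""
    counts = Counter(labels)
    total = len(labels)
    # weight_c = total / (num_classes * count_c); a perfectly
    # balanced dataset gets weight 1.0 for every class.
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Illustrative head-tail distribution: one dominant code, two rare ones.
labels = ["on_task"] * 8 + ["monitoring"] * 1 + ["help_seeking"] * 1
print(inverse_frequency_weights(labels))
```

These per-class weights would typically be passed to the loss function (e.g., a weighted cross-entropy) when fine-tuning the classifier, down-weighting the dominant head code and up-weighting the rare tail codes.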
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' effectiveness in deductive coding tasks
Addressing imbalanced data challenges in automated dialogue coding
Designing human-AI workflows to improve coding efficiency and reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compared BERT and LLMs for deductive coding tasks
Designed human-AI collaborative workflow for coding efficiency
Addressed imbalanced data with human judgment integration
Zijian Li
Graduate School of Education, Peking University, China
Luzhen Tang
Graduate School of Education, Peking University, China
Mengyu Xia
Graduate School of Education, Peking University, China
Xinyu Li
Monash University, Australia
Naping Chen
Medical College, Shantou University, China
Dragan Gašević
Monash University, Australia
Yizhou Fan
Peking University
Learning Analytics · AI in Education · Self-regulated Learning · AI for Science