🤖 AI Summary
Manual coding of collaborative problem-solving (CPS) dialogues is labor-intensive, error-prone, and hard to scale, which hinders large-scale assessment of 21st-century competencies.
Method: This study investigates the feasibility of automating CPS dialogue coding with large language models (LLMs) and proposes a prompt optimization framework driven by feedback from misclassified cases. We systematically evaluate multiple generations of ChatGPT, including GPT-4o-mini and GPT-o3-mini, on CPS communication coding tasks across five real-world datasets and two established coding frameworks.
Contribution/Results: We find that reasoning-focused LLMs do not necessarily outperform lightweight variants in this domain. ChatGPT codes communication data to a satisfactory level, although performance varies with the model, the coding framework, and task characteristics. Refining prompts with feedback from miscoded cases improves accuracy on some tasks, but not consistently. These results offer practical guidance for building scalable, AI-assisted coding workflows for assessing collaborative problem-solving skills.
📝 Abstract
Collaborative problem solving (CPS) is widely recognized as a critical 21st-century skill. Assessing CPS depends heavily on coding the communication data using a construct-relevant framework, and this process has long been a major bottleneck to scaling up such assessments. Based on five datasets and two coding frameworks, we demonstrate that ChatGPT can code communication data to a satisfactory level, though performance varies across ChatGPT models, and depends on the coding framework and task characteristics. Interestingly, newer reasoning-focused models such as GPT-o1-mini and GPT-o3-mini do not necessarily yield better coding results. Additionally, we show that refining prompts based on feedback from miscoded cases can improve coding accuracy in some instances, though the effectiveness of this approach is not consistent across all tasks. These findings offer practical guidance for researchers and practitioners in developing scalable, efficient methods to analyze communication data in support of 21st-century skill assessment.
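The prompt-refinement idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the category labels, the prompt wording, and the `toy_coder` stand-in for a ChatGPT call are all assumptions made for illustration. The loop codes a small human-labeled set, collects the miscoded utterances, appends them to the prompt as corrections, and keeps the revised prompt only if accuracy improves.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]          # (utterance, human-assigned code)
Coder = Callable[[str, str], str]  # (prompt, utterance) -> predicted code


def accuracy(coder: Coder, prompt: str, data: List[Example]) -> float:
    """Share of utterances whose predicted code matches the human code."""
    return sum(coder(prompt, u) == gold for u, gold in data) / len(data)


def collect_miscoded(coder: Coder, prompt: str, data: List[Example]):
    """Return (utterance, human code, predicted code) for every disagreement."""
    miscoded = []
    for utterance, gold in data:
        predicted = coder(prompt, utterance)
        if predicted != gold:
            miscoded.append((utterance, gold, predicted))
    return miscoded


def refine_prompt(prompt: str, miscoded, max_examples: int = 3) -> str:
    """Append a few miscoded cases to the prompt as explicit corrections."""
    lines = [f'- "{u}" should be coded {gold}, not {pred}'
             for u, gold, pred in miscoded[:max_examples]]
    if not lines:
        return prompt
    return prompt + "\nCorrections from earlier miscoded cases:\n" + "\n".join(lines)


def optimize(coder: Coder, prompt: str, data: List[Example], rounds: int = 3):
    """Keep a refined prompt only while it improves accuracy on the labeled set."""
    best_prompt, best_acc = prompt, accuracy(coder, prompt, data)
    for _ in range(rounds):
        miscoded = collect_miscoded(coder, best_prompt, data)
        if not miscoded:
            break
        candidate = refine_prompt(best_prompt, miscoded)
        candidate_acc = accuracy(coder, candidate, data)
        if candidate_acc <= best_acc:
            break  # refinement did not help, as the abstract reports for some tasks
        best_prompt, best_acc = candidate, candidate_acc
    return best_prompt, best_acc


def toy_coder(prompt: str, utterance: str) -> str:
    """Hypothetical stand-in for an LLM call: follows corrections, else a keyword rule."""
    marker = f'"{utterance}" should be coded '
    if marker in prompt:
        return prompt.split(marker, 1)[1].split(",", 1)[0]
    return "negotiating" if "?" in utterance else "sharing"


# Hypothetical labels and utterances, not taken from the paper's frameworks.
BASE_PROMPT = "Assign one CPS code to the utterance."
DEV = [
    ("I think the answer is 12", "sharing"),
    ("Why do you think so?", "negotiating"),
    ("OK let's submit", "regulating"),
    ("Hi there", "maintaining"),
]

if __name__ == "__main__":
    refined_prompt, refined_acc = optimize(toy_coder, BASE_PROMPT, DEV)
    print(refined_acc)
    print(refined_prompt)
```

In practice the refined prompt should be scored on held-out data rather than on the cases used to build it, since accuracy measured on the same utterances that supplied the corrections will overstate the gain.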