🤖 AI Summary
This study addresses the lack of reliable mechanisms for detecting incorrect large language model (LLM) predictions in the automatic annotation of educational dialogue. The authors propose using the LLM’s own generated reasoning text to train supervised classifiers that predict whether its assigned label is correct. Using a dataset of 30,300 teacher utterances, each labeled by multiple LLMs with an instructional move construct and accompanying reasoning, together with human-verified ground-truth labels, they encode the reasoning with TF-IDF and evaluate five supervised classifiers, complemented by a LIWC-based analysis of linguistic markers. A Random Forest classifier achieves an F1 score of 0.83 (recall: 0.854), outperforming baselines. Specialist detectors trained for specific instructional move constructs further improve performance on difficult constructs. The results indicate that LLM-generated reasoning can signal prediction correctness: reasoning behind correct predictions tends to use causal language (e.g., because, therefore), whereas reasoning behind incorrect predictions is more likely to contain epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize).
📝 Abstract
Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Reasoning behind correct predictions exhibits grounded causal language (e.g., because, therefore), while reasoning behind incorrect predictions is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
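The marker analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the real LIWC dictionaries are proprietary and much larger, so the `MARKERS` lexicons below are hypothetical stand-ins built only from the example words quoted in the abstract. Differentiation is omitted because the abstract gives no example words for it. The function name `marker_rates` and the two sample reasoning strings are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons: only the example words quoted in the abstract.
# The real LIWC categories are proprietary and contain many more entries.
MARKERS = {
    "causation": {"because", "therefore"},
    "tentativeness": {"might", "could"},
    "insight": {"think", "realize"},
}


def marker_rates(reasoning: str) -> dict[str, float]:
    """Return each marker category's share of the tokens in one reasoning text."""
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", reasoning.lower())
    if not tokens:
        return {name: 0.0 for name in MARKERS}
    counts = Counter(tokens)
    return {
        name: sum(counts[word] for word in words) / len(tokens)
        for name, words in MARKERS.items()
    }


# Invented sample reasoning texts, not drawn from the study's data.
grounded = "The teacher restates the answer because the student was unclear"
hedged = "I think this might be a revoicing move but it could be something else"

print(marker_rates(grounded))
print(marker_rates(hedged))
```

Rates like these could be compared between reasoning attached to correct and incorrect labels, or appended to TF-IDF features as extra inputs to a classifier. Whether that matches the study's actual feature pipeline is not stated in the abstract.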