🤖 AI Summary
Large language models (LLMs) are increasingly used to code tutoring dialogues for learning analytics, but concerns about their reliability limit their utility. Method: The paper proposes a verification-oriented orchestration framework that explicitly models the "verifier → annotator" relationship, systematically comparing unverified annotation, self-verification, and cross-verification for qualitative coding. Experiments run GPT, Claude, and Gemini on authentic one-to-one math tutoring transcripts, benchmarked against blinded, disagreement-focused human adjudication as the gold standard and evaluated with Cohen's kappa. Results: Orchestration yields a 58% overall improvement in kappa; self-verification nearly doubles agreement with the human gold labels relative to unverified baselines, while cross-verification yields a mean 37% gain, with some verifier-annotator pairs exceeding self-verification and others reducing alignment. The work formalizes directional verification through a verifier(annotator) notation, establishing a reproducible, interpretable paradigm for reliable LLM-assisted qualitative coding.
📝 Abstract
Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, prompting models to check their own labels (self-verification) or to audit one another (cross-verification), improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification across all orchestration configurations. Outputs are benchmarked against a blinded, disagreement-focused human adjudication using Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with the largest gains for challenging tutor moves. Cross-verification achieves a 37 percent improvement on average, with pair- and construct-dependent effects: some verifier-annotator pairs exceed self-verification, while others reduce alignment, reflecting differences in verifier strictness. We contribute: (1) a flexible orchestration framework instantiating control, self-, and cross-verification; (2) an empirical comparison across frontier LLMs on authentic tutoring data with blinded human "gold" labels; and (3) a concise notation, verifier(annotator) (e.g., Gemini(GPT) or Claude(Claude)), to standardize reporting and make directional effects explicit for replication. Results position verification as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics.
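To make the verifier(annotator) notation concrete, the sketch below shows one way the three orchestration conditions could be wired and scored against human gold labels. It is an illustrative assumption, not the paper's released code: the `orchestrate` and `agreement` names, the prompt wording, and the callable-based model interface are hypothetical stand-ins for the actual GPT, Claude, and Gemini clients and prompts.

```python
from typing import Callable, Optional
from sklearn.metrics import cohen_kappa_score

# A "model" here is just a callable from prompt to label; in practice this would
# wrap a GPT, Claude, or Gemini client (hypothetical interface, not the paper's code).
Model = Callable[[str], str]

def orchestrate(annotator: Model, verifier: Optional[Model], utterances: list[str]) -> list[str]:
    """One verifier(annotator) configuration:
       verifier is None      -> unverified control
       verifier is annotator -> self-verification, e.g. Claude(Claude)
       otherwise             -> cross-verification, e.g. Gemini(GPT)"""
    labels = [annotator(f"Assign a discourse code to this tutor move: {u}") for u in utterances]
    if verifier is None:
        return labels
    # The verifier audits each proposed label and returns the final label.
    return [
        verifier(f"Audit the code '{lab}' for this tutor move and return the final code: {u}")
        for u, lab in zip(utterances, labels)
    ]

def agreement(predicted: list[str], gold: list[str]) -> float:
    """Cohen's kappa against the blinded human adjudication."""
    return cohen_kappa_score(gold, predicted)
```

Under this reading, Claude(Claude) corresponds to passing the same callable as both annotator and verifier, Gemini(GPT) to GPT annotating while Gemini audits, and `verifier=None` to the unverified control.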