🤖 AI Summary
This work addresses the challenge that large language models (LLMs) struggle to accurately annotate instructional dialogue acts when they lack domain-specific knowledge, particularly for rare and context-dependent labels. The authors propose a domain-adaptive retrieval-augmented generation (RAG) approach that requires no fine-tuning of the generative model. Instead, they fine-tune a lightweight embedding model and build an utterance-level index that dynamically retrieves domain-relevant few-shot examples for the LLM. This method substantially improves annotation performance, achieving Cohen's κ scores of 0.526–0.580 on TalkMoves and 0.659–0.743 on Eedi, with top-1 label match rates rising to 62.0% and 73.1%, respectively; the gains are largest on rare labels. A key insight is that optimizing the retrieval component alone suffices, and the utterance-level index proves critical to the observed gains.
📝 Abstract
Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's $\kappa$ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines ($\kappa = 0.275$-$0.413$ and $\kappa = 0.160$-$0.410$). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with top-1 label match rates improving from 39.7% to 62.0% on TalkMoves and 52.9% to 73.1% on Eedi under domain-adapted retrieval. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective path toward expert-level pedagogical dialogue annotation while keeping the generative model frozen.
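To make the retrieval side of such a pipeline concrete, here is a minimal sketch of utterance-level indexing with few-shot prompt construction. This is not the authors' implementation: the bag-of-words embedding stands in for their fine-tuned embedding model, and the utterances, labels, and function names are illustrative placeholders rather than actual TalkMoves or Eedi data.

```python
# Sketch: index individual labeled utterances (not whole dialogues),
# embed an incoming utterance, and retrieve the top-k most similar
# labeled examples to serve as few-shot demonstrations in the LLM prompt.
import math
from collections import Counter

def embed(utterance: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts.
    A real pipeline would use the fine-tuned embedding model here."""
    return Counter(utterance.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Utterance-level index: one entry per labeled utterance.
# Labels below are hypothetical tutoring-move tags.
labeled_utterances = [
    ("Can you explain why you chose that step?", "press_for_reasoning"),
    ("Great job, that's exactly right!", "praise"),
    ("What does the numerator represent here?", "press_for_reasoning"),
    ("Let's move on to the next problem.", "manage_turn"),
]
index = [(embed(u), u, label) for u, label in labeled_utterances]

def retrieve(query: str, k: int = 2):
    """Return the k indexed utterances most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e[0]), reverse=True)
    return [(u, label) for _, u, label in ranked[:k]]

def build_prompt(query: str) -> str:
    """Assemble retrieved examples into a few-shot annotation prompt."""
    demos = "\n".join(
        f'Utterance: "{u}"\nLabel: {label}' for u, label in retrieve(query)
    )
    return f'{demos}\nUtterance: "{query}"\nLabel:'

print(build_prompt("Why did you choose to divide here?"))
```

Because each index entry is a single utterance, retrieval can surface demonstrations of rare moves whenever a query resembles them, which matches the paper's finding that utterance-level indexing drives the gains on rare labels.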