🤖 AI Summary
This paper challenges the practice of treating a single inter-rater reliability (IRR) threshold, such as Cohen's kappa, as the gatekeeper to ground truth in AI-driven educational systems, arguing that threshold-based heuristics are easily misapplied to high-inference constructs, skewed label distributions, and temporally segmented multimodal data, and that using large language models as annotators adds risks of automation bias and circular validation. The authors propose four practical shifts: treating IRR as a diagnostic signal for localizing disagreement and refining constructs rather than as a rigid acceptance criterion; transparently reporting rater expertise, codebook development, reconciliation procedures, and segmentation rules; mitigating risks in LLM-based annotation through bias audits and verification workflows; and complementing agreement statistics with validity and effectiveness evidence, including uncertainty-aware labeling, criterion-related checks, and close-the-loop evaluations of whether systems trained on the labels improve learning. Case studies of multimodal tutoring data illustrate these shifts and ground actionable recommendations for building more robust ground-truth datasets for educational AI.
📝 Abstract
Generative Artificial Intelligence (GenAI) is now widespread in education, yet the efficacy of GenAI systems remains constrained by the quality and interpretation of the labeled data used to train and evaluate them. Studies commonly report inter-rater reliability (IRR), often summarized by a single coefficient such as Cohen's kappa (κ), as a gatekeeper to "ground truth." We argue that many educational assessment and practice-support settings involve challenges, such as high-inference constructs, skewed label distributions, and temporally segmented multimodal data, that make threshold-based heuristics for IRR prone to misapplication or misinterpretation. The growing use of large language models as annotators and judges introduces further risks, such as automation bias and circular validation. We propose four practical shifts for establishing ground truth: (1) treat IRR as a diagnostic signal to localize disagreement and refine constructs rather than as a mechanical acceptance threshold (e.g., κ > 0.8); (2) require transparent reporting of rater expertise, codebook development, reconciliation procedures, and segmentation rules; (3) mitigate risks in LLM annotation through bias audits and verification workflows; and (4) complement agreement statistics with validity and effectiveness evidence for the intended use, including uncertainty-aware labeling (e.g., assigning multiple labels to the same item to capture nuance), criterion-related checks (e.g., predictive tests of whether labels forecast the intended outcome), and close-the-loop evaluations of whether systems trained on these labels improve learning beyond a reasonable control. We illustrate these shifts through case studies of multimodal tutoring data and provide actionable recommendations for strengthening the evidence base of labeled AIED datasets.
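As a minimal sketch of shift (1), the snippet below computes Cohen's kappa for two hypothetical raters and then tabulates where their labels diverge, so disagreement can be localized to specific label pairs rather than reduced to a pass/fail threshold. The toy labels, rater names, and use of scikit-learn's `cohen_kappa_score` are illustrative assumptions, not artifacts from the paper.

```python
# Illustrative sketch: Cohen's kappa as a diagnostic, not a gate.
# Toy data and variable names are hypothetical.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

rater_a = ["praise", "hint", "hint", "other", "praise", "hint", "other", "other"]
rater_b = ["praise", "hint", "other", "other", "praise", "other", "other", "hint"]

# Overall chance-corrected agreement (the usual single-number report).
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")

# Diagnostic step: count disagreements by label pair to localize where
# raters diverge, instead of accepting or rejecting the whole dataset
# on a fixed threshold (e.g., kappa > 0.8).
confusions = Counter((a, b) for a, b in zip(rater_a, rater_b) if a != b)
for (a, b), n in confusions.most_common():
    print(f"rater A: '{a}' vs rater B: '{b}' -> {n} item(s)")
```

In this reading, a modest kappa concentrated in one confusable pair (e.g., "hint" vs. "other") would point to a codebook refinement or reconciliation discussion rather than wholesale rejection of the data.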