🤖 AI Summary
NLP data quality assessment has long relied on inter-annotator agreement, largely overlooking intra-annotator consistency: the temporal stability of an individual annotator's judgements. This neglect challenges the implicit assumption that gold labels constitute ground truth. Method: A systematic review of how agreement is reported for major NLP datasets, together with proposed exploratory repeated-annotation experiments that quantify intra-annotator agreement using Cohen's and Fleiss' kappa, complemented by qualitative analysis of perceived ambiguity and subjectivity. Contribution/Results: The review shows that mainstream NLP datasets rarely report intra-annotator consistency, and the proposed experiments are designed to measure how stable individual annotators' labels are over time and to tease apart the respective influence of textual ambiguity and subjectivity on annotation stability. The work positions intra-annotator agreement as a foundational data quality metric, providing both a methodological framework and concrete guidance for constructing more robust, reproducible NLP datasets.
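A minimal sketch of how intra-annotator agreement can be quantified, assuming Python with scikit-learn and statsmodels (the tooling, label scheme, and data below are illustrative assumptions, not the paper's setup): the same annotator's labels from repeated rounds are treated as a pair (or group) of raters.

```python
# Hypothetical sketch: intra-annotator agreement computed by comparing an
# annotator's labels for the same items across repeated annotation rounds.
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One annotator labels the same 8 items in three separate sessions
# (0 = negative, 1 = neutral, 2 = positive); values are made up.
round_1 = [2, 0, 0, 2, 1, 2, 0, 1]
round_2 = [2, 0, 2, 2, 1, 1, 0, 1]
round_3 = [2, 0, 0, 2, 1, 2, 0, 2]

# Cohen's kappa: chance-corrected agreement between two rounds.
print(f"Cohen's kappa (round 1 vs 2): {cohen_kappa_score(round_1, round_2):.3f}")

# Fleiss' kappa generalises to three or more rounds: rows are items,
# columns are rounds; aggregate_raters builds the item-by-category count table.
table, _ = aggregate_raters(list(zip(round_1, round_2, round_3)))
print(f"Fleiss' kappa (rounds 1-3): {fleiss_kappa(table):.3f}")
```

In this framing, low kappa flags items whose labels drift across sessions, which the paper's proposed experiments relate to perceived ambiguity and subjectivity of the text items.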
📝 Abstract
We commonly use agreement measures to assess the utility of judgements made by human annotators in Natural Language Processing (NLP) tasks. While inter-annotator agreement is frequently used as an indication of label reliability by measuring consistency between annotators, we argue for the additional use of intra-annotator agreement to measure label stability over time. However, in a systematic review, we find that the latter is rarely reported in this field. Calculating these measures can act as important quality control and provide insights into why annotators disagree. We propose exploratory annotation experiments to investigate the relationships between these measures and perceptions of subjectivity and ambiguity in text items.