Consistency is Key: Disentangling Label Variation in Natural Language Processing with Intra-Annotator Agreement

📅 2023-01-25
🏛️ arXiv.org
📈 Citations: 22 (influential: 1)
🤖 AI Summary
NLP data quality assessment has long relied on inter-annotator agreement, overlooking intra-annotator consistency, that is, the stability of an individual annotator's judgments over time. This neglect calls the implicit "gold label as ground truth" assumption into question. Method: The paper systematically reviews reporting practices in the field and proposes exploratory repeated-annotation experiments in which intra-annotator agreement is quantified with Cohen's and Fleiss' Kappa, complemented by qualitative analysis of how annotators perceive subjectivity and ambiguity in text items. Contribution/Results: The review finds that intra-annotator consistency is rarely reported for mainstream NLP datasets, and the proposed experiments are designed to disentangle the influence of textual ambiguity and subjectivity on label stability. The work positions intra-annotator agreement as a foundational data quality metric that serves as quality control and yields insight into why annotators disagree, supporting more robust, reproducible NLP datasets.
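To make the metric concrete, the minimal sketch below computes intra-annotator agreement as Cohen's Kappa between two annotation rounds completed by the same annotator. The labels and the sentiment-style task are illustrative placeholders, not data or code from the paper.

```python
# Intra-annotator agreement: Cohen's Kappa between two annotation rounds
# completed by the SAME annotator at different points in time.
# The label sequences below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

# Round 1: the annotator labels ten items (e.g., a sentiment task).
round_1 = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu", "pos", "neg"]
# Round 2: the same annotator relabels the identical items some weeks later.
round_2 = ["pos", "neg", "pos", "pos", "neu", "pos", "neg", "neg", "pos", "neg"]

# Kappa corrects raw percent agreement for agreement expected by chance.
kappa = cohen_kappa_score(round_1, round_2)
print(f"Intra-annotator Cohen's kappa: {kappa:.2f}")
```

The same function is conventionally used for inter-annotator agreement between two different annotators; because both sequences here come from one annotator, the score reflects temporal stability rather than cross-annotator consistency.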
📝 Abstract
We commonly use agreement measures to assess the utility of judgements made by human annotators in Natural Language Processing (NLP) tasks. While inter-annotator agreement is frequently used as an indication of label reliability by measuring consistency between annotators, we argue for the additional use of intra-annotator agreement to measure label stability over time. However, in a systematic review, we find that the latter is rarely reported in this field. Calculating these measures can act as important quality control and provide insights into why annotators disagree. We propose exploratory annotation experiments to investigate the relationships between these measures and perceptions of subjectivity and ambiguity in text items.
Problem

Research questions and friction points this paper is trying to address.

Measuring intra-annotator agreement for label stability in NLP tasks
Using agreement measures as quality control to understand why annotators disagree
Assessing annotator inconsistency across multiple NLP classification tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes intra-annotator agreement for label stability
Measures annotator consistency over time systematically (see the Fleiss' Kappa sketch after this list)
Conducts annotation experiments across four NLP tasks
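When the same annotator labels the items in more than two rounds, one way to quantify stability is to treat each round as a separate "rater" and compute Fleiss' Kappa. The sketch below is a minimal illustration under that assumption, with hypothetical integer-coded labels; it is not the paper's experimental setup.

```python
# Label stability across three annotation rounds by the same annotator,
# treating each round as a "rater" and computing Fleiss' Kappa.
# All labels below are hypothetical placeholders (0 = neg, 1 = neu, 2 = pos).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows are items, columns are annotation rounds.
rounds = np.array([
    [2, 2, 2],
    [0, 0, 1],
    [1, 1, 1],
    [2, 0, 2],
    [0, 0, 0],
    [1, 2, 1],
])

# aggregate_raters converts item-by-round labels into item-by-category counts,
# the table format that fleiss_kappa expects.
table, _ = aggregate_raters(rounds)
print(f"Intra-annotator Fleiss' kappa: {fleiss_kappa(table, method='fleiss'):.2f}")
```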