🤖 AI Summary
This work addresses substantial annotator disagreement in subjective NLP tasks, where conventional majority voting discards valuable perspective diversity and modeling each annotator individually is costly and generalizes poorly. The authors propose clustering annotators by their agreement and systematically evaluate four aggregation strategies (majority voting, ensemble methods, multi-label learning, and multi-task learning) across three tasks (sentiment analysis, emotion classification, and hate speech detection) and 40 datasets in 18 typologically diverse languages. Experimental results show that combining agreement-based annotator clustering with multi-label or multi-task learning preserves annotation diversity while significantly improving classification performance. This combination outperforms both majority voting and per-annotator modeling by treating disagreement as an informative signal rather than noise.
📝 Abstract
Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across NLP tasks. We propose an agreement-based clustering technique to model disagreement between annotators. We conduct comprehensive experiments on 40 datasets spanning 18 typologically diverse languages and covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches model clustered annotators better than the ensemble and majority-vote approaches.
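
To make the idea of agreement-based clustering concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the agreement measure or the clustering algorithm, so this example assumes raw percent agreement between annotator pairs and a simple threshold-based single-linkage grouping. All function names, the `threshold` value, and the toy annotations are hypothetical. Each cluster's majority label then forms one slot of a per-item target vector, which is the kind of target a multi-label or multitask model could be trained on.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(a, b):
    """Fraction of co-annotated items on which two annotators agree."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0
    return sum(a[item] == b[item] for item in shared) / len(shared)


def cluster_annotators(annotations, threshold=0.7):
    """Group annotators linked by pairwise agreement >= threshold.

    Assumption: single-linkage grouping via union-find; the paper may
    use a different clustering procedure.
    """
    parent = {name: name for name in annotations}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(annotations, 2):
        if pairwise_agreement(annotations[a], annotations[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in annotations:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


def cluster_labels(annotations, clusters, item):
    """Majority label per cluster for one item (None if nobody labeled it).

    Ties are resolved by Counter's ordering (first label encountered).
    """
    labels = []
    for members in clusters:
        votes = Counter(
            annotations[m][item] for m in members if item in annotations[m]
        )
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels


# Hypothetical toy data: annotator -> {item id -> binary label}.
annotations = {
    "ann1": {"t1": 1, "t2": 0, "t3": 1, "t4": 0},
    "ann2": {"t1": 1, "t2": 0, "t3": 1, "t4": 1},
    "ann3": {"t1": 0, "t2": 1, "t3": 0, "t4": 1},
    "ann4": {"t1": 0, "t2": 1, "t3": 0, "t4": 0},
}

clusters = cluster_annotators(annotations, threshold=0.7)
print(clusters)
print(cluster_labels(annotations, clusters, "t1"))
```

In this toy data, a single majority vote over all four annotators would tie on every item, whereas the per-cluster targets keep both perspectives visible.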