🤖 AI Summary
This study addresses the overreliance of large language models (LLMs) on aggregated labels, and their neglect of inter-annotator disagreement, in subjective tasks such as hate speech and offensiveness detection, by proposing a *disagreement-aware* modeling paradigm. Methodologically, it first systematically validates zero-shot multi-perspective generation; then designs a hybrid exemplar selection strategy that integrates annotation entropy with semantic similarity; and finally enhances in-context prompting via BM25, PLM embeddings, and curriculum learning. Results show that zero-shot generation yields diverse, plausible subjective judgments; that few-shot performance hinges on exemplar selection rather than ordering; and that multi-perspective outputs improve model interpretability and fairness. The core contribution is a move beyond the conventional *consensus-oriented* paradigm, offering a novel framework for modeling human subjectivity in LLMs.
📝 Abstract
Large Language Models (LLMs) have shown strong performance on NLP classification tasks. However, they typically rely on aggregated labels, often obtained via majority voting, which can obscure the human disagreement inherent in subjective annotations. This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection. We use in-context learning (ICL) in zero-shot and few-shot settings, evaluating four open-source LLMs across three label modeling strategies: aggregated hard labels, disaggregated hard labels, and soft labels. In few-shot prompting, we assess demonstration selection methods based on textual similarity (BM25, PLM-based), annotation disagreement (entropy), a combined ranking, and example ordering strategies (random vs. curriculum-based). Results show that multi-perspective generation is viable in zero-shot settings, while few-shot setups often fail to capture the full spectrum of human judgments. Prompt design and demonstration selection notably affect performance, though example ordering has limited impact. These findings highlight the challenges of modeling subjectivity with LLMs and the importance of building more perspective-aware, socially intelligent models.
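As a rough illustration of how a hybrid demonstration-selection strategy of this kind might work, the sketch below combines a minimal BM25 textual-similarity score with the Shannon entropy of each candidate's disaggregated annotations, then ranks by a weighted mix of the two. The field names, the `alpha` mixing weight, and the min-max normalization are illustrative assumptions, not the paper's exact scoring function.

```python
import math
from collections import Counter

def annotation_entropy(labels):
    """Shannon entropy (bits) of a disaggregated label set, e.g. ['off', 'not', 'off']."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bm25_lite(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Minimal Okapi BM25 score of one candidate document against the query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in set(query_tokens):
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_tokens.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

def select_demonstrations(query_tokens, pool, k=4, alpha=0.5):
    """Rank candidates by alpha * similarity + (1 - alpha) * disagreement (both min-max scaled)."""
    corpus = [ex["tokens"] for ex in pool]
    sims = [bm25_lite(query_tokens, ex["tokens"], corpus) for ex in pool]
    ents = [annotation_entropy(ex["labels"]) for ex in pool]
    norm = lambda xs: [(x - min(xs)) / ((max(xs) - min(xs)) or 1) for x in xs]
    sims, ents = norm(sims), norm(ents)
    ranked = sorted(zip(pool, sims, ents),
                    key=lambda t: alpha * t[1] + (1 - alpha) * t[2],
                    reverse=True)
    return [ex for ex, _, _ in ranked[:k]]
```

Setting `alpha=1.0` reduces this to pure BM25 retrieval, while `alpha=0.0` selects only the most contested examples; a combined ranking of this shape is one way to surface exemplars that are both topically relevant and rich in annotator disagreement.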