🤖 AI Summary
This study addresses the overreliance of large language models (LLMs) on aggregated labels, and their neglect of inter-annotator disagreement, in subjective tasks such as hate speech and offensiveness detection, by proposing a *disagreement-aware* modeling paradigm. Methodologically, it first systematically validates zero-shot multi-perspective generation; then designs a hybrid exemplar selection strategy that integrates annotation entropy with semantic similarity; and finally enhances in-context prompting via BM25, PLM embeddings, and curriculum learning. Results show that zero-shot generation yields diverse, plausible subjective judgments; that few-shot performance hinges on exemplar selection rather than ordering; and that multi-perspective outputs improve model interpretability and fairness. The core contribution is a move beyond the conventional *consensus-oriented* paradigm, offering a novel framework for modeling human subjectivity in LLMs.
📝 Abstract
Large Language Models (LLMs) have shown strong performance on NLP classification tasks. However, they typically rely on aggregated labels, often obtained via majority voting, which can obscure the human disagreement inherent in subjective annotations. This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection. We use in-context learning (ICL) in zero-shot and few-shot settings, evaluating four open-source LLMs across three label modeling strategies: aggregated hard labels, disaggregated hard labels, and soft labels. In few-shot prompting, we assess demonstration selection methods based on textual similarity (BM25, PLM-based), annotation disagreement (entropy), a combined ranking, and example ordering strategies (random vs. curriculum-based). Results show that multi-perspective generation is viable in zero-shot settings, while few-shot setups often fail to capture the full spectrum of human judgments. Prompt design and demonstration selection notably affect performance, though example ordering has limited impact. These findings highlight the challenges of modeling subjectivity with LLMs and the importance of building more perspective-aware, socially intelligent models.
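As a rough illustration of how a hybrid demonstration-selection strategy of this kind might work, the sketch below combines a minimal BM25 textual-similarity score with the Shannon entropy of each candidate's disaggregated annotations, then ranks by a weighted mix of the two. The field names, the `alpha` mixing weight, and the min-max normalization are illustrative assumptions, not the paper's exact scoring function.

```python
import math
from collections import Counter

def annotation_entropy(labels):
    """Shannon entropy (bits) of a disaggregated label set, e.g. ['off', 'not', 'off']."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bm25_lite(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Minimal Okapi BM25 score of one candidate document against the query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in set(query_tokens):
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_tokens.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

def select_demonstrations(query_tokens, pool, k=4, alpha=0.5):
    """Rank candidates by alpha * similarity + (1 - alpha) * disagreement (both min-max scaled)."""
    corpus = [ex["tokens"] for ex in pool]
    sims = [bm25_lite(query_tokens, ex["tokens"], corpus) for ex in pool]
    ents = [annotation_entropy(ex["labels"]) for ex in pool]
    norm = lambda xs: [(x - min(xs)) / ((max(xs) - min(xs)) or 1) for x in xs]
    sims, ents = norm(sims), norm(ents)
    ranked = sorted(zip(pool, sims, ents),
                    key=lambda t: alpha * t[1] + (1 - alpha) * t[2],
                    reverse=True)
    return [ex for ex, _, _ in ranked[:k]]
```

Setting `alpha=1.0` reduces this to pure BM25 retrieval, while `alpha=0.0` selects only the most contested examples; a combined ranking of this shape is one way to surface exemplars that are both topically relevant and rich in annotator disagreement.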