Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of existing label-modeling approaches: by relying on majority labels that obscure viewpoint diversity, they overlook legitimate disagreements among annotators arising from differences in social identity and lived experience. To remedy this, the authors propose DiADEM, a novel neural architecture that introduces, for the first time, a learnable demographic importance vector α. DiADEM explicitly models the relationship between annotator demographics and their labeling distributions through demographic-aware projection encoding, complementary concatenation, and Hadamard-product interaction fusion. Furthermore, it incorporates an item-level disagreement-aware loss function that directly optimizes the predicted annotation variance. Evaluated on the DICES and VOICED benchmarks, DiADEM significantly outperforms both large language models and neural baselines (achieving a disagreement-tracking correlation of r = 0.75 on DICES) and reveals race and age as key drivers of divergence in subjective judgments.
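The summary names three fusion ingredients: α-weighted per-demographic projections, complementary concatenation, and a Hadamard (elementwise) product between annotator and item representations. The paper's code is not shown here, so the following is only a minimal pure-Python sketch of that fusion pattern; the function names (`fuse`, `project`), the identity projections, and the choice of softmax to normalize α are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a DiADEM-style fusion step (assumed structure).
import math

def softmax(xs):
    # Numerically stable softmax over the alpha logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def project(vec, weights):
    # Simple linear projection: `weights` is a list of rows.
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def fuse(demo_embeds, alpha_logits, projections, item_embed):
    # 1. Normalize the learnable importance vector alpha over demographic axes.
    alpha = softmax(alpha_logits)
    # 2. Project each demographic embedding and sum, scaled by its alpha weight.
    annot = [0.0] * len(item_embed)
    for a, emb, w in zip(alpha, demo_embeds, projections):
        proj = project(emb, w)
        annot = [x + a * p for x, p in zip(annot, proj)]
    # 3. Complementary concatenation plus Hadamard interaction with the item.
    hadamard = [u * v for u, v in zip(annot, item_embed)]
    return annot + item_embed + hadamard
```

With two demographic axes, identity projections, and uniform α, the fused vector is just the averaged annotator embedding, the item embedding, and their elementwise product concatenated; in the actual model all of these components would be learned.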
📝 Abstract
When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators' social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns "how much each demographic axis matters" for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector $\boldsymbol{\alpha}$, fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural-model baselines across standard and perspectivist metrics, achieving strong disagreement tracking ($r{=}0.75$ on DICES). The learned $\boldsymbol{\alpha}$ weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are, not just what they label, is essential for NLP systems that aim to faithfully represent human interpretive diversity.
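The abstract says the item-level disagreement loss "directly penalizes mispredicted annotation variance". The exact loss is not given here, so the sketch below assumes the simplest reading: a squared-error penalty between the predicted per-item label variance and the empirical variance of the observed annotations. The function name and the binary-label setting are assumptions for illustration.

```python
# Illustrative item-level disagreement loss (assumed squared-error form).

def variance(xs):
    # Population variance of a list of numbers.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def disagreement_loss(pred_probs_per_item, gold_labels_per_item):
    """Mean squared gap between predicted and observed per-item variance.

    pred_probs_per_item: per item, one predicted positive-class probability
        per annotator.
    gold_labels_per_item: per item, the observed binary labels (0/1).
    """
    total = 0.0
    for preds, golds in zip(pred_probs_per_item, gold_labels_per_item):
        total += (variance(preds) - variance(golds)) ** 2
    return total / len(pred_probs_per_item)
```

Under this formulation the loss is zero only when the model's predicted annotation spread matches the observed spread on every item, which is exactly the disagreement-tracking behavior the $r{=}0.75$ result measures.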
Problem

Research questions and friction points this paper is trying to address.

annotator disagreement
demographic factors
subjective labeling
interpretive diversity
annotation variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

annotator disagreement
demographic importance weighting
perspectivist modeling
neural architecture
disagreement loss