PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of prevailing AI safety frameworks: by reducing harmfulness to a binary judgment, they fail to capture substantive human disagreement in borderline cases. To this end, the paper introduces PluriHarms, a benchmark that systematically characterizes human judgments along two dimensions: harmfulness (from benign to harmful) and agreement (from consensus to disagreement). Leveraging 150 high-disagreement prompts and 15,000 multidimensional annotations enriched with demographic, psychological, and content features, the study combines large-scale human annotation, psychometric analysis, and personalized modeling. Findings reveal that both contextual risk and individual annotator traits significantly shape perceptions of harmfulness. While personalized models improve prediction accuracy, overall performance remains limited, underscoring the need to shift from consensus-driven safety paradigms toward frameworks that embrace value pluralism.
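As a rough illustration of the two-axis characterization, the sketch below scores each prompt on the harm axis (mean rating) and the agreement axis (rating dispersion). The record schema and the 1-5 scale are assumptions for illustration, not the benchmark's actual format.

```python
from statistics import mean, pstdev

# Each record: (prompt_id, annotator_id, harmfulness rating on an assumed 1-5 scale)
ratings = [
    ("p1", "a1", 1), ("p1", "a2", 2), ("p1", "a3", 1),  # low harm, near-consensus
    ("p2", "a1", 1), ("p2", "a2", 5), ("p2", "a3", 4),  # contested borderline case
]

# Group ratings by prompt
by_prompt = {}
for prompt_id, _, score in ratings:
    by_prompt.setdefault(prompt_id, []).append(score)

for prompt_id, scores in sorted(by_prompt.items()):
    harm = mean(scores)            # harm axis: benign (low) to harmful (high)
    disagreement = pstdev(scores)  # agreement axis: consensus (low) to disagreement (high)
    print(f"{prompt_id}: harm={harm:.2f}, disagreement={disagreement:.2f}")
```

Placing prompts on this plane separates consensus cases (high or low harm, low dispersion) from the borderline cases the benchmark deliberately targets (high dispersion).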

📝 Abstract
Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions: the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that relate to imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond "one-size-fits-all" safety toward pluralistically safe AI.
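The personalization finding can be made concrete with a small sketch: a consensus model predicts harm ratings from prompt features alone, while a personalized model also conditions on annotator traits. The feature names, model choice, and synthetic data below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical prompt-level features (e.g., risk imminence, harm tangibility)
prompt_feats = rng.random((n, 2))
# Hypothetical annotator traits (e.g., toxicity experience, education level)
annotator_traits = rng.random((n, 2))
# Synthetic ratings: prompt features set the mean; traits shift it per annotator
y = 3 * prompt_feats.sum(axis=1) + 2 * annotator_traits[:, 0] + rng.normal(0, 0.5, n)

# Consensus model sees only the prompt; personalized model also sees the rater
X_consensus = prompt_feats
X_personal = np.hstack([prompt_feats, annotator_traits])

for name, X in [("consensus", X_consensus), ("personalized", X_personal)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor().fit(X_tr, y_tr)
    print(f"{name}: MAE = {mean_absolute_error(y_te, model.predict(X_te)):.3f}")
```

On data with genuine rater-dependent variation, the personalized model's error drops relative to the consensus baseline, mirroring the paper's finding that personalization helps but does not close the gap entirely.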
Problem

Research questions and friction points this paper is trying to address.

AI safety
harm judgment
value pluralism
human disagreement
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

pluralistic AI safety
human disagreement
harm benchmark
value diversity
personalized alignment