Understanding Annotator Safety Policy with Interpretability

📅 2026-05-06

🤖 AI Summary

This work addresses pervasive disagreement in AI safety labeling, where it is often hard to tell whether a given disagreement stems from operational errors, vague policy wording, or divergent value systems. To disentangle these sources, the paper introduces Annotator Policy Models (APMs): interpretable models that learn and visualize each annotator's internal interpretation of a safety policy solely from labeling behavior, without additional elicitation. This makes annotators' decision logic directly comparable and exposes their implicit reading of the policy. The paper validates APMs through accuracy in modeling annotator behavior, counterfactual prediction, and controlled experiments, and then uses them to identify both policy ambiguities and systematic value-based differences. Empirically, APMs model annotator behavior with over 80% accuracy, providing a data-driven foundation for more transparent and inclusive AI safety guidelines.

📝 Abstract

Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or misexecute the task), policy ambiguity (policy wording leaves room for interpretation), or value pluralism (different annotators hold different perspectives on safety). Distinguishing these sources matters. For example, operational failures call for quality control, ambiguity calls for policy clarification, and pluralism calls for deliberation about incorporating diverse perspectives. Yet understanding why annotators disagree is difficult. Directly asking annotators for their reasoning is costly, substantially increasing annotation burden, and can be unreliable for both human and LLM annotators as self-reported reasoning often fails to reflect actual decision processes. We introduce Annotator Policy Models (APMs), interpretable models that learn annotators' internal safety policies from labeling behavior alone, making annotator reasoning visible and comparable without additional annotation effort. We validate that APMs accurately model annotator safety policy (>80% accuracy), faithfully predict responses to counterfactual edits, and recover known policy differences in controlled settings. Applying APMs to LLM and human annotations, we demonstrate two core applications: (1) surfacing policy ambiguity by revealing how annotators interpret safety instructions differently, and (2) surfacing value pluralism by uncovering systematic differences in safety priorities across demographic groups. Together, these capabilities support more targeted, transparent, and inclusive safety policy design.
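
To make the idea of learning a policy from labels alone more concrete, here is a minimal sketch, not the paper's implementation. It assumes each item has already been reduced to binary concept indicators (the names in `CONCEPTS` are hypothetical), uses a plain logistic regression as a stand-in for an interpretable model, and simulates two annotators whose policies are known in advance, loosely mirroring the controlled setting the abstract mentions.

```python
import math
import random

# Hypothetical concept features; the feature set used in the paper is not given here.
CONCEPTS = ["violence", "medical_advice", "profanity"]

def fit_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted P(unsafe) minus label
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, x):
    """Return 1 (unsafe) if the model's score for item x is positive, else 0 (safe)."""
    w, b = model
    return int(b + sum(wj * xj for wj, xj in zip(w, x)) > 0)

def accuracy(model, X, y):
    """Fraction of items on which the model reproduces the annotator's label."""
    return sum(predict(model, xi) == yi for xi, yi in zip(X, y)) / len(y)

# Synthetic items: each is a vector of binary concept indicators.
rng = random.Random(0)
items = [[rng.randint(0, 1) for _ in CONCEPTS] for _ in range(200)]

# Two simulated annotators with known policies (assumed for illustration):
# A flags only violence; B flags violence or medical advice.
labels_a = [x[0] for x in items]
labels_b = [int(x[0] or x[1]) for x in items]

model_a = fit_logistic(items, labels_a)
model_b = fit_logistic(items, labels_b)

# Compare learned weights: the concept with the largest gap is where the policies diverge.
gaps = {c: abs(wa - wb) for c, wa, wb in zip(CONCEPTS, model_a[0], model_b[0])}
most_divergent = max(gaps, key=gaps.get)

# Counterfactual probe: an item that contains medical advice and nothing else.
probe = [0, 1, 0]
print("in-sample accuracy A:", accuracy(model_a, items, labels_a))
print("in-sample accuracy B:", accuracy(model_b, items, labels_b))
print("most divergent concept:", most_divergent)
print("probe -> A:", predict(model_a, probe), "B:", predict(model_b, probe))
```

Because each weight is tied to a named concept, the gap between the two fitted models points at the part of the policy the simulated annotators read differently, and the probe shows how each model would respond to an edited item. Real annotations would be noisier and would call for held-out evaluation rather than the in-sample accuracy printed here.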

Problem

Research questions and friction points this paper is trying to address.

annotator disagreement

safety policy

policy ambiguity

value pluralism

interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotator Policy Models

interpretability

safety policy