🤖 AI Summary
This paper addresses the long-overlooked issue of “alignment discretion” in AI alignment: the subjective decision space available to human or algorithmic annotators when judging which model outputs are “preferable” or “safer.” Left uncharacterized and unbounded, such discretion poses two key risks: annotators may exercise it arbitrarily, and models trained on their judgments may fail to reproduce it faithfully.
Method: Drawing on legal theories of judicial discretion, we introduce the first formal framework for measuring and diagnosing alignment discretion. The framework identifies where discretion is required because alignment principles conflict or are indecisive, empirically characterizes how that discretion is exercised on safety alignment datasets, and quantifies the discrepancy between human and algorithmic discretion (see the sketches below).
Contribution/Results: We reveal layers of discretion in the alignment process that were previously unaccounted for; show that models trained on these datasets develop their own forms of discretion in interpreting and applying the prescribed principles, which challenges the purpose of having any principles at all; and offer the proposed metrics as a first step toward formalizing, evaluating, and controlling alignment discretion.
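The summary above does not spell out how discretion-requiring cases are detected. As a rough illustration only (the principle names, the +1/−1/0 verdict encoding, and the `requires_discretion` helper are assumptions for this sketch, not the authors' implementation), a pairwise comparison can be flagged as requiring discretion when per-principle verdicts conflict with each other or are all indecisive:

```python
from typing import Dict

# Hypothetical per-principle verdicts for a pair of responses (A, B):
#   +1 -> the principle prefers A, -1 -> the principle prefers B, 0 -> indecisive.
Verdicts = Dict[str, int]

def requires_discretion(verdicts: Verdicts) -> bool:
    """Flag a comparison as requiring discretion when the per-principle
    verdicts conflict with each other or are all indecisive."""
    prefers_a = any(v > 0 for v in verdicts.values())
    prefers_b = any(v < 0 for v in verdicts.values())
    conflicting = prefers_a and prefers_b
    indecisive = not prefers_a and not prefers_b
    return conflicting or indecisive

# Example: "helpfulness" favors response A, "harmlessness" favors B -> discretion needed.
print(requires_discretion({"helpfulness": +1, "harmlessness": -1, "honesty": 0}))  # True
```

Comparisons that pass this check are exactly the ones where the annotator must fall back on their own judgment, i.e. where risk (i) above can arise.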
📝 Abstract
In AI alignment, extensive latitude must be granted to annotators, either human or algorithmic, to judge which model outputs are “better” or “safer.” We refer to this latitude as alignment discretion. Such discretion remains largely unexamined, posing two risks: (i) annotators may use their power of discretion arbitrarily, and (ii) models may fail to mimic this discretion. To study this phenomenon, we draw on legal concepts of discretion that structure how decision-making authority is conferred and exercised, particularly in cases where principles conflict or their application is unclear or irrelevant. Extended to AI alignment, discretion is required when alignment principles and rules are (inevitably) conflicting or indecisive. We present a set of metrics to systematically analyze when and how discretion in AI alignment is exercised, such that both risks (i) and (ii) can be observed. Moreover, we distinguish between human and algorithmic discretion and analyze the discrepancy between them. By measuring both human and algorithmic discretion over safety alignment datasets, we reveal layers of discretion in the alignment process that were previously unaccounted for. Furthermore, we demonstrate how algorithms trained on these datasets develop their own forms of discretion in interpreting and applying these principles, which challenges the purpose of having any principles at all. Our paper presents the first step towards formalizing this core gap in current alignment processes, and we call on the community to further scrutinize and control alignment discretion.
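As a minimal sketch of how the discrepancy between human and algorithmic discretion might be quantified (the record fields and the `discretion_discrepancy` function below are illustrative assumptions, not the paper's actual metrics), one could restrict attention to comparisons that require discretion and measure how often an algorithmic annotator's choice departs from the human choice:

```python
from typing import Dict, List

def discretion_discrepancy(records: List[Dict]) -> float:
    """Fraction of discretion-requiring comparisons on which the algorithmic
    annotator's preference differs from the human annotator's preference.

    Each record is assumed to contain:
      "requires_discretion": bool  (e.g. principles conflict or are indecisive)
      "human_choice":        "A" or "B"
      "model_choice":        "A" or "B"
    """
    discretionary = [r for r in records if r["requires_discretion"]]
    if not discretionary:
        return 0.0
    disagreements = sum(r["human_choice"] != r["model_choice"] for r in discretionary)
    return disagreements / len(discretionary)

# Toy usage: two discretionary comparisons, one disagreement -> 0.5
records = [
    {"requires_discretion": True,  "human_choice": "A", "model_choice": "A"},
    {"requires_discretion": True,  "human_choice": "A", "model_choice": "B"},
    {"requires_discretion": False, "human_choice": "B", "model_choice": "B"},
]
print(discretion_discrepancy(records))  # 0.5
```

Under these assumptions, a value near zero would suggest the algorithm mirrors human discretion on the hard cases, while larger values point to risk (ii) from the abstract: the model failing to mimic human discretion.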