🤖 AI Summary
Existing preference alignment methods such as RLHF and DPO are fragile under systematic human preference disagreement because they optimize a single averaged objective, which invites overfitting and inflated tail risk. This work proposes DARC, a training-free, inference-time approach that, for the first time, imposes an entropy-based risk constraint during decoding. Leveraging KL-divergence-based distributionally robust optimization and risk-sensitive reranking, DARC explicitly models preference uncertainty and controls the risk premium. Experiments show that DARC reduces both preference disagreement and tail risk across multiple alignment benchmarks, and that it remains robust under high-noise, highly heterogeneous feedback without sacrificing average response quality.
📄 Abstract
Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide a theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
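The decoding rule described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`entropic_score`, `darc_rerank`), the inverse-temperature parameter `beta`, and the `premium_cap` budget are illustrative assumptions. It only assumes the standard entropic risk measure, the dual form of a KL-ball distributionally robust objective, applied to sampled preference rewards per candidate:

```python
import math

def entropic_score(rewards, beta):
    """KL-robust (entropic) satisfaction of a candidate given sampled
    preference rewards: -(1/beta) * log(mean(exp(-beta * r))).
    As beta -> 0 this recovers the plain mean; larger beta is more
    pessimistic toward candidates whose rewards disagree."""
    # log-sum-exp stabilization to avoid overflow for large beta * r
    m = max(-beta * r for r in rewards)
    log_mean = m + math.log(
        sum(math.exp(-beta * r - m) for r in rewards) / len(rewards)
    )
    return -log_mean / beta

def darc_rerank(candidates, beta=1.0, premium_cap=None):
    """Rerank candidates, each given as (text, sampled_rewards), by their
    entropic score. Optionally enforce a risk budget: drop candidates whose
    entropic risk premium (mean minus entropic score, always >= 0, growing
    with preference disagreement) exceeds premium_cap."""
    scored = []
    for text, rewards in candidates:
        mean_r = sum(rewards) / len(rewards)
        ent = entropic_score(rewards, beta)
        premium = mean_r - ent
        if premium_cap is None or premium <= premium_cap:
            scored.append((ent, premium, text))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored

# Hypothetical usage: "risky" has the higher mean reward (1.0 vs. 0.9) but
# its samples disagree, so the robust score prefers "safe", and a tight
# premium cap filters "risky" out entirely.
candidates = [("safe", [0.9, 0.9, 0.9]), ("risky", [0.0, 2.0])]
print(darc_rerank(candidates, beta=1.0))
print(darc_rerank(candidates, beta=1.0, premium_cap=0.1))
```

Note the design choice this sketch reflects: the risk premium is controlled at deployment time by two scalars (`beta`, `premium_cap`) rather than by retraining, which is what makes the risk budget adjustable per deployment.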