🤖 AI Summary
Existing preference alignment methods such as RLHF and DPO are fragile under systematic human preference disagreement because they optimize a single averaged objective, which invites overfitting and inflated tail risk. This work proposes DARC, a training-free, inference-time approach that, for the first time, imposes an entropy-based risk constraint during decoding. Leveraging KL-divergence-based distributionally robust optimization and risk-sensitive reranking, DARC explicitly models preference uncertainty and controls the risk premium. Experiments show that DARC reduces both preference disagreement and tail risk across multiple alignment benchmarks, and that it remains robust under high-noise, highly heterogeneous feedback without sacrificing average response quality.
📄 Abstract
Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide a theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
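The decoding rule described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`entropic_score`, `darc_rerank`), the inverse-temperature parameter `beta`, and the `premium_cap` budget are illustrative assumptions. It only assumes the standard entropic risk measure, the dual form of a KL-ball distributionally robust objective, applied to sampled preference rewards per candidate:

```python
import math

def entropic_score(rewards, beta):
    """KL-robust (entropic) satisfaction of a candidate given sampled
    preference rewards: -(1/beta) * log(mean(exp(-beta * r))).
    As beta -> 0 this recovers the plain mean; larger beta is more
    pessimistic toward candidates whose rewards disagree."""
    # log-sum-exp stabilization to avoid overflow for large beta * r
    m = max(-beta * r for r in rewards)
    log_mean = m + math.log(
        sum(math.exp(-beta * r - m) for r in rewards) / len(rewards)
    )
    return -log_mean / beta

def darc_rerank(candidates, beta=1.0, premium_cap=None):
    """Rerank candidates, each given as (text, sampled_rewards), by their
    entropic score. Optionally enforce a risk budget: drop candidates whose
    entropic risk premium (mean minus entropic score, always >= 0, growing
    with preference disagreement) exceeds premium_cap."""
    scored = []
    for text, rewards in candidates:
        mean_r = sum(rewards) / len(rewards)
        ent = entropic_score(rewards, beta)
        premium = mean_r - ent
        if premium_cap is None or premium <= premium_cap:
            scored.append((ent, premium, text))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored

# Hypothetical usage: "risky" has the higher mean reward (1.0 vs. 0.9) but
# its samples disagree, so the robust score prefers "safe", and a tight
# premium cap filters "risky" out entirely.
candidates = [("safe", [0.9, 0.9, 0.9]), ("risky", [0.0, 2.0])]
print(darc_rerank(candidates, beta=1.0))
print(darc_rerank(candidates, beta=1.0, premium_cap=0.1))
```

Note the design choice this sketch reflects: the risk premium is controlled at deployment time by two scalars (`beta`, `premium_cap`) rather than by retraining, which is what makes the risk budget adjustable per deployment.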