DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

📅 2026-03-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing preference alignment methods, such as RLHF and DPO, exhibit fragility in the presence of systematic human preference disagreements, as they optimize a single averaged objective, often leading to overfitting and increased tail risk. This work proposes DARC, a training-free, inference-time approach that, for the first time, incorporates an entropy-based risk constraint during decoding. By leveraging KL-divergence-driven distributionally robust optimization and risk-sensitive reranking, DARC explicitly models preference uncertainty and controls risk premiums. Experimental results demonstrate that DARC effectively reduces both preference disagreement and tail risk across multiple alignment benchmarks, maintaining robustness under high-noise and highly heterogeneous feedback conditions without compromising average response quality.
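
For intuition, the "entropy-based risk constraint" and "KL-divergence-driven" robustness referred to above are conventionally connected through the entropic-risk identity below. The notation ($\rho$, $r$, $P$, $Q$) is illustrative and not taken from the paper: $P$ is the distribution of a candidate's preference score $r(x, y)$ across annotators or reward samples, and $\rho > 0$ sets the degree of pessimism.

$$
V_{\rho}(y \mid x) \;=\; -\frac{1}{\rho}\,\log \mathbb{E}_{r \sim P}\!\left[ e^{-\rho\, r(x, y)} \right]
\;=\; \min_{Q} \left\{ \mathbb{E}_{Q}\!\left[ r(x, y) \right] + \frac{1}{\rho}\,\mathrm{KL}\!\left( Q \,\Vert\, P \right) \right\},
$$

so reranking by $V_{\rho}$ while bounding the risk premium $\mathbb{E}_{P}[r] - V_{\rho}$ penalizes candidates most where disagreement is largest.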

πŸ“ Abstract
Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide a theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
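
As a rough illustration of the reranking rule described above, the sketch below scores each candidate by its entropic (KL-robust) value over sampled preference scores and enforces a cap on the risk premium. This is a minimal sketch under assumed interfaces, not the paper's implementation; `entropic_value`, `darc_rerank`, `rho`, and `premium_cap` are hypothetical names.

```python
import numpy as np

def entropic_value(scores, rho):
    """Entropic (KL-robust) value: -(1/rho) * log E[exp(-rho * r)].
    Always <= the mean; more pessimistic as rho grows or disagreement widens."""
    x = -rho * np.asarray(scores, dtype=float)
    m = np.max(x)
    log_mean_exp = m + np.log(np.mean(np.exp(x - m)))  # numerically stable
    return -log_mean_exp / rho

def darc_rerank(candidate_scores, rho=1.0, premium_cap=None):
    """Return the index of the candidate with the highest robust value.

    candidate_scores: one 1-D array per candidate, holding sampled preference
    scores (e.g. one per annotator or per reward-model draw).
    premium_cap: optional budget on the risk premium (mean - robust value);
    candidates exceeding the budget are filtered out before selection.
    """
    stats = []
    for idx, s in enumerate(candidate_scores):
        mean = float(np.mean(s))
        robust = entropic_value(s, rho)
        stats.append((idx, mean, robust, mean - robust))  # premium >= 0
    pool = [t for t in stats if premium_cap is None or t[3] <= premium_cap]
    if not pool:  # if every candidate violates the budget, keep the least risky
        pool = [min(stats, key=lambda t: t[3])]
    return max(pool, key=lambda t: t[2])[0]

# Toy usage: high-mean/high-disagreement vs. slightly lower-mean/low-disagreement.
candidates = [
    np.array([0.90, 0.10, 0.80, 0.20, 0.90]),
    np.array([0.60, 0.55, 0.65, 0.60, 0.58]),
]
print(darc_rerank(candidates, rho=5.0, premium_cap=0.10))  # -> 1
```

With a plain mean-reward criterion both candidates look similar; the risk-premium cap and the entropic value both favor the candidate annotators agree on, which is the behavior the abstract describes.
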
Problem

Research questions and friction points this paper is trying to address.

preference alignment
human preference disagreement
mean-reward maximization
proxy over-optimization
heterogeneous feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

risk-constrained decoding
distributionally robust optimization
preference disagreement
entropic risk
alignment without retraining
🔎 Similar Papers
No similar papers found.
Authors

Mingxi Zou
Fudan University, Shanghai, China

Jiaxiang Chen
Fudan University

Junfan Li
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China

Langzhang Liang
Fudan University, Shanghai, China

Qifan Wang
Research Scientist, Meta AI
Natural Language Processing, Large Language Models, Information Retrieval, Deep Learning, Data Mining

Xu Yinghui
Fudan University, Shanghai, China

Zenglin Xu
Fudan University
Machine Learning, Trustworthy AI, Federated Learning, Large Language Models, Time Series Analysis