AI Summary
In RLHF, misalignment between human preference expressions and algorithmic modeling assumptions (e.g., the Bradley-Terry model) leads to inaccurate reward functions. Method: We propose "Human Modeling Consistency Intervention," a novel paradigm that applies three controllable behavioral interventions (optimized information presentation, model feedback guidance, and question structure reconstruction) to improve the fit of preference data to standard preference models without altering underlying human preferences. Contribution/Results: Across three empirical studies, including human-subject experiments, preference modeling analysis, and RLHF protocol evaluation, the interventions significantly enhanced model consistency (p < 0.01). The resulting reward functions demonstrated robust improvements in alignment quality and cross-task generalization. This work is the first to systematically integrate human-centered design principles into the RLHF data collection pipeline, establishing a scalable methodological foundation for trustworthy preference learning.
Abstract
Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of the learned reward functions. Overall, we establish a novel research direction in model alignment: designing interfaces and training interventions to increase human conformance with the modeling assumptions of the algorithm that will learn from their input.
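For readers unfamiliar with the Bradley-Terry model named in the summary, a minimal sketch of how it turns two scalar returns into a preference probability follows. The function name and the use of plain scalar returns are our illustrative assumptions, not the paper's notation:

```python
import math

def bradley_terry_preference(return_a: float, return_b: float) -> float:
    """P(human prefers A over B) = exp(r_A) / (exp(r_A) + exp(r_B)).

    This is the standard Bradley-Terry assumption used by many RLHF
    reward-learning algorithms: preferences are a logistic function of
    the difference in (unobserved) returns.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(return_a, return_b)
    ea = math.exp(return_a - m)
    eb = math.exp(return_b - m)
    return ea / (ea + eb)

# Equal returns imply indifference: probability 0.5.
print(bradley_terry_preference(1.0, 1.0))  # 0.5
```

If real annotators deviate from this logistic form (for instance, by preferring deterministically above some return gap), a reward model fit under this assumption will be biased, which is the mismatch the paper's interventions aim to reduce.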