🤖 AI Summary
In RLHF, misalignment between human preference expressions and algorithmic modeling assumptions—e.g., the Bradley–Terry model—leads to inaccurate reward functions. Method: We propose “Human Modeling Consistency Intervention,” a novel paradigm that applies three controllable behavioral interventions—optimized information presentation, model feedback guidance, and question structure reconstruction—to improve the fit of preference data to standard preference models without altering underlying human preferences. Contribution/Results: Across three empirical studies—including human-subject experiments, preference modeling analysis, and RLHF protocol evaluation—interventions significantly enhanced model consistency (p < 0.01). The resulting reward functions demonstrated robust improvements in alignment quality and cross-task generalization. This work is the first to systematically integrate human-centered design principles into the RLHF data collection pipeline, establishing a scalable methodological foundation for trustworthy preference learning.
📝 Abstract
Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to asses whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of the learned reward functions. Overall we establish a novel research direction in model alignment: designing interfaces and training interventions to increase human conformance with the modeling assumptions of the algorithm that will learn from their input.