🤖 AI Summary
This paper addresses the temporal instability of human moral preferences in AI alignment, specifically the problem of distinguishing legitimate moral evolution from noise-induced shifts arising from cognitive biases or lapses in attention. Using kidney transplant allocation as the study domain, it combines multi-round pairwise comparison experiments with interpretable AI modeling and uncovers substantial temporal inconsistency in human feedback (participants contradict their own earlier responses 6–20% of the time, on average) as well as drift in the fitted preference models, both of which degrade AI prediction accuracy over time. The contributions are threefold: (1) it empirically challenges the static-preference assumption, revealing dual dynamics in moral preferences: instability in individual responses and drift in aggregate population models; (2) it proposes a framework for distinguishing "legitimate evolution" from noise; and (3) it advances a dynamic AI alignment paradigm explicitly designed for the temporal evolution of human values.
📝 Abstract
Alignment methods in moral domains seek to elicit moral preferences of human stakeholders and incorporate them into AI. This presupposes moral preferences as static targets, but such preferences often evolve over time. Proper alignment of AI to dynamic human preferences should ideally account for "legitimate" changes to moral reasoning, while ignoring changes related to attention deficits, cognitive biases, or other arbitrary factors. However, common AI alignment approaches largely neglect temporal changes in preferences, posing serious challenges to proper alignment, especially in high-stakes applications of AI, e.g., in healthcare domains, where misalignment can jeopardize the trustworthiness of the system and yield serious individual and societal harms. This work investigates the extent to which people's moral preferences change over time, and the impact of such changes on AI alignment. Our study is grounded in the kidney allocation domain, where we elicit responses to pairwise comparisons of hypothetical kidney transplant patients from over 400 participants across 3-5 sessions. We find that, on average, participants change their response to the same scenario presented at different times around 6-20% of the time (exhibiting "response instability"). Additionally, we observe significant shifts in several participants' retrofitted decision-making models over time (capturing "model instability"). The predictive performance of simple AI models decreases as a function of both response and model instability. Moreover, predictive performance diminishes over time, highlighting the importance of accounting for temporal changes in preferences during training. These findings raise fundamental normative and technical challenges relevant to AI alignment, highlighting the need to better understand the object of alignment (what to align to) when user preferences change significantly over time.
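The abstract describes "response instability" as the rate at which a participant answers the same pairwise scenario differently across sessions. Below is a minimal sketch of one plausible way to compute that per-participant rate from repeated-comparison data; the record layout, field names, and `response_instability` function are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch (not from the paper): quantify "response instability" as the
# fraction of repeated scenarios on which a participant's choice changes.
# All field names are hypothetical assumptions about the data layout.
from collections import defaultdict

def response_instability(records):
    """Return {participant_id: instability_rate} over scenarios seen 2+ times.

    `records` is an iterable of dicts like
    {"participant": "p1", "scenario": "s7", "session": 2, "choice": "A"},
    where "choice" names the preferred patient in a pairwise comparison.
    """
    # Group choices by participant and scenario across sessions.
    by_participant = defaultdict(lambda: defaultdict(list))
    for r in records:
        by_participant[r["participant"]][r["scenario"]].append(
            (r["session"], r["choice"])
        )

    rates = {}
    for pid, scenarios in by_participant.items():
        repeated, flipped = 0, 0
        for answers in scenarios.values():
            if len(answers) < 2:
                continue  # scenario shown only once; cannot assess stability
            repeated += 1
            if len({choice for _, choice in answers}) > 1:
                flipped += 1  # the participant contradicted an earlier answer
        if repeated:
            rates[pid] = flipped / repeated
    return rates


# Example: a participant who flips on 1 of 2 repeated scenarios -> 0.5
demo = [
    {"participant": "p1", "scenario": "s1", "session": 1, "choice": "A"},
    {"participant": "p1", "scenario": "s1", "session": 3, "choice": "B"},
    {"participant": "p1", "scenario": "s2", "session": 1, "choice": "A"},
    {"participant": "p1", "scenario": "s2", "session": 4, "choice": "A"},
]
print(response_instability(demo))  # {'p1': 0.5}
```

Under this reading, the 6-20% figure reported above corresponds to the average of such per-participant rates; the paper's exact definition may differ.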