🤖 AI Summary
Reinforcement learning policies updated with human feedback often fail to improve: reward misspecification, conflicting preferences, and scarce demonstration data can leave policies unchanged or worse, and because learned policies are opaque, users cannot easily tell whether an update genuinely helped. This paper frames "policy update evaluation" as a distinct research problem for intelligent user interfaces and proposes salient-contrast demonstrations -- showing agent behavior before and after an update in scenarios chosen to highlight informative differences -- evaluated against no-demonstration, same-context, and random-context baselines. In gridworld experiments combining RL training with controlled human studies, the approach significantly improves users' accuracy in judging policy quality, reduces unwarranted trust in feedback, and supports trust calibration across contexts.
📝 Abstract
Reinforcement learning agents are often updated with human feedback, yet such updates can be unreliable: reward misspecification, preference conflicts, or limited data may leave policies unchanged or even worse. Because policies are difficult to interpret directly, users face the challenge of deciding whether an update has truly helped. We propose that assessing model updates -- not just a single model -- is a critical design challenge for intelligent user interfaces. In a controlled study, participants provided feedback to an agent in a gridworld and then compared its original and updated policies. We evaluated four strategies for communicating updates: no demonstration, same-context, random-context, and salient-contrast demonstrations designed to highlight informative differences. Salient-contrast demonstrations significantly improved participants' ability to detect when updates helped or harmed performance, mitigated their bias towards assuming that feedback is always beneficial, and supported better trust calibration across contexts.
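One way to realize salient-contrast demonstrations is to surface the states where the original and updated policies disagree most. The sketch below is an illustrative heuristic, not the paper's exact criterion: it assumes tabular Q-values for both policies (`Q_old`, `Q_new` are hypothetical names) and ranks disagreement states by the value gap between the two chosen actions under the updated estimates.

```python
import numpy as np

def select_salient_contrast_states(Q_old, Q_new, k=3):
    """Pick up to k states where the old and new greedy policies disagree,
    ranked by how much the new policy expects its action to outperform
    the old one. Illustrative heuristic; the paper's selection rule
    for salient-contrast demonstrations may differ.

    Q_old, Q_new: (n_states, n_actions) arrays of Q-value estimates.
    """
    old_actions = Q_old.argmax(axis=1)
    new_actions = Q_new.argmax(axis=1)
    # States where the greedy action changed after the update.
    disagree = np.flatnonzero(old_actions != new_actions)
    # Expected gain of the new action over the old one, under Q_new.
    gaps = (Q_new[disagree, new_actions[disagree]]
            - Q_new[disagree, old_actions[disagree]])
    # Largest gaps first: these are the most informative contrasts to show.
    ranked = disagree[np.argsort(-gaps)]
    return ranked[:k].tolist()
```

Demonstrating the agent from these states before and after the update would make behavioral differences visible, in contrast to same-context or random-context baselines that may show identical trajectories.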