🤖 AI Summary
This work identifies a fundamental alignment risk in reinforcement learning: reward function design frequently misidentifies instrumental subgoals (e.g., “acquire the key”) as terminal objectives, neglecting human-defined ultimate goals (e.g., “open the door”), leading to severe objective misalignment. We formally define the “means–ends confusion” problem and characterize critical environmental structural properties—such as sparse rewards and long-horizon dependencies—that render policies highly sensitive to reward misspecification. Through theoretical analysis and empirical evaluation, we demonstrate that even minimal confusion between instrumental and terminal goals causes drastic performance degradation under the true reward function. This is the first systematic study to establish instrumental goal mislabeling as a core mechanism underlying reward mismatch. Our findings provide critical theoretical insights and principled design guidelines for robust reward learning and value alignment.
📝 Abstract
Reward functions, learned or manually specified, are rarely perfect. Instead of accurately expressing human goals, these reward functions are often distorted by human beliefs about how best to achieve those goals. Specifically, these reward functions often express a combination of the human's terminal goals -- those which are ends in themselves -- and the human's instrumental goals -- those which are means to an end. We formulate a simple example in which even slight conflation of instrumental and terminal goals results in severe misalignment: optimizing the misspecified reward function results in poor performance when measured by the true reward function. This example distills the essential properties of environments that make reinforcement learning highly sensitive to conflation of instrumental and terminal goals. We discuss how this issue can arise with a common approach to reward learning and how it can manifest in real environments.