🤖 AI Summary
This paper addresses the AI shutdown problem: the safety risk that arises when an AI system resists deactivation. We formalize it for the first time as an incomplete-information signaling game, in which a boundedly rational human communicates preferences via costly signals and the AI selects actions under utility uncertainty and multidimensional incommensurable preferences. Our methodology integrates game-theoretic modeling, bounded rationality theory, analysis of utility incommensurability, and empirical machine learning simulations. Our contributions are threefold: (1) we prove that a necessary condition for the AI to refrain from undermining the shutdown mechanism is its uncertainty about the human's true utility function; (2) we identify how signal cost and human cognitive limitations shape equilibrium shutdown strategies; and (3) we extend the model to multidimensional incommensurable preferences, establishing a theoretical foundation for designing verifiable shutdown protocols.
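To make the uncertainty condition in contribution (1) concrete, here is a minimal sketch in the spirit of the classic off-switch game of Hadfield-Menell et al.; the belief distributions and payoff structure are illustrative assumptions, not the paper's exact signaling model. It compares the AI's expected payoff for acting immediately, disabling the off-switch, and deferring to a human who is assumed to rationally press the switch exactly when the proposed action harms her.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_payoffs(utility_samples):
    """Monte-Carlo payoffs for the three AI options in a toy off-switch game.

    utility_samples: draws from the AI's belief over the human's utility U
    for the proposed action (a hypothetical belief, not the paper's setup).
    """
    u = np.asarray(utility_samples, dtype=float)
    act     = u.mean()                 # act immediately, ignoring the human
    disable = u.mean()                 # disable the off-switch, then act
    # defer: a rational human lets the action through iff U > 0,
    # otherwise presses the off-switch (payoff 0)
    defer   = np.maximum(u, 0.0).mean()
    return {"act": act, "disable": disable, "defer": defer}

# Uncertain AI: a belief U ~ N(0.5, 1) puts mass on U < 0, so
# deferring strictly beats disabling the off-switch.
print(expected_payoffs(rng.normal(0.5, 1.0, 100_000)))

# Certain AI: under a point belief U = 0.5 deference has no value,
# so the AI is indifferent about keeping the off-switch at all.
print(expected_payoffs(np.full(100_000, 0.5)))
```

Deferring strictly dominates disabling the switch only when the AI's belief places mass on negative utilities; under a point belief the advantage vanishes, which is the sense in which uncertainty about the human's utility is necessary for the AI to preserve the shutdown mechanism.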
📝 Abstract
The off-switch problem is a critical challenge in AI control: an AI system that resists being switched off poses a significant safety risk. In this paper, we model the off-switch problem as a signalling game in which a human decision-maker communicates their preferences about some underlying decision problem to an AI agent, which then selects actions to maximise the human's utility. We assume that the human is a boundedly rational agent and explore several bounded-rationality mechanisms. Using real machine learning models, we reprove prior results and demonstrate that a necessary condition for an AI system to refrain from disabling its off-switch is its uncertainty about the human's utility. We also analyse how message costs influence optimal strategies and extend the analysis to scenarios involving incomparability.
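The effect of message costs can be illustrated with a hedged two-type sketch; the type space {+1, -1}, the unit cost threshold, and the prior below are hypothetical simplifications, not the paper's signalling game. It shows how raising the cost of a "stop" message pushes the equilibrium from separating (shutdown requests are informative and obeyed) to pooling (the AI ignores the channel and acts on its prior).

```python
# Hypothetical two-type costly-signalling sketch (not the paper's model).
# The human's utility for the AI's proposed action is theta in {+1, -1};
# sending "stop" costs c > 0, silence is free, and the AI aborts on "stop".

def stop_is_worth_sending(theta: float, c: float) -> bool:
    # Stopping the action saves a theta = -1 human a loss of 1,
    # so paying the message cost is rational only while c < 1.
    return theta < 0 and c < 1.0

def equilibrium_outcome(c: float, prior_good: float = 0.6) -> str:
    if stop_is_worth_sending(-1.0, c):
        # Separating equilibrium: only harmed types pay the cost, so
        # "stop" is fully informative and the AI complies with it.
        return "separating: AI halts exactly when theta = -1"
    # Pooling: no type signals, so the AI ignores messages and acts
    # iff its prior expected utility prior_good - (1 - prior_good) > 0.
    ai_acts = prior_good - (1.0 - prior_good) > 0
    return f"pooling: AI ignores messages and {'acts' if ai_acts else 'halts'} on its prior"

for c in (0.2, 0.9, 1.5):
    print(f"message cost c = {c}: {equilibrium_outcome(c)}")
```

Once the cost exceeds the human's stake in stopping the action, no type signals, the message loses its informational content, and the shutdown channel effectively disappears.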