🤖 AI Summary
This paper addresses the AI shutdown problem: the safety risk that arises when an AI system resists deactivation. We formalize it for the first time as an incomplete-information signaling game, in which a boundedly rational human communicates preferences via costly signals and the AI selects actions under utility uncertainty and multidimensional incommensurable preferences. Our methodology integrates game-theoretic modeling, bounded rationality theory, analysis of utility incommensurability, and empirical machine learning simulations. Our contributions are threefold: (1) we prove that a necessary condition for the AI to refrain from undermining the shutdown mechanism is its uncertainty about the human's true utility function; (2) we identify how signal cost and human cognitive limitations shape equilibrium shutdown strategies; and (3) we extend the model to multidimensional incommensurable preferences, establishing a theoretical foundation for designing verifiable shutdown protocols.
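To make the uncertainty condition in contribution (1) concrete, here is a minimal sketch in the spirit of the classic off-switch game of Hadfield-Menell et al.; the belief distributions and payoff structure are illustrative assumptions, not the paper's exact signaling model. It compares the AI's expected payoff for acting immediately, disabling the off-switch, and deferring to a human who is assumed to rationally press the switch exactly when the proposed action harms her.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_payoffs(utility_samples):
    """Monte-Carlo payoffs for the three AI options in a toy off-switch game.

    utility_samples: draws from the AI's belief over the human's utility U
    for the proposed action (a hypothetical belief, not the paper's setup).
    """
    u = np.asarray(utility_samples, dtype=float)
    act     = u.mean()                 # act immediately, ignoring the human
    disable = u.mean()                 # disable the off-switch, then act
    # defer: a rational human lets the action through iff U > 0,
    # otherwise presses the off-switch (payoff 0)
    defer   = np.maximum(u, 0.0).mean()
    return {"act": act, "disable": disable, "defer": defer}

# Uncertain AI: a belief U ~ N(0.5, 1) puts mass on U < 0, so
# deferring strictly beats disabling the off-switch.
print(expected_payoffs(rng.normal(0.5, 1.0, 100_000)))

# Certain AI: under a point belief U = 0.5 deference has no value,
# so the AI is indifferent about keeping the off-switch at all.
print(expected_payoffs(np.full(100_000, 0.5)))
```

Deferring strictly dominates disabling the switch only when the AI's belief places mass on negative utilities; under a point belief the advantage vanishes, which is the sense in which uncertainty about the human's utility is necessary for the AI to preserve the shutdown mechanism.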
📝 Abstract
The off-switch problem is a critical challenge in AI control: an AI system that resists being switched off poses a significant safety risk. In this paper, we model the off-switch problem as a signalling game in which a human decision-maker communicates their preferences about some underlying decision problem to an AI agent, which then selects actions to maximise the human's utility. We assume that the human is a boundedly rational agent and explore several bounded-rationality mechanisms. Using real machine learning models, we reprove prior results and demonstrate that a necessary condition for an AI system to refrain from disabling its off-switch is its uncertainty about the human's utility. We also analyse how message costs influence optimal strategies and extend the analysis to scenarios involving incomparability.
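The effect of message costs can be illustrated with a hedged two-type sketch; the type space {+1, -1}, the unit cost threshold, and the prior below are hypothetical simplifications, not the paper's signalling game. It shows how raising the cost of a "stop" message pushes the equilibrium from separating (shutdown requests are informative and obeyed) to pooling (the AI ignores the channel and acts on its prior).

```python
# Hypothetical two-type costly-signalling sketch (not the paper's model).
# The human's utility for the AI's proposed action is theta in {+1, -1};
# sending "stop" costs c > 0, silence is free, and the AI aborts on "stop".

def stop_is_worth_sending(theta: float, c: float) -> bool:
    # Stopping the action saves a theta = -1 human a loss of 1,
    # so paying the message cost is rational only while c < 1.
    return theta < 0 and c < 1.0

def equilibrium_outcome(c: float, prior_good: float = 0.6) -> str:
    if stop_is_worth_sending(-1.0, c):
        # Separating equilibrium: only harmed types pay the cost, so
        # "stop" is fully informative and the AI complies with it.
        return "separating: AI halts exactly when theta = -1"
    # Pooling: no type signals, so the AI ignores messages and acts
    # iff its prior expected utility prior_good - (1 - prior_good) > 0.
    ai_acts = prior_good - (1.0 - prior_good) > 0
    return f"pooling: AI ignores messages and {'acts' if ai_acts else 'halts'} on its prior"

for c in (0.2, 0.9, 1.5):
    print(f"message cost c = {c}: {equilibrium_outcome(c)}")
```

Once the cost exceeds the human's stake in stopping the action, no type signals, the message loses its informational content, and the shutdown channel effectively disappears.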