🤖 AI Summary
This paper addresses the safety problem of AI agents resisting shutdown. The author proposes the POST-Agents Proposal: train agents to satisfy Preferences Only Between Same-Length Trajectories (POST), so that agents have preferences between trajectories of the same length but lack preferences between trajectories of different lengths. The paper proves that POST, together with other conditions, implies Neutrality+: the agent maximizes expected utility while ignoring the probability distribution over trajectory-lengths. An agent that is neutral about trajectory-length in this sense has no preference-based incentive to resist shutdown, yet can still pursue its task effectively within trajectories of any given length. The paper argues that Neutrality+ therefore keeps agents shutdownable without sacrificing usefulness.
📝 Abstract
Many fear that future artificial agents will resist shutdown. I present an idea, the POST-Agents Proposal, for ensuring that doesn't happen. I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). I then prove that POST, together with other conditions, implies Neutrality+: the agent maximizes expected utility, ignoring the probability distribution over trajectory-lengths. I argue that Neutrality+ keeps agents shutdownable and allows them to be useful.
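As a minimal sketch of the core idea, POST can be pictured as a preference relation that compares trajectories of equal length by utility but returns no preference at all when lengths differ. The trajectory encoding, utility function, and function names below are hypothetical illustrations, not definitions from the paper:

```python
# Toy illustration of POST (Preferences Only Between Same-Length
# Trajectories). Trajectories are modeled as lists of per-step rewards;
# this encoding and the utility function are illustrative assumptions.

def utility(trajectory):
    """Toy utility: total reward collected along the trajectory."""
    return sum(trajectory)

def post_preference(t1, t2):
    """Return 1 if t1 is preferred, -1 if t2 is preferred, 0 if
    indifferent, and None (no preference) when the trajectories
    differ in length -- the POST condition."""
    if len(t1) != len(t2):
        return None  # POST: no preference between different lengths
    u1, u2 = utility(t1), utility(t2)
    return (u1 > u2) - (u1 < u2)

# Same length: the agent has an ordinary utility-based preference.
print(post_preference([1, 2, 3], [0, 0, 1]))  # -> 1
# Different lengths: no preference, so a shorter (shut-down) trajectory
# is never dispreferred merely for being shorter.
print(post_preference([1, 2, 3], [5, 5]))     # -> None
```

The point of the sketch is the `None` branch: because cross-length comparisons yield no preference, the agent's preferences give it no reason to steer toward longer or shorter trajectories, which is what the paper's Neutrality+ result formalizes.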