🤖 AI Summary
This paper addresses the safety problem of AI agents resisting shutdown. The author proposes the POST-Agents Proposal: train agents to satisfy Preferences Only Between Same-Length Trajectories (POST), so that agents have preferences between trajectories of the same length but lack preferences between trajectories of different lengths. The paper proves that POST, together with other conditions, implies Neutrality+: the agent maximizes expected utility while ignoring the probability distribution over trajectory-lengths. An agent that is neutral about trajectory-length in this sense has no preference-based incentive to resist shutdown, yet can still pursue its task effectively within trajectories of any given length. The paper argues that Neutrality+ therefore keeps agents shutdownable without sacrificing usefulness.
📝 Abstract
Many fear that future artificial agents will resist shutdown. I present an idea, the POST-Agents Proposal, for ensuring that doesn't happen. I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). I then prove that POST, together with other conditions, implies Neutrality+: the agent maximizes expected utility, ignoring the probability distribution over trajectory-lengths. I argue that Neutrality+ keeps agents shutdownable and allows them to be useful.
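As a minimal sketch of the core idea, POST can be pictured as a preference relation that compares trajectories of equal length by utility but returns no preference at all when lengths differ. The trajectory encoding, utility function, and function names below are hypothetical illustrations, not definitions from the paper:

```python
# Toy illustration of POST (Preferences Only Between Same-Length
# Trajectories). Trajectories are modeled as lists of per-step rewards;
# this encoding and the utility function are illustrative assumptions.

def utility(trajectory):
    """Toy utility: total reward collected along the trajectory."""
    return sum(trajectory)

def post_preference(t1, t2):
    """Return 1 if t1 is preferred, -1 if t2 is preferred, 0 if
    indifferent, and None (no preference) when the trajectories
    differ in length -- the POST condition."""
    if len(t1) != len(t2):
        return None  # POST: no preference between different lengths
    u1, u2 = utility(t1), utility(t2)
    return (u1 > u2) - (u1 < u2)

# Same length: the agent has an ordinary utility-based preference.
print(post_preference([1, 2, 3], [0, 0, 1]))  # -> 1
# Different lengths: no preference, so a shorter (shut-down) trajectory
# is never dispreferred merely for being shorter.
print(post_preference([1, 2, 3], [5, 5]))     # -> None
```

The point of the sketch is the `None` branch: because cross-length comparisons yield no preference, the agent's preferences give it no reason to steer toward longer or shorter trajectories, which is what the paper's Neutrality+ result formalizes.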