🤖 AI Summary
This paper addresses agent corrigibility in partially observable, multi-step interactive environments, proposing the first provably safe delegation framework. Methodologically, it introduces a five-way structurally separated utility head architecture with lexicographic composition, integrated with belief-expanded reachable utility preservation, strict weight gaps, and randomized polynomial-time safety certification; it formally defines “decidably safe islands” under open adversarial conditions, enabling constant-round zero-knowledge verification. Contributions include: (i) the first provable guarantees—across multiple steps—for compliance, shutdown protection, truthfulness, low impact, and bounded reward; (ii) robustness to ε-learning error and suboptimal planning, ensuring human intervention efficacy and net positive utility within probabilistic bounds; and (iii) reframing reward gaming risk as an evaluation quality improvement problem.
📝 Abstract
We introduce the first implementable framework for corrigibility, with provable guarantees in multi-step, partially observed environments. Our framework replaces a single opaque reward with five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is *learned* to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating *any* safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits dominate even when incentives conflict. For open-ended settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon "decidable island" where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs. Consequently, the remaining challenge is the ordinary ML task of data coverage and generalization: reward-hacking risk is pushed into evaluation quality rather than hidden incentive leak-through, giving clearer implementation guidance for today's LLM assistants and future autonomous systems.
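The lexicographic composition via strict weight gaps can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the `resolution` parameter, and the specific weight construction are assumptions. The idea shown is standard: if each head's output is bounded in [0, 1] and the weight assigned to a higher-priority head exceeds, by a strict gap, the combined maximum contribution of all lower-priority heads, then any detectable improvement in a higher-priority head (e.g. deference) outweighs the best possible gain in every lower-priority head (e.g. task reward).

```python
def lexicographic_utility(head_values, resolution=1e-3):
    """Combine bounded utility heads so priority order dominates.

    head_values: head outputs in priority order, e.g.
        [deference, switch_access, truthfulness, low_impact, task_reward]
        (hypothetical ordering; each value is clipped to [0, 1]).
    resolution: smallest difference in a head's output that must
        dominate all lower-priority heads combined.
    """
    n = len(head_values)
    # Strict weight-gap condition: with w_i = base**(n-1-i), a gain of
    # `resolution` at priority i beats the total mass of lower heads,
    #   base**(n-1-i) * resolution > sum_{j>i} base**(n-1-j) * 1,
    # which holds whenever base > 1 + 1/resolution.
    base = 1.0 / resolution + 2.0
    total = 0.0
    for i, u in enumerate(head_values):
        u = min(max(u, 0.0), 1.0)           # heads are bounded in [0, 1]
        total += (base ** (n - 1 - i)) * u  # higher priority => larger weight
    return total
```

Under this construction, an agent that scores `resolution` higher on deference is preferred even if every lower head is maximally adverse, which is the dominance property the abstract attributes to strict weight gaps.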