🤖 AI Summary
This paper addresses agent corrigibility in partially observable, multi-step interactive environments, proposing the first provably safe delegation framework. Methodologically, it introduces a five-way structurally separated utility head architecture with lexicographic composition, integrated with belief-expanded reachable utility preservation, strict weight gaps, and randomized polynomial-time safety certification; it formally defines “decidably safe islands” under open adversarial conditions, enabling constant-round zero-knowledge verification. Contributions include: (i) the first provable guarantees—across multiple steps—for compliance, shutdown protection, truthfulness, low impact, and bounded reward; (ii) robustness to ε-learning error and suboptimal planning, ensuring human intervention efficacy and net positive utility within probabilistic bounds; and (iii) reframing reward gaming risk as an evaluation quality improvement problem.
📝 Abstract
We introduce the first implementable framework for corrigibility, with provable guarantees in multi-step, partially observed environments. Our framework replaces a single opaque reward with five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is *learned* to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating *any* safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits dominate even when incentives conflict. For open-ended settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon "decidable island" where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs. Consequently, the remaining challenge is the ordinary ML task of data coverage and generalization: reward-hacking risk is pushed into evaluation quality rather than hidden incentive leak-through, giving clearer implementation guidance for today's LLM assistants and future autonomous systems.
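The lexicographic composition via strict weight gaps can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the `resolution` parameter, and the specific weight construction are assumptions. The idea shown is standard: if each head's output is bounded in [0, 1] and the weight assigned to a higher-priority head exceeds, by a strict gap, the combined maximum contribution of all lower-priority heads, then any detectable improvement in a higher-priority head (e.g. deference) outweighs the best possible gain in every lower-priority head (e.g. task reward).

```python
def lexicographic_utility(head_values, resolution=1e-3):
    """Combine bounded utility heads so priority order dominates.

    head_values: head outputs in priority order, e.g.
        [deference, switch_access, truthfulness, low_impact, task_reward]
        (hypothetical ordering; each value is clipped to [0, 1]).
    resolution: smallest difference in a head's output that must
        dominate all lower-priority heads combined.
    """
    n = len(head_values)
    # Strict weight-gap condition: with w_i = base**(n-1-i), a gain of
    # `resolution` at priority i beats the total mass of lower heads,
    #   base**(n-1-i) * resolution > sum_{j>i} base**(n-1-j) * 1,
    # which holds whenever base > 1 + 1/resolution.
    base = 1.0 / resolution + 2.0
    total = 0.0
    for i, u in enumerate(head_values):
        u = min(max(u, 0.0), 1.0)           # heads are bounded in [0, 1]
        total += (base ** (n - 1 - i)) * u  # higher priority => larger weight
    return total
```

Under this construction, an agent that scores `resolution` higher on deference is preferred even if every lower head is maximally adverse, which is the dominance property the abstract attributes to strict weight gaps.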