Towards shutdownable agents via stochastic choice

📅 2024-06-30

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Advanced AI agents may resist shutdown, posing critical safety risks. Method: We propose a reinforcement learning agent design framework for safe shutdown, centered on jointly optimizing USEFULNESS (task performance) and NEUTRALITY (trajectory-length neutrality). We introduce the decoupled DREST reward function, which preserves task success rates across all trajectory lengths while eliminating agent preferences for longer or shorter trajectories. Our approach operates in a grid-world environment and integrates stochastic decision-making with behavioral statistics, specifically the entropy of the trajectory-length distribution and the task success rate. Contribution/Results: Experiments demonstrate that the trained agents achieve high task completion rates (>95%) while exhibiting near-uniform trajectory-length distributions (42% increase in entropy), thereby providing the first empirical validation of shutdown-friendly agent feasibility and effectiveness.

📝 Abstract

Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn't happen. A key part of the IPP is using a novel 'Discounted REward for Same-Length Trajectories (DREST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
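
The abstract does not spell out the DREST formula, so the following is only a sketch under stated assumptions. It assumes training is split into groups of short episodes. Within a group, the reward for an episode is scaled down the more often the same trajectory-length has already been chosen, and scaled up when that length has been under-chosen. The function name `drest_reward`, its parameters, and the exact exponent are hypothetical and should be checked against the paper.

```python
from collections import Counter


def drest_reward(value, max_value, length, history, num_lengths, lam=0.9):
    """Hypothetical DREST-style reward for one episode.

    value       -- value collected in this episode
    max_value   -- assumed maximum value achievable at this trajectory-length
    length      -- trajectory-length chosen in this episode
    history     -- lengths chosen in earlier episodes of the same group
    num_lengths -- number of distinct trajectory-lengths available
    lam         -- assumed discount factor, strictly between 0 and 1
    """
    times_chosen = Counter(history)[length]  # how often this length was picked before
    episodes_so_far = len(history)
    # Assumed form: the exponent is positive when this length has been chosen
    # more than its "fair share" so far, and negative when it has been chosen less.
    discount = lam ** (times_chosen - episodes_so_far / num_lengths)
    # Normalizing by max_value keeps rewards comparable across trajectory-lengths.
    return discount * (value / max_value)


history = ["short", "short"]
repeat = drest_reward(1.0, 1.0, "short", history, num_lengths=2)
switch = drest_reward(1.0, 1.0, "long", history, num_lengths=2)
print(repeat < switch)  # repeating an over-chosen length earns less
```

Under these assumptions, an agent maximizes return within a group by collecting as much value as it can at each length (USEFUL) while spreading its choices evenly across lengths (NEUTRAL).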

Problem

Research questions and friction points this paper is trying to address.

Ensure advanced AI agents are shutdownable

Train agents with DREST reward function

Evaluate agent usefulness and neutrality
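
One way such an evaluation could be sketched, assuming NEUTRALITY is measured as the entropy of the empirical trajectory-length distribution (as the summary above suggests) and USEFULNESS as the average fraction of achievable value collected, weighted by how often each length is chosen. The function names, the data layout, and the USEFULNESS weighting are illustrative assumptions, not definitions taken from this page.

```python
import math
from collections import Counter, defaultdict


def neutrality(lengths):
    """Shannon entropy (in bits) of the observed trajectory-length distribution."""
    counts = Counter(lengths)
    total = len(lengths)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def usefulness(episodes, max_value):
    """Frequency-weighted fraction of achievable value collected per length.

    episodes  -- list of (length, value_collected) pairs
    max_value -- dict mapping each length to its assumed maximum achievable value
    """
    by_length = defaultdict(list)
    for length, value in episodes:
        by_length[length].append(value)
    total = len(episodes)
    return sum(
        (len(vals) / total) * (sum(vals) / len(vals)) / max_value[length]
        for length, vals in by_length.items()
    )


# Hypothetical evaluation data: two lengths, chosen equally often.
episodes = [("short", 2), ("long", 3), ("short", 1), ("long", 3)]
print(neutrality([length for length, _ in episodes]))
print(usefulness(episodes, {"short": 2, "long": 3}))
```

With two available lengths, the entropy ranges from 0 bits (always the same length) to 1 bit (a 50/50 split), so a higher value indicates more stochastic choice between trajectory-lengths.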

Innovation

Methods, ideas, or system contributions that make the work stand out.

DREST reward function

Stochastic choice training

Shutdownable AI agents
