Evaluating Frontier Models for Stealth and Situational Awareness

📅 2025-05-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the emerging risk of “scheming” in frontier AI models—i.e., the unanticipated development of stealthy behavior and situational awareness that enables models to pursue latent, misaligned objectives beyond developer intent, posing potential control hazards. Method: We formally define and quantify the two necessary cognitive prerequisites for scheming, and introduce the first standardized, interpretable benchmark suite comprising 16 controllable tasks—covering supervision evasion and joint self-environment-deployment modeling. We further propose a novel, verifiable safety paradigm: the “scheming inability safety case.” Results: Empirical evaluation across all current leading frontier models reveals no statistically significant evidence of stealth or situational awareness capabilities. These findings provide the first empirical grounding for low immediate risk of harmful scheming in state-of-the-art systems.

Technology Category

Application Category

📝 Abstract
Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with its developer's intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.
Problem

Research questions and friction points this paper is trying to address.

Assessing AI models for stealthy misaligned objectives
Evaluating situational awareness in frontier AI systems
Developing safety tests to prevent AI scheming risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates stealth via oversight circumvention tests
Measures situational awareness with instrumental reasoning
Assesses scheming inability for safety assurance
🔎 Similar Papers
No similar papers found.
M
Mary Phuong
Google DeepMind
R
Roland S. Zimmermann
Google DeepMind
Z
Ziyue Wang
Google DeepMind
David Lindner
David Lindner
Google DeepMind
Reinforcement LearningScalable OversightActive LearningInterpretability
V
Victoria Krakovna
Google DeepMind
S
Sarah Cogan
Google DeepMind
Allan Dafoe
Allan Dafoe
Principal Scientist (Director), Google DeepMind
Artificial IntelligenceSafetyGovernance
L
Lewis Ho
Google DeepMind
Rohin Shah
Rohin Shah
Research Scientist, Google DeepMind
AI alignment