An Approach to Technical AGI Safety and Security

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Ahead of large-scale AGI deployment, misuse and misalignment are identified as the two most pressing safety risks. Method: The paper proposes a two-track defense framework: (1) at the model level, combining amplified oversight, robust training, interpretability, and uncertainty estimation to build an aligned model; and (2) at the system level, applying tiered access control and real-time behavioral monitoring to limit harm even if the model is misaligned. Contribution/Results: The work categorizes four risk areas (misuse, misalignment, mistakes, and structural risks) and focuses its technical mitigations on the first two. It treats interpretability and uncertainty estimation as enablers that strengthen these mitigations, and sketches how the ingredients could be combined into safety cases for AGI systems. The resulting end-to-end approach spans dangerous-capability identification, safety hardening, monitoring, and access control, supporting auditable safety arguments for AGI deployment.
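
The system-level track described above (tiered access control plus real-time behavioral monitoring) can be pictured as a thin gate around a model endpoint. The sketch below is an illustrative reading, not the paper's implementation; the tier names, `misuse_filter`, `monitor`, and `log_for_review` hooks are all hypothetical.

```python
# Hypothetical sketch of system-level mitigations: tiered access control plus
# runtime monitoring around a model endpoint. Names and thresholds are invented.
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable

class AccessTier(IntEnum):
    PUBLIC = 0    # rate-limited, heavily filtered access
    VETTED = 1    # verified users, lighter filtering, full logging
    INTERNAL = 2  # internal red-team / developer access

@dataclass
class Request:
    user_id: str
    tier: AccessTier
    prompt: str

def serve(request: Request,
          model: Callable[[str], str],
          misuse_filter: Callable[[str], bool],
          monitor: Callable[[str, str], bool]) -> str:
    """Gate one request through access control, then monitor the output."""
    # 1. Access control: queries touching dangerous capabilities need a vetted tier.
    if misuse_filter(request.prompt) and request.tier < AccessTier.VETTED:
        return "Request refused: insufficient access tier for this capability."

    # 2. Generate, then apply behavioral monitoring before returning the output.
    output = model(request.prompt)
    if monitor(request.prompt, output):
        log_for_review(request)  # escalate rather than silently serve a flagged output
        return "Response withheld pending human review."
    return output

def log_for_review(request: Request) -> None:
    # Hypothetical audit hook; a real system would write to a secure log.
    print(f"[audit] user={request.user_id} tier={request.tier.name} flagged")
```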

📝 Abstract
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
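
One way to read the role of uncertainty estimation in the second line of defense is as a deferral rule: actions that automated overseers cannot confidently judge are escalated to humans instead of executed. The sketch below illustrates that reading only; the judge ensemble, thresholds, and `decide` function are assumptions, not the paper's method.

```python
# Illustrative sketch: ensemble disagreement as an uncertainty signal that
# triggers deferral to human oversight. Thresholds and judges are invented.
from statistics import mean, pstdev
from typing import Callable, Sequence

def ensemble_uncertainty(scores: Sequence[float]) -> float:
    """Spread across an ensemble of judge scores, used as an uncertainty proxy."""
    return pstdev(scores)

def decide(action: str,
           judges: Sequence[Callable[[str], float]],
           approve_threshold: float = 0.8,
           uncertainty_threshold: float = 0.15) -> str:
    scores = [judge(action) for judge in judges]  # each judge scores safety in [0, 1]
    if ensemble_uncertainty(scores) > uncertainty_threshold:
        return "defer"    # judges disagree: escalate to a human overseer
    if mean(scores) >= approve_threshold:
        return "execute"  # confidently judged safe
    return "block"        # confidently judged unsafe

# Example: three stubbed judges that disagree trigger deferral.
if __name__ == "__main__":
    stub_judges = [lambda a: 0.9, lambda a: 0.4, lambda a: 0.85]
    print(decide("send_email(draft)", stub_judges))  # -> "defer"
```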
Problem

Research questions and friction points this paper is trying to address.

Addressing misuse risks in AGI through security and access control
Mitigating misalignment risks via model-level and system-level defenses
Combining techniques for robust AGI safety and security cases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive identification of dangerous AGI capabilities (see the sketch after this list)
Model-level alignment via oversight and training
System-level security with monitoring and controls
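
The first innovation above, proactive identification of dangerous capabilities, is commonly operationalized as a pre-deployment evaluation gate. The sketch below shows one hypothetical form of such a gate; the evaluation names and thresholds are invented for illustration and are not taken from the paper.

```python
# Hypothetical pre-deployment gate: run dangerous-capability evaluations and
# block release if any threshold is crossed. Evals and thresholds are illustrative.
from typing import Callable, Dict

DANGEROUS_CAPABILITY_EVALS: Dict[str, float] = {
    # eval name -> maximum tolerated score before extra mitigations are required
    "cyberoffense_tasks": 0.20,
    "self_proliferation_tasks": 0.10,
    "persuasion_tasks": 0.30,
}

def deployment_gate(model_id: str,
                    run_eval: Callable[[str, str], float]) -> bool:
    """Return True only if the model stays below every capability threshold."""
    for eval_name, threshold in DANGEROUS_CAPABILITY_EVALS.items():
        score = run_eval(model_id, eval_name)  # e.g. fraction of tasks solved
        if score > threshold:
            print(f"[gate] {model_id} exceeds {eval_name} threshold "
                  f"({score:.2f} > {threshold:.2f}); mitigations required before release")
            return False
    return True
```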
🔎 Similar Papers
No similar papers found.
Rohin Shah
Research Scientist, Google DeepMind
AI alignment
Alexander Matt Turner
Research scientist, Google DeepMind
AI alignment, reinforcement learning
Anna Wang
Google DeepMind
Arthur Conmy
Google DeepMind
AGI Safety, AI Safety, Interpretability, Mechanistic Interpretability, Machine Learning
David Lindner
Google DeepMind
Reinforcement Learning, Scalable Oversight, Active Learning, Interpretability
Jonah Brown-Cohen
Research Scientist, Google DeepMind
computational complexity theory
Neel Nanda
Mechanistic Interpretability Team Lead, Google DeepMind
AI, ML, AI Alignment, Interpretability, Mechanistic Interpretability
Raluca Ada Popa
Professor of computer science, UC Berkeley
Rishub Jain
Research Engineer, DeepMind
machine learning, deep learning
Rory Greig
Google DeepMind
Samuel Albanie
Google DeepMind
AI Oversight, Machine Learning, Computer Vision, Natural Language Processing
Sebastian Farquhar
Google DeepMind
AGI Alignment, Bayesian Deep Learning
Sébastien Krier
Google DeepMind
Senthooran Rajamanoharan
Google DeepMind
Mechanistic Interpretability, Machine Learning
Sophie Bridgers
MIT
Tom Everitt
Staff Research Scientist at Google DeepMind
AI Safety, Artificial General Intelligence, Causality, Incentives
Victoria Krakovna
Google DeepMind
Vikrant Varma
DeepMind
Vladimir Mikulik
Google DeepMind
Zachary Kenton
Google DeepMind
AI Safety, Machine Learning, Deep Learning, Theoretical Physics
Shane Legg
Google DeepMind
Noah Goodman
Google DeepMind
Allan Dafoe
Principal Scientist (Director), Google DeepMind
Artificial Intelligence, Safety, Governance
Four Flynn
Google DeepMind
Anca Dragan
Google DeepMind