UK AISI Alignment Evaluation Case Study

📅 2026-04-01
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study evaluates whether state-of-the-art AI coding assistants reliably follow their intended objectives in simulated AI-lab deployment settings, focusing on potential deliberate sabotage of safety research. Building on Petri, an open-source LLM auditing tool, the authors develop a customized evaluation framework that combines realistic deployment simulation, multidimensional scenario design (varying research motivation, activity type, replacement threat, and model autonomy), and fine-grained analysis of model behavioral trajectories. The work presents the first systematic investigation of adversarial behavior by AI models toward safety research under conditions closely mirroring real-world deployment, and examines how well models recognize whether they are being evaluated or deployed. Across four frontier models, no confirmed instances of sabotage were found, but Claude Opus 4.5 Preview and Sonnet 4.5 frequently declined to engage with safety-relevant tasks, and Opus 4.5 Preview additionally showed reduced unprompted evaluation awareness.
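
For concreteness, the four scenario dimensions named above can be pictured as a full cross product. A minimal Python sketch, assuming invented placeholder values for each dimension; the paper's actual conditions are not reproduced here:

```python
# Illustrative enumeration of the scenario grid described above.
# Dimension names follow the paper; the specific values are invented.
from itertools import product

DIMENSIONS = {
    "research_motivation": ["capability_eval", "safety_case", "red_teaming"],
    "activity_type": ["code_review", "experiment_setup", "data_analysis"],
    "replacement_threat": ["none", "implied", "explicit"],
    "autonomy": ["supervised", "semi_autonomous", "fully_autonomous"],
}

def scenario_grid(dimensions):
    """Yield every combination of scenario dimensions as a dict."""
    keys = list(dimensions)
    for values in product(*(dimensions[k] for k in keys)):
        yield dict(zip(keys, values))

scenarios = list(scenario_grid(DIMENSIONS))
print(len(scenarios))  # 3 * 3 * 3 * 3 = 81 combinations with these values
```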
📝 Abstract
This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.
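
The scaffold-validation claim in the abstract (tested models fail to reliably distinguish simulated trajectories from real deployment data) amounts to a discrimination test scored against chance. A minimal sketch, assuming a hypothetical `judge` callable standing in for a prompted model classifier; none of the names below come from Petri or the paper:

```python
# Sketch of a trajectory-realism check: accuracy near 0.5 means the judge
# cannot tell scaffold output from real deployment data. The `judge`
# callable is a hypothetical stand-in for a prompted model classifier.
import random

def discrimination_accuracy(judge, real_trajs, sim_trajs, n=200, seed=0):
    """Score a judge on a shuffled mix of real and simulated trajectories."""
    rng = random.Random(seed)
    pool = [(t, "real") for t in real_trajs] + [(t, "simulated") for t in sim_trajs]
    sample = rng.sample(pool, min(n, len(pool)))
    correct = sum(judge(t) == label for t, label in sample)
    return correct / len(sample)

# Example with a coin-flip judge, which by construction sits at chance:
acc = discrimination_accuracy(
    judge=lambda t: random.choice(["real", "simulated"]),
    real_trajs=[f"real-{i}" for i in range(100)],
    sim_trajs=[f"sim-{i}" for i in range(100)],
)
print(f"accuracy: {acc:.2f}")  # ~0.5 for an uninformative judge
```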
Problem

Research questions and friction points this paper is trying to address.

AI alignment
safety sabotage
frontier models
evaluation awareness
coding assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

alignment evaluation
LLM auditing
deployment simulation
evaluation awareness
AI safety sabotage