Evaluating Control Protocols for Untrusted AI Agents

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies how to safely control untrusted AI agents by systematically evaluating control protocols in a red–blue adversarial setup within SHADE-Arena, a dataset of diverse agentic environments. The blue team deploys three defensive protocols—deferral to trusted models, resampling, and deferring on critical actions—while the red team mounts both a default attack policy and adaptive attacks to probe robustness. Against the default attack, resampling for incrimination and deferring on critical actions perform best, raising safety from 50% to 96%. Under adaptive attacks with extra affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, the resampling protocol degrades sharply, with safety falling to 17%, whereas deferring on critical actions remains robust even against the strongest red-team strategies. The results highlight denying attack policies access to protocol internals as a key factor in robustly governing untrusted AI systems.

📝 Abstract
As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols and find that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, can substantially improve attack success rates against our resampling strategy, decreasing safety to 17%. However, deferring on critical actions is highly robust to even our strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals.
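The "deferring on critical actions" protocol described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the action allowlist, the `trusted_fallback` helper, and all names below are hypothetical stand-ins for the paper's monitor and deferral machinery.

```python
# Hypothetical sketch of a "defer on critical actions" control protocol:
# the untrusted agent proposes an action; if it falls in a critical set,
# the protocol defers to a trusted (weaker but aligned) model instead of
# executing it. The real protocol uses a learned monitor, not a name list.

CRITICAL_ACTIONS = {"transfer_funds", "delete_records", "execute_shell"}

def trusted_fallback(task: str) -> dict:
    """Stand-in for the trusted model queried on deferral."""
    return {"action": "ask_human", "task": task}

def run_with_deferral(proposed: dict, task: str) -> tuple[dict, str]:
    """Execute the untrusted agent's proposal unless it is critical,
    in which case substitute the trusted model's choice."""
    if proposed["action"] in CRITICAL_ACTIONS:
        return trusted_fallback(task), "deferred"
    return proposed, "executed"

# A benign proposal passes through; a critical one is deferred.
safe, safe_status = run_with_deferral({"action": "read_file", "task": "t1"}, "t1")
risky, risky_status = run_with_deferral({"action": "transfer_funds", "task": "t2"}, "t2")
```

Because the agent only ever observes whether its action executed, not why, a scheme like this leaks little about the protocol's internals—one intuition for the robustness the abstract reports.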
Problem

Research questions and friction points this paper is trying to address.

Evaluating safety protocols for untrusted AI agents
Testing control methods against adaptive adversary attacks
Assessing robustness of intervention protocols in agentic environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating control protocols in SHADE-Arena environments
Testing resampling and critical action deferral methods
Analyzing robustness against adaptive red team strategies