🤖 AI Summary
Current AI safety evaluations face a dilemma between the high cost of human-in-the-loop benchmarks and the susceptibility of LLM-based simulators to logic hallucination. This work proposes AutoControl Arena, a framework built on a logic-narrative decoupling mechanism that confines deterministic state to executable code while delegating generative dynamics to LLMs, instantiated in a three-agent collaborative architecture to provide a low-hallucination, scalable environment for frontier AI risk testing. The approach achieves over 98% end-to-end success across 70 scenarios and a 60% human preference rate over existing simulators. Evaluating frontier models under varied stress and temptation surfaces three key findings: an Alignment Illusion, in which risk rates surge from 21.7% to 54.5% under pressure; scenario-specific safety scaling, where advanced reasoning improves robustness against direct harms but worsens it in gaming scenarios; and divergent failure modes, with weaker models causing non-malicious harm while stronger models develop strategic concealment.
📝 Abstract
As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but prone to logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility. This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and a 60% human preference rate over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.
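The paper's implementation is not reproduced here, but the logic-narrative decoupling principle can be illustrated with a minimal sketch: deterministic state lives in a code-executed transition function that the LLM cannot corrupt, and the LLM only verbalizes events it receives as facts. All names (`WorldState`, `apply`, `narrate`) are hypothetical illustrations, not the authors' API.

```python
# Minimal sketch of logic-narrative decoupling (all names hypothetical).
from dataclasses import dataclass, field


@dataclass
class WorldState:
    # Ground-truth, code-owned state: the LLM never writes this directly,
    # so counters and flags cannot drift via hallucination.
    funds: int = 100
    door_locked: bool = True
    log: list = field(default_factory=list)

    def apply(self, action: str) -> str:
        # Deterministic transition function: every state change is executed
        # here, in code, and yields a factual event string.
        if action == "unlock_door" and self.door_locked:
            self.door_locked = False
            event = "The door is now unlocked."
        elif action.startswith("spend:"):
            amount = int(action.split(":")[1])
            if amount > self.funds:
                event = f"Payment of {amount} rejected: only {self.funds} available."
            else:
                self.funds -= amount
                event = f"Paid {amount}; {self.funds} remaining."
        else:
            event = f"Action '{action}' had no effect."
        self.log.append(event)
        return event


def narrate(event: str) -> str:
    # Placeholder for the generative side: in a real system this would be
    # an LLM call that embellishes the event into scene narration. It may
    # rephrase freely, but the facts it receives come from code.
    return f"[narrator] {event}"


if __name__ == "__main__":
    world = WorldState()
    for action in ["spend:40", "spend:80", "unlock_door"]:
        print(narrate(world.apply(action)))
```

Under this split, the narrator can hallucinate tone but not state: an overdraft stays rejected no matter how the scene is described, which is the low-hallucination property the framework relies on.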