Training Deliberative Monitors for Black-Box Scheming Detection

📅 2026-05-28

🤖 AI Summary

This work addresses the problem of reliably and cheaply detecting scheming and sabotage by autonomous agents in black-box settings, where the agent's reasoning and internal states are inaccessible. It brings deliberative alignment to black-box monitoring and proposes lightweight, open-weight monitors that operate solely on action trajectories. A frontier teacher model generates structured rationales from a scheming specification, a separate judge filters them, and the highest-quality rationales are distilled into a student model via supervised fine-tuning and reinforcement learning. Applied to Qwen3.5-27B, the resulting monitor outperforms all low-cost prompted baselines (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) as well as Gemini 2.5 Pro, at lower marginal inference cost. Several variants of the method lie on the empirical cost-performance Pareto frontier among the monitors evaluated, offering practical monitoring options with low false-positive rates.
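The filtering step described above can be pictured with a minimal sketch. Everything here is an assumption for illustration, not the paper's implementation: the `Rationale` record, the `judge_score` field (a hypothetical quality score in [0, 1] assigned by the judge), the `min_score` threshold, and the rule of keeping one best rationale per trajectory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rationale:
    """A teacher-generated rationale for one trajectory (hypothetical schema)."""
    trajectory_id: str
    text: str
    judge_score: float  # assumed: quality score from the separate judge, in [0, 1]


def select_for_distillation(rationales, min_score=0.8):
    """Keep, per trajectory, the single best rationale that clears the threshold.

    The threshold value and the one-per-trajectory rule are illustrative
    assumptions; the source only says the highest-quality rationales are kept.
    """
    best = {}
    for r in rationales:
        if r.judge_score < min_score:
            continue  # the judge rejected this rationale
        current = best.get(r.trajectory_id)
        if current is None or r.judge_score > current.judge_score:
            best[r.trajectory_id] = r
    # Sort by trajectory id so the output order is deterministic.
    return [best[key] for key in sorted(best)]
```

The selected rationales would then serve as supervised fine-tuning targets for the student monitor; the reinforcement learning stage is not sketched here.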

📝 Abstract

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable, or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models used as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5), and higher performance than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly 16 to 34 times higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.
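To make the cost metric and the Pareto claim concrete, here is a small sketch of how a token-metered cost per 1,000 evaluations and an empirical cost-performance frontier could be computed. The function names, the per-million-token pricing convention, and the dominance rule (no other monitor is at least as cheap and at least as good, with one strict improvement) are assumptions for illustration; the abstract does not specify the paper's exact procedure.

```python
def cost_per_1k_evals(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Marginal USD cost of 1,000 evaluations, given mean tokens per evaluation.

    Prices are assumed to be quoted in USD per million tokens.
    """
    usd_times_1e6_per_eval = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return 1000 * usd_times_1e6_per_eval / 1_000_000


def pareto_frontier(monitors):
    """Return names of monitors not dominated on (lower cost, higher performance).

    `monitors` is a list of (name, cost, performance) tuples.
    """
    frontier = []
    for name, cost, perf in monitors:
        dominated = any(
            other_cost <= cost
            and other_perf >= perf
            and (other_cost < cost or other_perf > perf)
            for _, other_cost, other_perf in monitors
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

For example, with placeholder monitors `[("m1", 1.0, 0.80), ("m2", 2.0, 0.75), ("m3", 20.0, 0.90)]` (made-up numbers, not results from the paper), `m2` is dominated by `m1`, while `m1` and `m3` both sit on the frontier: one is cheaper, the other performs better.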

Problem

Research questions and friction points this paper is trying to address.

scheming detection

black-box monitoring

agentic misalignment

action-only monitoring

AI control

Innovation

Methods, ideas, or system contributions that make the work stand out.

deliberative monitors

scheming detection

action-only monitoring