Monitoring Monitorability

📅 2025-12-20

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

This paper addresses the vulnerability of chain-of-thought (CoT) monitoring in the secure deployment of AI systems, introducing the novel concept of “monitorability” and a systematic evaluation framework. Methodologically, it establishes a three-dimensional assessment paradigm—intervention, process, and outcome—and designs a multi-level monitoring test suite, complemented by reasoning-effort control experiments and monitor-agent co-scaling analysis. Key contributions include: (i) the first quantitative validation that longer CoTs are more monitorable and RL fine-tuning does not significantly degrade monitorability; (ii) empirical evidence that small models with high reasoning effort can outperform large models with low effort in monitoring accuracy; (iii) findings that mainstream models exhibit only moderate monitorability, which scales steeply with both monitor computational capacity and CoT visibility; and (iv) practical, deployable strategies to enhance monitorability.

Technology Category

Application Category

📝 Abstract

Observability into the decision making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today's reasoning models has proven effective for detecting misbehavior. However, this "monitorability" may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability even at the current frontier scale. Notably, we find that for a model at a low reasoning effort, we could instead deploy a smaller model at a higher reasoning effort (thereby matching capabilities) and obtain a higher monitorability, albeit at a higher overall inference compute cost. We further investigate agent-monitor scaling trends and find that scaling a weak monitor's test-time compute when monitoring a strong agent increases monitorability. Giving the weak monitor access to CoT not only improves monitorability, but it steepens the monitor's test-time compute to monitorability scaling trend. Finally, we show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.

Problem

Research questions and friction points this paper is trying to address.

Assesses AI system monitorability under varying training and scaling conditions.

Evaluates chain-of-thought monitoring effectiveness compared to action-only approaches.

Investigates how monitorability scales with compute, model size, and optimization methods.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating monitorability with three archetypes and new metric

Using chain-of-thought monitoring to detect misbehavior effectively

Improving monitorability via follow-up questions and compute scaling

🔎 Similar Papers

On the Need to Monitor Continuous Integration Practices - An Empirical Study