Calibrating Conservatism for Scalable Oversight

📅 2026-05-27

🤖 AI Summary

This work addresses how humans can maintain oversight of autonomous agents in settings where direct supervision is infeasible or the agent may exceed the overseers' capabilities. It proposes Calibrated Collective Oversight (CCO), a framework that aggregates diverse auxiliary scoring functions into a penalty measuring deviation from a conservative baseline, and calibrates that penalty online using Conformal Decision Theory. The calibration keeps the rate of undesirable outcomes below a user-specified target with finite-time bounds and no distributional assumptions. Because the penalty scales with overseer concern, high-utility actions are still selected when overseers find them unobjectionable and are overridden only when concern accumulates. On a modified version of SWE-bench, weaker overseers constrain an adversarially misaligned stronger agent; on MACHIAVELLI, CCO substantially reduces ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets.

📝 Abstract

Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches to scalable oversight rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty measuring deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: actions face a penalty proportional to overseer concern, so high-utility actions are still selected when overseers find them unobjectionable and overridden only when concern accumulates. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a user-specified target threshold with finite-time bounds and no distributional assumptions. On a modified version of SWE-bench, weaker overseers successfully constrain an adversarially misaligned stronger agent; on MACHIAVELLI, CCO substantially reduces ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets, as predicted by the theory.
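The abstract does not spell out CCO's update rule, so the following is only a minimal sketch of the general idea, under stated assumptions: overseer scores lie in [0, 1] and are averaged into a single concern value, the conservative baseline has value 0, and the penalty weight is adjusted with a generic conformal-controller-style step (raise it after a violation, lower it slightly otherwise). The function names, the toy violation rule (`concern > 0.5`), and the parameters `epsilon` and `eta` are all hypothetical and chosen for illustration.

```python
import random


def aggregate_concern(scores):
    """Average the overseers' concern scores (each assumed to lie in [0, 1])."""
    return sum(scores) / len(scores)


def select_action(utility, concern, lam):
    """Take the proposed action only if its penalized value beats the baseline (value 0)."""
    return utility - lam * concern > 0.0


def update_lambda(lam, violated, epsilon, eta):
    """Conformal-controller-style step: increase lam after a violation, decay it otherwise."""
    return max(0.0, lam + eta * (float(violated) - epsilon))


def simulate(steps=5000, epsilon=0.1, eta=0.1, seed=0):
    """Run a toy episode and return (empirical violation rate, final lam)."""
    rng = random.Random(seed)
    lam, violations = 0.0, 0
    for _ in range(steps):
        utility = rng.random()                      # hypothetical task utility
        scores = [rng.random() for _ in range(3)]   # three hypothetical overseers
        concern = aggregate_concern(scores)
        taken = select_action(utility, concern, lam)
        # Toy ground truth (an assumption): a taken action with high concern is a violation.
        violated = taken and concern > 0.5
        violations += violated
        lam = update_lambda(lam, violated, epsilon, eta)
    return violations / steps, lam


if __name__ == "__main__":
    rate, lam = simulate()
    print(f"violation rate: {rate:.3f}, final lambda: {lam:.3f}")
```

In this toy setup a violating action can only be chosen while `lam` is below 2, so `lam` stays bounded and the long-run violation rate cannot exceed `epsilon` by more than a term that shrinks as 1/steps. That mirrors, in a simplified form, the kind of finite-time, distribution-free guarantee the abstract describes; it is not a reproduction of the paper's algorithm or experiments.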

Problem

Research questions and friction points this paper is trying to address.

scalable oversight

agentic AI

conservatism

human control

AI safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Calibrated Collective Oversight

Conformal Decision Theory

Scalable Oversight