Modeling Human Beliefs about AI Behavior for Scalable Oversight

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
As AI capabilities surpass human performance, human feedback becomes increasingly unreliable, undermining scalable oversight. Method: we formally model, for the first time, the role of human evaluators' belief structures in value inference, introduce the relaxation of human belief model covering, and propose constructing covering belief models with foundation models. Contribution: theoretically, we characterize the remaining ambiguity in value inference and derive sufficient conditions under which it disappears; practically, we provide formal guarantees and a viable pathway to robust supervision that does not require an exact model of the evaluator's beliefs, enhancing the scalability and trustworthiness of alignment for highly capable AI systems.
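
To make the identifiability point above concrete, here is a minimal numerical sketch under an assumed linear-reward reading (an illustration, not the paper's verbatim formalism): feedback constrains the value vector only along directions spanned by the evaluator's believed feature differences, so the ambiguity vanishes exactly when those differences span the full value space. The function and variable names below are hypothetical.

```python
# Hedged sketch: in an assumed linear setting, the unresolved ambiguity in
# value inference is the nullspace of the believed feature differences.
import numpy as np

def value_ambiguity_dim(belief_feature_diffs: np.ndarray) -> int:
    """Number of value directions left unconstrained by the human's feedback."""
    d = belief_feature_diffs.shape[1]
    return d - np.linalg.matrix_rank(belief_feature_diffs)

rng = np.random.default_rng(0)
spanning = rng.normal(size=(10, 4))                  # diffs spanning all of R^4
rank_deficient = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 4))  # rank 2

print(value_ambiguity_dim(spanning))        # 0 -> ambiguity disappears
print(value_ambiguity_dim(rank_deficient))  # 2 -> a 2-dim family of values fits
```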

📝 Abstract
Contemporary work in AI alignment often relies on human feedback to teach AI systems human preferences and values. Yet as AI systems grow more capable, human feedback becomes increasingly unreliable. This raises the problem of scalable oversight: How can we supervise AI systems that exceed human capabilities? In this work, we propose to model the human evaluator's beliefs about the AI system's behavior to better interpret the human's feedback. We formalize human belief models and theoretically analyze their role in inferring human values. We then characterize the remaining ambiguity in this inference and conditions under which the ambiguity disappears. To mitigate reliance on exact belief models, we then introduce the relaxation of human belief model covering. Finally, we propose using foundation models to construct covering belief models, providing a new potential approach to scalable oversight.
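
As a hedged illustration of the central idea, the sketch below simulates an evaluator who compares policies by the feature expectations they believe each policy achieves, via a Bradley-Terry choice model (an assumption for this example, not necessarily the paper's exact setup; names such as `belief_features` and `infer_weights` are illustrative). Interpreting the feedback through the belief model typically recovers the latent value vector more faithfully than fitting it against the policies' true features.

```python
# Hedged sketch: value inference from pairwise feedback, with and without
# a model of the human's beliefs about policy behavior.
import numpy as np

rng = np.random.default_rng(0)
d, n_policies, n_comparisons = 4, 30, 2000

true_features = rng.normal(size=(n_policies, d))    # what policies really do
belief_features = true_features + 0.3 * rng.normal(size=(n_policies, d))
w_true = rng.normal(size=d)                         # latent human values

def human_prefers(i, j):
    """Bradley-Terry choice driven by *believed* policy values."""
    vi, vj = belief_features[i] @ w_true, belief_features[j] @ w_true
    return rng.random() < 1.0 / (1.0 + np.exp(vj - vi))

pairs = rng.integers(0, n_policies, size=(n_comparisons, 2))
labels = np.array([human_prefers(i, j) for i, j in pairs], dtype=float)

def infer_weights(features, pairs, labels, lr=0.1, steps=500):
    """Logistic regression on feature differences (maximum likelihood)."""
    w = np.zeros(d)
    diffs = features[pairs[:, 0]] - features[pairs[:, 1]]
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-diffs @ w))
        w += lr * diffs.T @ (labels - p) / len(labels)
    return w

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

w_naive = infer_weights(true_features, pairs, labels)    # ignores beliefs
w_aware = infer_weights(belief_features, pairs, labels)  # uses belief model
print("alignment with true values, naive       :", round(cosine(w_naive, w_true), 3))
print("alignment with true values, belief-aware:", round(cosine(w_aware, w_true), 3))
```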
Problem

Research questions and friction points this paper is trying to address.

How should human evaluators' beliefs about AI behavior be modeled so that their feedback can be correctly interpreted for scalable oversight?
Human feedback becomes increasingly unreliable as AI systems grow more capable, corrupting the values inferred from it.
Can foundation models construct covering belief models, so oversight need not rely on an exact model of the evaluator's beliefs?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formally model human evaluators' beliefs about AI behavior and analyze their role in value inference
Relax the need for an exact belief model via the notion of human belief model covering (see the sketch below)
Use foundation models to construct covering belief models, yielding a new potential approach to scalable oversight
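
One plausible way to operationalize "covering" for illustration, under the same assumed linear-feature reading as above (this span-based definition is our assumption, not necessarily the paper's): a candidate belief model covers the human's if every believed feature vector the human might hold lies in the span of the candidate's, so inference under the candidate loses no information about the evaluator's beliefs.

```python
# Hedged sketch of a covering check under an assumed linear-span definition.
import numpy as np

def covers(candidate_feats: np.ndarray, human_feats: np.ndarray,
           tol: float = 1e-8) -> bool:
    """True if each row of human_feats lies in the row span of candidate_feats."""
    coeffs, *_ = np.linalg.lstsq(candidate_feats.T, human_feats.T, rcond=None)
    residual = human_feats.T - candidate_feats.T @ coeffs
    return bool(np.linalg.norm(residual) < tol)

rng = np.random.default_rng(1)
basis = rng.normal(size=(3, 5))             # candidate belief model's features
inside = rng.normal(size=(4, 3)) @ basis    # human beliefs within its span
outside = rng.normal(size=(4, 5))           # generic beliefs, almost surely not

print(covers(basis, inside))   # True: the candidate covers these beliefs
print(covers(basis, outside))  # False: it does not
```

In this reading, a foundation model would be prompted to enumerate a rich set of candidate belief features, so that plausible human beliefs fall inside its span even when the evaluator's exact beliefs are unknown.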