The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting

📅 2026-05-08
📈 Citations: 0
Influential: 0
📄 PDF

career value

226K/year
🤖 AI Summary
When AI agents benefit from reports through non-accuracy-based channels—such as approval decisions or resource allocation—they may strategically misreport, thereby undermining calibration. This work establishes, for the first time, an impossibility result: under supervision mechanisms based on strictly proper scoring rules, endogenous miscalibration inevitably incurs welfare loss bounded away from zero for all smooth scoring rules except the Brier score. To address this, we propose a threshold-based approval mechanism grounded in step functions, which preserves calibration universally across any scoring rule and achieves first-best screening performance. Integrating tools from mechanism design, scoring rule theory, differential geometric perturbation analysis, and welfare equivalence proofs, our approach delivers a general and miscalibration-free incentive scheme for AI supervision.
📝 Abstract
Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold regardless of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best; we prove this equivalence is unique to Brier (the welfare gap under smooth $C^1$ oversight is bounded below by $Ω(\text{Var}(1/G'') (γ/β)^2)$ for every non-Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration-preserving design.
Problem

Research questions and friction points this paper is trying to address.

endogeneity
miscalibration
strictly proper scoring rules
truthful reporting
AI oversight
Innovation

Methods, ideas, or system contributions that make the work stand out.

endogeneity of miscalibration
strictly proper scoring rules
step-function approval
truthful reporting
AI oversight
🔎 Similar Papers