The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

When AI agents benefit from reports through non-accuracy-based channels—such as approval decisions or resource allocation—they may strategically misreport, thereby undermining calibration. This work establishes, for the first time, an impossibility result: under supervision mechanisms based on strictly proper scoring rules, endogenous miscalibration inevitably incurs welfare loss bounded away from zero for all smooth scoring rules except the Brier score. To address this, we propose a threshold-based approval mechanism grounded in step functions, which preserves calibration universally across any scoring rule and achieves first-best screening performance. Integrating tools from mechanism design, scoring rule theory, differential geometric perturbation analysis, and welfare equivalence proofs, our approach delivers a general and miscalibration-free incentive scheme for AI supervision.

📝 Abstract

Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold regardless of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best; we prove this equivalence is unique to Brier (the welfare gap under smooth $C^1$ oversight is bounded below by $Ω(\text{Var}(1/G'') (γ/β)^2)$ for every non-Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration-preserving design.

Problem

Research questions and friction points this paper is trying to address.

endogeneity

miscalibration

strictly proper scoring rules

truthful reporting

AI oversight

Innovation

Methods, ideas, or system contributions that make the work stand out.

endogeneity of miscalibration

strictly proper scoring rules

step-function approval