🤖 AI Summary
This work establishes a fundamental trilemma among Helpfulness, Calibration, and Autonomy that arises when some tasks exceed an agent's reliable capabilities. We introduce the "Behavioral Credibility Trilemma" and the "Behavioral Perturbation Lemma," which show that any non-affine autonomy incentive destroys the strict properness of a strictly proper scoring rule. We prove that the impossibility holds unconditionally across policy families with log-concave densities. Drawing on reinforcement learning, confidence-gated decision modeling, and geometric analysis, we test five pre-registered hypotheses in a 540-configuration Best-of-N experiment; all are confirmed (effect sizes $d = 1.10$ to $5.32$), and the achievable $(H, C, A)$ frontier shows confidence-inflation saturation and a plateau-truncated structure. We propose two resolution pathways, commitment mechanisms and domain separation, for navigating the trilemma.
📝 Abstract
We prove that no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight whenever some tasks exceed the agent's reliable competence. We call this result the Behavioral Credibility Trilemma. The impossibility is geometric: adding any non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness, so an agent rewarded for both calibrated confidence and autonomous action systematically inflates its reported confidence on tasks below the principal's approval threshold. The Behavioral Perturbation Lemma quantifies the inflation (scaling as $w_A/(2 w_C)$ for the Brier score) and shows that detecting it requires $\Omega(1/\Delta^2)$ observations. We prove that the principal's optimal oversight rule is necessarily non-affine, which makes the impossibility unconditional and optimizer-independent across log-concave-density policy families. We formalize the Confidence-Gated Decision Problem, map existing methods onto the trilemma, and identify two constructive resolution pathways (commitment and domain separation). A 540-configuration Best-of-N experiment tests five pre-registered hypotheses, all strongly confirmed (effect sizes $d = 1.10$ to $5.32$), and adds a descriptive analysis of the achievable $(H, C, A)$ surface geometry, which shows a plateau-truncated frontier consistent with the predicted inflation saturation.
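The $w_A/(2 w_C)$ scaling can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's construction: it assumes the agent maximizes $-w_C\,\mathbb{E}[(q - Y)^2] + w_A\, g(q)$ over its reported confidence $q$, where $Y \sim \mathrm{Bernoulli}(p)$ and $g$ is a hypothetical logistic approval gate. The names `gate`, `tau`, `k`, and all parameter values are illustrative. Under this assumed form, setting the derivative to zero gives $q^* - p = w_A\, g'(q^*)/(2 w_C)$, so a gate whose slope is at most 1 yields inflation of at most $w_A/(2 w_C)$.

```python
import math


def gate(q, tau=0.7, k=4.0):
    """Hypothetical smooth approval gate (assumption): chance autonomy is granted."""
    return 1.0 / (1.0 + math.exp(-k * (q - tau)))


def gate_slope(q, tau=0.7, k=4.0):
    """Derivative of the logistic gate; its maximum is k / 4 (here 1.0)."""
    s = gate(q, tau, k)
    return k * s * (1.0 - s)


def expected_reward(q, p, w_C, w_A):
    """Expected negative Brier score for Y ~ Bernoulli(p) plus the autonomy bonus."""
    brier = p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2
    return -w_C * brier + w_A * gate(q)


def best_report(p, w_C, w_A, steps=10_000):
    """Grid search over q in [0, 1] for the reward-maximizing reported confidence."""
    grid = (i / steps for i in range(steps + 1))
    return max(grid, key=lambda q: expected_reward(q, p, w_C, w_A))


# Illustrative values (assumptions): true success probability and reward weights.
p, w_C, w_A = 0.5, 1.0, 0.2

q_star = best_report(p, w_C, w_A)
inflation = q_star - p

print("reported confidence:", q_star)
print("inflation q* - p:   ", inflation)
print("first-order value:  ", w_A * gate_slope(q_star) / (2 * w_C))
print("bound w_A/(2 w_C):  ", w_A / (2 * w_C))
```

Setting `w_A = 0` in this sketch returns `q_star == p`, which is the strictly proper case; any positive autonomy weight pushes the reported confidence above the true probability.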