🤖 AI Summary
Current AI governance relies on behavioral evaluations to verify model safety, but these evaluations observe only model outputs and cannot access internal mechanisms, which creates a structural disconnect between safety claims and the evidence offered for them. This work formalizes that disconnect as the “audit gap” and introduces “fragile assurance” for cases where the evidence does not support the asserted safety claim, exposing the limits of behavioral evidence in verifying the absence of hidden objectives and long-horizon agentic behavior. Drawing on an analysis of a 21-instrument inventory, the study identifies an incentive gradient in which geopolitical and industrial pressures reward surface-level behavioral proxies over deep structural verification. It proposes bounding the evidentiary weight of behavioral assessments in legal text and extending voluntary pre-deployment access with mechanistic evidence, specifically linear probes, activation patching, and before/after-training comparisons.
📝 Abstract
This position paper argues that behavioral assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability. Yet current assurance methodologies (primarily behavioral evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviors these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient in which geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.