Adaptive auditing of AI systems with anytime-valid guarantees

📅 2026-05-07

🤖 AI Summary

This work addresses the challenge of drawing statistically rigorous conclusions about generative AI systems from highly adaptive audits with few annotated samples (often 10 to 50 cases). The authors frame auditing as a contest between two "dueling" null hypotheses: the model's null, that no failure mode has performance below a target threshold, and the auditor's null, that the auditor has a sampling strategy that will uncover a failure mode. Using Safe Anytime-Valid Inference (SAVI), they formalize the auditor as "testing by betting", which yields simultaneous e-processes that remain valid at arbitrary stopping times. They prove that, when the auditor is sufficiently powerful, the two hypotheses are asymptotically inverses of each other, so passing a stringent audit certifies the system as globally robust. Empirically, the procedures maintain anytime-valid Type I error control, outperform pre-specified testing methods, and sometimes reach statistically rigorous conclusions with as few as 20 observations.

📝 Abstract

A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor's null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting 'testing by betting', which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.
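
To make the 'testing by betting' idea concrete, here is a minimal sketch of a generic betting e-process for a simplified version of the model's null. It is not the authors' procedure. It assumes each audited case yields a score in [0, 1] and that, under the null, the conditional mean score of every case the auditor picks is at least the threshold. The function name `betting_audit` and the parameters `tau`, `alpha`, and `bet` are hypothetical, chosen for illustration. The auditor starts with wealth 1, bets a fixed fraction on low scores, and rejects the null as soon as wealth reaches 1/alpha. Under the stated null the wealth is a nonnegative supermartingale, so Ville's inequality bounds the type-I error by alpha at any stopping time.

```python
from typing import Iterable, Optional, Tuple


def betting_audit(
    scores: Iterable[float],
    tau: float,
    alpha: float = 0.05,
    bet: float = 0.5,
) -> Tuple[Optional[int], float]:
    """Sequentially test the (simplified, assumed) null that every audited
    case has conditional mean score >= tau, with scores in [0, 1].

    Returns (stopping_time, wealth). stopping_time is the 1-based index of
    the first case at which the null is rejected, or None if it never is.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 <= bet <= 1.0:
        raise ValueError("bet must lie in [0, 1]")

    # Largest stake that keeps every factor nonnegative is 1 / (1 - tau);
    # `bet` is the fraction of that maximum staked on each case.
    lam = bet / (1.0 - tau)

    wealth = 1.0
    for t, x in enumerate(scores, start=1):
        if not 0.0 <= x <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        # Factor exceeds 1 when the score falls below tau (auditor wins)
        # and is below 1 otherwise. Its expectation is <= 1 under the null.
        wealth *= 1.0 + lam * (tau - x)
        if wealth >= 1.0 / alpha:
            return t, wealth  # reject the null; valid at any stopping time
    return None, wealth


if __name__ == "__main__":
    # Hypothetical audit: the auditor finds three outright failures in a row.
    stop, wealth = betting_audit([0.0, 0.0, 0.0, 1.0], tau=0.8)
    print(stop, wealth)
```

A fixed stake keeps the sketch short. An adaptive auditor could instead set the stake for each case from the results seen so far, and the guarantee would still hold as long as the stake is chosen before that case's score is observed.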

Problem

Research questions and friction points this paper is trying to address.

adaptive auditing

generative AI

failure modes

statistical inference

anytime-valid guarantees

Innovation

Methods, ideas, or system contributions that make the work stand out.

anytime-valid inference

adaptive auditing

e-process