🤖 AI Summary
This work addresses a critical challenge in AI model deployment: how to selectively abstain from predictions under uncertainty while providing rigorous, finite-sample risk control for trusted predictions. The authors propose the SCoRE framework, which introduces, for the first time, a general e-value-based risk control mechanism that integrates conformal inference with hypothesis testing. Requiring only data exchangeability, with no modeling assumptions or uniform convergence conditions, SCoRE delivers decisions with finite-sample reliability guarantees for any pretrained model and any user-specified bounded continuous risk. Notably, the method naturally accommodates distribution shifts. Empirical evaluations across drug discovery, health risk prediction, and large language models demonstrate that SCoRE effectively enforces strict control over positive-class risk.
📝 Abstract
In deploying artificial intelligence (AI) models, selective prediction offers the option to abstain from making a prediction when uncertain about model quality. To fulfill its promise, it is crucial to enforce strict and precise error control over cases where the model is trusted. We propose Selective Conformal Risk control with E-values (SCoRE), a new framework for deriving such decisions for any trained model and any user-defined, bounded, continuously-valued risk. SCoRE offers two types of guarantees on the risk among "positive" cases, i.e., those in which the system opts to trust the model. Built upon ideas from conformal inference and hypothesis testing, SCoRE first constructs a class of (generalized) e-values: non-negative random variables whose product with the unknown risk has expectation no greater than one. This property is ensured by data exchangeability alone, without any modeling assumptions. Feeding these e-values into hypothesis testing procedures then yields binary trust decisions with finite-sample error control. SCoRE avoids the need for uniform concentration and readily extends to settings with distribution shifts. We evaluate the proposed methods in simulations and demonstrate their efficacy through applications to error management in drug discovery, health risk prediction, and large language models.
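The abstract does not spell out how e-values become binary trust decisions, but the pipeline it describes can be illustrated with a standard e-value multiple-testing procedure such as e-BH (the e-value analogue of Benjamini-Hochberg). The sketch below is illustrative only, not SCoRE's actual construction; the function name and the toy e-values are assumptions made for the example.

```python
import numpy as np

def e_benjamini_hochberg(e_values, alpha=0.1):
    """e-BH: given m e-values (non-negative, each with expectation <= 1
    under its null), select the k* largest where k* is the largest k such
    that the k-th largest e-value is >= m / (alpha * k). This controls the
    false discovery rate at level alpha under arbitrary dependence."""
    e = np.asarray(e_values, dtype=float)
    m = len(e)
    sorted_e = np.sort(e)[::-1]               # descending order
    ks = np.arange(1, m + 1)
    ok = sorted_e >= m / (alpha * ks)         # which k satisfy the bar
    if not ok.any():
        return np.array([], dtype=int)        # trust no case
    k_star = ks[ok].max()
    threshold = m / (alpha * k_star)
    return np.flatnonzero(e >= threshold)     # indices of trusted cases

# Toy e-values: small values suggest abstaining, large values suggest
# the model can be trusted on that case.
rng = np.random.default_rng(0)
e_vals = np.concatenate([rng.exponential(0.5, size=8),    # weak evidence
                         rng.exponential(30.0, size=4)])  # strong evidence
trusted = e_benjamini_hochberg(e_vals, alpha=0.1)
print(trusted)
```

For a single case, Markov's inequality already gives a level-alpha test: an e-value with expectation at most one under the null exceeds 1/alpha with probability at most alpha. Procedures like e-BH extend this to many cases simultaneously, which is the kind of finite-sample, assumption-light error control the abstract refers to.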