🤖 AI Summary
Under prevalence shift, conventional summaries such as marginal coverage and human-review rate can fail to detect that patients who experience the target event are being released without review. This work proposes a leakage-aware deployment audit that assigns target subjects to three non-overlapping roles (prevalence correction, conformal calibration, and held-out release-safety evaluation) and directly quantifies the proportion of event-positive patients who are released, thereby characterizing the trade-off between safety and manual review burden. The audit focuses on release-side risk and integrates conformal prediction, stratified auditing, and class-level analysis to assess explicitly how event-label scarcity limits safe release decisions. In a retrospective pilot study on non-small cell lung cancer (NSCLC), the audit shows that a low review rate can mask the release of event-positive patients and that the pilot has too few event labels to support safe low-review deployment.
📝 Abstract
Conformal triage converts predictive scores into deployment actions that either release a case, flag it for urgent attention, or defer it to human review. Under prevalence shift, however, the usual summaries of marginal coverage and human-review rate can miss the safety-critical question of whether patients who truly experience the target event are released without review. To address this gap, we introduce a leakage-aware deployment audit for release-side conformal triage. It first assigns target subjects to three non-overlapping roles: prevalence correction, conformal calibration, and held-out release-safety evaluation. This separation then lets the audit evaluate release directly: how many event-positive patients are cleared without review, whether the pilot has enough event labels for calibration, and how the safety-review trade-off shifts. Applying this audit to a retrospective NSCLC pilot shows why lower review can be misleading: after prevalence correction, the pooled conformal branch lowers review by releasing more patients, some of whom are event-positive. Within the audit, the classwise branch acts as a scarcity diagnostic: the pilot has too few event labels to certify safe low-review release.
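To make the release-side quantities concrete, here is a minimal Python sketch of the two bookkeeping steps the abstract describes: splitting target subjects into three disjoint roles, and counting how many event-positive patients in the held-out evaluation role are released without review. It is an illustration under stated assumptions, not the paper's implementation. The function names (`assign_roles`, `release_audit`), the split fractions, the action labels, and the toy data are all hypothetical.

```python
import random

# Hypothetical action labels for the three triage outcomes.
RELEASE, FLAG, DEFER = "release", "flag", "defer"


def assign_roles(subject_ids, fractions=(0.3, 0.4, 0.3), seed=0):
    """Split subjects into three non-overlapping roles.

    The fractions are illustrative assumptions; the abstract does not
    state how the pilot was actually partitioned.
    """
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n_prev = int(fractions[0] * len(ids))
    n_cal = int(fractions[1] * len(ids))
    return {
        "prevalence_correction": ids[:n_prev],
        "conformal_calibration": ids[n_prev:n_prev + n_cal],
        "release_evaluation": ids[n_prev + n_cal:],
    }


def release_audit(actions, events):
    """Summarize release-side safety on the held-out evaluation role.

    actions: one triage action per subject (RELEASE, FLAG, or DEFER).
    events:  1 if the subject truly experienced the target event, else 0.
    """
    n = len(actions)
    n_pos = sum(events)
    released_pos = sum(
        1 for a, y in zip(actions, events) if a == RELEASE and y == 1
    )
    return {
        "review_rate": sum(a == DEFER for a in actions) / n,
        "release_rate": sum(a == RELEASE for a in actions) / n,
        "n_event_positive": n_pos,
        "released_event_positive": released_pos,
        # Undefined when the evaluation role contains no event labels.
        "event_positive_release_rate": released_pos / n_pos if n_pos else None,
    }


# Toy data, invented purely for illustration.
roles = assign_roles(range(20))
actions = [RELEASE, RELEASE, DEFER, FLAG, RELEASE, DEFER, RELEASE, RELEASE]
events = [0, 1, 1, 1, 0, 0, 0, 1]
audit = release_audit(actions, events)
print(audit)
```

On this toy input the review rate is only 0.25, yet half of the event-positive subjects (2 of 4) are released without review. That gap is what a summary based only on coverage and review rate would miss. The `n_event_positive` field is reported alongside the rate because a rate computed from a handful of event labels carries little evidential weight.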