AI Summary
This work addresses LLM safety monitoring under a limited compute budget: identifying the inputs a cheap probe is likely to get wrong and delegating only those to a more expensive expert model. The authors propose a cascaded framework, Calibrate-Then-Delegate (CTD), built on a novel Delegation Value (DV) probe that directly estimates the expected gain from expert correction. A threshold on the DV signal is calibrated on held-out data via multiple hypothesis testing, which yields finite-sample guarantees on the delegation rate and avoids over-delegation. The method makes streaming, instance-level decisions and adapts budget allocation to input difficulty without requiring group labels. Across four safety datasets, it consistently outperforms uncertainty-based delegation at every budget level.
Abstract
Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertainty is a poor proxy for delegation benefit, as it ignores whether the expert would actually correct the error. To address this problem, we introduce Calibrate-Then-Delegate (CTD), a model-cascade approach that provides probabilistic guarantees on the computation cost while enabling instance-level (streaming) decisions. CTD builds on a novel delegation value (DV) probe, a lightweight model operating on the same internal representations as the safety probe that directly predicts the benefit of escalation. To enforce budget constraints, CTD calibrates a threshold on the DV signal using held-out data via multiple hypothesis testing, yielding finite-sample guarantees on the delegation rate. Evaluated on four safety datasets, CTD consistently outperforms uncertainty-based delegation at every budget level, avoids harmful over-delegation, and adapts budget allocation to input difficulty without requiring group labels.
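The calibration step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which multiple-testing procedure CTD uses, so the code below assumes one common choice: fixed-sequence testing with binomial tail p-values, walking candidate thresholds from most to least conservative. The function names (`binom_cdf`, `calibrate_threshold`, `should_delegate`) and the parameters `budget` and `delta` are hypothetical, and the guarantee relies on the assumption that the held-out calibration scores are i.i.d. with the inputs seen at deployment.

```python
import math


def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k), summed in log space to avoid overflow."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def calibrate_threshold(cal_scores, budget, delta, candidates):
    """Pick the lowest DV threshold certified to keep the delegation rate <= budget.

    Assumed procedure (not taken from the paper): fixed-sequence testing.
    For each candidate threshold, the null hypothesis is
    "true delegation rate > budget". Candidates are tested from the highest
    (fewest delegations) downward, and testing stops at the first null that
    cannot be rejected at level delta. Returns None if nothing is certified.
    """
    n = len(cal_scores)
    certified = None
    for lam in sorted(candidates, reverse=True):
        # Number of calibration inputs that would be delegated at this threshold.
        k = sum(s >= lam for s in cal_scores)
        # Valid p-value: under the null, k is stochastically larger than Binomial(n, budget).
        p_value = binom_cdf(k, n, budget)
        if p_value > delta:
            break
        certified = lam
    return certified


def should_delegate(dv_score, threshold):
    """Streaming, per-input decision: escalate when the DV score clears the threshold."""
    return threshold is not None and dv_score >= threshold
```

Under these assumptions, the returned threshold keeps the true delegation rate at or below `budget` with probability at least `1 - delta` over the draw of the calibration set. Once the threshold is fixed, `should_delegate` needs only the current input's DV score, which is what makes instance-level (streaming) decisions possible.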