Training ML Models with Predictable Failures

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluation sets of feasible size are rarely large enough to reveal the rare failures a machine learning model may encounter in deployment. The extrapolation method of Jones et al. (2025), grounded in extreme value theory, uses the k largest failure scores observed in the evaluation set to predict failure rates at deployment scale. This work decomposes that estimator's forecast error at finite k and shows that it typically over-predicts, which is the safety-favorable direction, but that it can under-predict when the evaluation set misses a rare high-failure mode present in deployment. To address this failure mode, the authors propose a forecastability loss applied during fine-tuning. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task performance and achieving safety similar to that of supervised baselines.

📝 Abstract

Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.
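To make the idea of extrapolating from the largest k failure scores concrete, here is a minimal sketch of a generic peaks-over-threshold extrapolation. It is not the estimator of Jones et al. (2025), whose details are not given here. It assumes that score excesses above the (k+1)-th largest evaluation score follow an exponential tail and that deployment queries are independent. All function names are hypothetical.

```python
import math
import random


def fit_exponential_tail(scores, k):
    """Fit an exponential tail to the k largest scores.

    Assumption for illustration: excesses over the threshold are
    exponentially distributed. Returns (threshold, scale).
    """
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")
    ordered = sorted(scores, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest score
    excesses = [s - threshold for s in ordered[:k]]
    scale = sum(excesses) / k  # maximum-likelihood exponential scale
    if scale <= 0:
        raise ValueError("top-k scores are all tied; cannot fit a tail")
    return threshold, scale


def tail_probability(t, threshold, scale, k, n):
    """Estimate P(score > t) for a failure level t at or above the threshold."""
    if t < threshold:
        raise ValueError("t must be at or above the fitted threshold")
    # k/n is the empirical chance of exceeding the threshold at all.
    return (k / n) * math.exp(-(t - threshold) / scale)


def deployment_failure_probability(p_single, n_deploy):
    """P(at least one failure in n_deploy independent queries)."""
    if p_single >= 1.0:
        return 1.0
    # Numerically stable form of 1 - (1 - p)^N for small p and large N.
    return -math.expm1(n_deploy * math.log1p(-p_single))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic evaluation scores, for illustration only.
    eval_scores = [rng.expovariate(1.0) for _ in range(10_000)]
    threshold, scale = fit_exponential_tail(eval_scores, k=200)
    p = tail_probability(15.0, threshold, scale, k=200, n=len(eval_scores))
    print("per-query failure probability:", p)
    print("deployment-scale (1e6 queries):", deployment_failure_probability(p, 10**6))
```

A fit like this only sees the tail that the evaluation set actually sampled, so it illustrates the abstract's concern: a rare high-failure mode that is absent from the evaluation scores cannot influence the fitted threshold or scale, and the extrapolated rate can then under-predict at deployment scale.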

Problem

Research questions and friction points this paper is trying to address.

failure prediction

deployment-scale failure rate

evaluation set limitation

model safety assessment

rare failure modes

Innovation

Methods, ideas, or system contributions that make the work stand out.

forecastability loss

failure rate prediction

safety evaluation