Training ML Models with Predictable Failures

📅 2026-05-14

🤖 AI Summary
Feasible evaluation sets are rarely large enough to reveal the rare failures a machine learning model may encounter in deployment. This work analyzes the extrapolation estimator of Jones et al. (2025), which is grounded in extreme value theory and uses the k largest failure scores observed in the evaluation set to predict failure rates at deployment scale. A finite-k decomposition of the forecast error shows that the estimator typically over-predicts, which is the safety-favorable direction, but under-predicts when the evaluation set misses a rare high-failure mode that is present in deployment. To address that failure mode, the authors propose a forecastability loss applied during fine-tuning. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.
📝 Abstract
Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.
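To make the idea of extrapolating from the largest k failure scores concrete, here is a minimal sketch of a generic peaks-over-threshold tail fit. It is not the estimator of Jones et al. (2025) and not the forecastability loss; it assumes, purely for illustration, that score exceedances above the (k+1)-th largest evaluation score are roughly exponential. The function names `tail_prob_exponential` and `prob_any_failure`, and the failure threshold `t`, are hypothetical.

```python
import math


def tail_prob_exponential(scores, k, t):
    """Estimate P(score > t) from the k largest scores.

    Illustrative assumption: exceedances above the (k+1)-th largest
    score follow an exponential distribution. This is a generic
    extreme-value-style sketch, not the paper's estimator.
    """
    n = len(scores)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(scores)")
    ordered = sorted(scores, reverse=True)
    u = ordered[k]  # threshold: the (k+1)-th largest score
    if t < u:
        raise ValueError("t must be at or above the tail threshold")
    # Mean exceedance of the top-k scores over the threshold.
    mean_excess = sum(s - u for s in ordered[:k]) / k
    if mean_excess <= 0:
        raise ValueError("top-k scores are all tied with the threshold")
    # Empirical P(score > u) is k/n; the fitted tail extends it to t.
    return (k / n) * math.exp(-(t - u) / mean_excess)


def prob_any_failure(p, n_deploy):
    """P(at least one failure in n_deploy independent queries)."""
    if p >= 1.0:
        return 1.0
    # Numerically stable form of 1 - (1 - p) ** n_deploy for tiny p.
    return -math.expm1(n_deploy * math.log1p(-p))


if __name__ == "__main__":
    # Synthetic evaluation scores: quantiles of a unit exponential.
    n_eval = 1000
    scores = [-math.log(1 - (i + 0.5) / n_eval) for i in range(n_eval)]
    p = tail_prob_exponential(scores, k=50, t=10.0)
    print(p, prob_any_failure(p, n_deploy=1_000_000))
```

The threshold `t = 10.0` lies well above every score in the synthetic evaluation set, so the per-query failure probability comes entirely from the fitted tail. If deployment contains a high-failure mode that never appears among the evaluation scores, a fit like this cannot see it, which is the under-prediction scenario the abstract describes.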
Problem

Research questions and friction points this paper is trying to address.

failure prediction
deployment-scale failure rate
evaluation set limitation
model safety assessment
rare failure modes
Innovation

Methods, ideas, or system contributions that make the work stand out.

forecastability loss
failure rate prediction
safety evaluation
model fine-tuning
extrapolation bias