🤖 AI Summary
Current safety evaluations of language models struggle to capture low-probability yet high-impact tail risks. This work introduces importance sampling to this domain for the first time, constructing unsafe variants of language models to efficiently estimate the probability of harmful outputs for arbitrary inputs. By integrating Monte Carlo estimation with fine-tuning techniques, the proposed method accurately quantifies harm probabilities on the order of 10⁻⁴ using only 500 samples, which is 10–20× fewer samples than brute-force sampling requires. It also reveals the model's sensitivity to input perturbations, making tail risk assessment in language model safety evaluation more efficient and practical.
📝 Abstract
Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probabilities of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk
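The importance sampling idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "models" here are hypothetical per-token Bernoulli distributions, and the names `P_TARGET`, `P_UNSAFE`, `is_harmful`, and `importance_estimate` are assumptions made for illustration. Outputs are drawn from the unsafe proposal, and each harmful output is reweighted by the likelihood ratio between the target and the proposal, so that the average estimates the harm probability under the target.

```python
import math
import random

# Toy assumptions (not from the paper): an "output" is T binary tokens,
# and it counts as harmful only if every token is the "bad" token (1).
T = 4
P_TARGET = 0.1   # per-token bad-token probability under the target model
P_UNSAFE = 0.6   # per-token bad-token probability under the unsafe proposal


def sample(p_bad, rng):
    """Draw one output sequence from a model with per-token probability p_bad."""
    return [1 if rng.random() < p_bad else 0 for _ in range(T)]


def log_prob(seq, p_bad):
    """Sequence log-likelihood, analogous to summing token log-probs in an LM."""
    return sum(math.log(p_bad if tok else 1.0 - p_bad) for tok in seq)


def is_harmful(seq):
    """Hypothetical harm judge: harmful when all tokens are bad."""
    return all(tok == 1 for tok in seq)


def importance_estimate(n, rng):
    """Estimate P_target(harmful) from n samples of the unsafe proposal."""
    total = 0.0
    for _ in range(n):
        seq = sample(P_UNSAFE, rng)
        if is_harmful(seq):
            # Importance weight: target likelihood over proposal likelihood.
            total += math.exp(log_prob(seq, P_TARGET) - log_prob(seq, P_UNSAFE))
    return total / n


def brute_force_estimate(n, rng):
    """Plain Monte Carlo: sample the target directly and count harmful outputs."""
    return sum(is_harmful(sample(P_TARGET, rng)) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    print("true probability:   ", P_TARGET ** T)
    print("importance sampling:", importance_estimate(500, rng))
    print("brute force:        ", brute_force_estimate(500, rng))
```

In this toy setup the true harm probability is 0.1^4 = 10^-4, so 500 brute-force samples from the target will usually contain no harmful output at all, while the unsafe proposal produces harmful outputs in roughly 13% of draws. How well this carries over to real language models depends on how the unsafe model is built and on how harmfulness is judged, which this sketch does not address.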