Forecasting Rare Language Model Behaviors

📅 2025-02-24

🤖 AI Summary

Standard language model evaluations often miss rare, high-risk behaviors, such as giving instructions for dangerous chemical synthesis or taking power-seeking actions, that surface only at deployment scale (e.g., billions of requests). This paper addresses the gap by extrapolating from elicitation probabilities, where a query's elicitation probability is the probability that the query produces a target behavior. The authors show that the largest observed elicitation probabilities scale predictably with the number of queries, so risks measured in small-scale testing can be forecast for far larger query volumes. The forecasts predict the emergence of diverse undesirable behaviors across up to three orders of magnitude of query volume. This helps cover long-tail failure modes that conventional evaluations miss, letting developers anticipate and patch rare but severe failures before large-scale deployment.

📝 Abstract

Standard language model evaluations can fail to capture risks that emerge only at deployment scale. For example, a model may produce safe responses during a small-scale beta test, yet reveal dangerous information when processing billions of requests at deployment. To remedy this, we introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation. We make forecasts by studying each query's elicitation probability -- the probability the query produces a target behavior -- and demonstrate that the largest observed elicitation probabilities predictably scale with the number of queries. We find that our forecasts can predict the emergence of diverse undesirable behaviors -- such as assisting users with dangerous chemical synthesis or taking power-seeking actions -- across up to three orders of magnitude of query volume. Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments.
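The two quantities the abstract names can be illustrated with a minimal sketch. It is not the paper's forecasting procedure: it assumes each query's elicitation probability is already known, that queries are independent, and it uses made-up log-uniform probabilities. The function names `prob_any_elicited` and `largest_elicitation_prob` are hypothetical. It only shows why risk that is invisible in a small test can become likely at larger query volumes.

```python
import math
import random


def prob_any_elicited(probs):
    """P(at least one query elicits the behavior), assuming independent queries."""
    if any(p >= 1.0 for p in probs):
        return 1.0
    # Sum logs of (1 - p) for numerical stability with very small p.
    log_none = sum(math.log1p(-p) for p in probs)
    # 1 - exp(log_none), computed without cancellation error.
    return -math.expm1(log_none)


def largest_elicitation_prob(probs, sizes):
    """Largest elicitation probability among the first n queries, for each n."""
    return [max(probs[:n]) for n in sizes]


# Hypothetical data (an assumption, not from the paper):
# log-uniform elicitation probabilities between 1e-12 and 1e-2.
rng = random.Random(0)
probs = [10 ** -rng.uniform(2, 12) for _ in range(10_000)]

sizes = [10, 100, 1_000, 10_000]
for n, p_max in zip(sizes, largest_elicitation_prob(probs, sizes)):
    p_any = prob_any_elicited(probs[:n])
    print(f"n={n:>6}  max p={p_max:.2e}  P(any)={p_any:.3f}")
```

Because the largest elicitation probability over the first n queries can only stay the same or grow as n increases, a behavior that never appears in a small evaluation may still be expected once the query count is orders of magnitude larger.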

Problem

Research questions and friction points this paper is trying to address.

Forecasting rare language model behaviors

Predicting risks at deployment scale

Proactively anticipating and patching failures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Forecasting rare model behaviors

Scaling elicitation probability analysis

Predicting diverse undesirable behaviors
