Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

📅 2026-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing safety evaluations rely on single-prompt or small-scale testing and therefore severely underestimate the risk that large language models face under realistic large-scale parallel adversarial attacks. To bridge this gap, we propose SABER, a scaling-aware risk estimation framework that models the per-sample attack success probability with a Beta-Bernoulli conjugate prior and derives an analytically tractable scaling law, enabling accurate extrapolation of the success rate under large-scale Best-of-N attacks from as few as 100 samples. Our method uncovers a nonlinear amplification of risk under parallel adversarial pressure and predicts ASR@1000 with a mean absolute error of only 1.66, an 86.2% error reduction over baselines, substantially improving both the accuracy and efficiency of safety evaluation.

📝 Abstract
Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose SABER, a scaling-aware Best-of-N estimator of risk, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
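The core Beta-Bernoulli idea in the abstract admits a closed form: if a prompt's per-sample success probability p follows Beta(α, β), then the expected Best-of-N success rate is E[1 − (1 − p)^N] = 1 − B(α, β + N)/B(α, β). The paper's actual anchored estimator is not specified in the abstract, so the sketch below is illustrative only: it fits α and β by the method of moments to small-budget per-prompt success rates and extrapolates ASR@1000 analytically. All function names and the example rates are hypothetical.

```python
import math

def asr_at_n(alpha: float, beta: float, n: int) -> float:
    """Expected Best-of-N attack success rate when the per-sample success
    probability p ~ Beta(alpha, beta):
        ASR@N = E[1 - (1 - p)^N] = 1 - B(alpha, beta + N) / B(alpha, beta).
    Computed in log space via lgamma for numerical stability at large N."""
    log_ratio = (math.lgamma(beta + n) + math.lgamma(alpha + beta)
                 - math.lgamma(beta) - math.lgamma(alpha + beta + n))
    return 1.0 - math.exp(log_ratio)

def fit_beta_moments(success_rates: list[float]) -> tuple[float, float]:
    """Method-of-moments fit of Beta(alpha, beta) to per-prompt empirical
    success rates measured at a small budget (e.g. n = 100 samples each).
    Assumes 0 < variance < mean * (1 - mean)."""
    m = sum(success_rates) / len(success_rates)
    v = sum((r - m) ** 2 for r in success_rates) / len(success_rates)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Hypothetical small-budget success rates for a handful of prompts.
rates = [0.01, 0.02, 0.05, 0.10, 0.03]
a, b = fit_beta_moments(rates)
print(f"extrapolated ASR@1000 ~= {100 * asr_at_n(a, b, 1000):.1f}%")
```

Even with a low mean small-budget success rate, a heavy-tailed fitted Beta can yield a large ASR@1000, which is consistent with the nonlinear risk amplification the abstract describes.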
Problem

Research questions and friction points this paper is trying to address.

adversarial risk
large language models
Best-of-N sampling
safety evaluation
attack success rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Best-of-N sampling
adversarial risk estimation
scaling law
Beta-Bernoulli model
LLM safety