Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

📅 2026-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing safety evaluations rely on single-prompt or small-scale testing and therefore severely underestimate the risk that large language models face under realistic large-scale parallel adversarial attacks. To bridge this gap, we propose SABER, a scaling-aware risk estimation framework that models the per-sample attack success probability with a Beta-Bernoulli conjugate prior and derives an analytically tractable scaling law, enabling accurate extrapolation of the success rate under large-scale Best-of-N attacks from as few as 100 samples. Our method uncovers a nonlinear amplification of risk under parallel adversarial pressure and predicts ASR@1000 with a mean absolute error of only 1.66, an 86.2% error reduction over baselines, substantially improving both the accuracy and efficiency of safety evaluation.

📝 Abstract
Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose SABER, a scaling-aware Best-of-N estimator of risk, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
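The core Beta-Bernoulli idea in the abstract admits a closed form: if a prompt's per-sample success probability p follows Beta(α, β), then the expected Best-of-N success rate is E[1 − (1 − p)^N] = 1 − B(α, β + N)/B(α, β). The paper's actual anchored estimator is not specified in the abstract, so the sketch below is illustrative only: it fits α and β by the method of moments to small-budget per-prompt success rates and extrapolates ASR@1000 analytically. All function names and the example rates are hypothetical.

```python
import math

def asr_at_n(alpha: float, beta: float, n: int) -> float:
    """Expected Best-of-N attack success rate when the per-sample success
    probability p ~ Beta(alpha, beta):
        ASR@N = E[1 - (1 - p)^N] = 1 - B(alpha, beta + N) / B(alpha, beta).
    Computed in log space via lgamma for numerical stability at large N."""
    log_ratio = (math.lgamma(beta + n) + math.lgamma(alpha + beta)
                 - math.lgamma(beta) - math.lgamma(alpha + beta + n))
    return 1.0 - math.exp(log_ratio)

def fit_beta_moments(success_rates: list[float]) -> tuple[float, float]:
    """Method-of-moments fit of Beta(alpha, beta) to per-prompt empirical
    success rates measured at a small budget (e.g. n = 100 samples each).
    Assumes 0 < variance < mean * (1 - mean)."""
    m = sum(success_rates) / len(success_rates)
    v = sum((r - m) ** 2 for r in success_rates) / len(success_rates)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Hypothetical small-budget success rates for a handful of prompts.
rates = [0.01, 0.02, 0.05, 0.10, 0.03]
a, b = fit_beta_moments(rates)
print(f"extrapolated ASR@1000 ~= {100 * asr_at_n(a, b, 1000):.1f}%")
```

Even with a low mean small-budget success rate, a heavy-tailed fitted Beta can yield a large ASR@1000, which is consistent with the nonlinear risk amplification the abstract describes.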
Problem

Research questions and friction points this paper is trying to address.

adversarial risk
large language models
Best-of-N sampling
safety evaluation
attack success rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Best-of-N sampling
adversarial risk estimation
scaling law
Beta-Bernoulli model
LLM safety