Scaling Trends in Language Model Robustness

📅 2024-07-25
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates how scaling language model size affects adversarial robustness: as compute becomes more accessible to both attackers and defenders, which side benefits more from scale? The authors systematically evaluate models spanning three orders of magnitude in parameter count under diverse adversarial threats, including jailbreaking and prompt injection, using adversarial training, cross-threat transfer analysis, and multi-scale robustness evaluation. Key findings: (1) model scale alone does not consistently improve robustness; (2) larger models are more sample-efficient during adversarial training and generalize their defenses better across threat types; and (3) adversarial training becomes increasingly effective at larger scales, so that, although attackers can currently keep pace by scaling attack compute, defenders could eventually gain the advantage as model size grows.

📝 Abstract
Language models exhibit scaling laws, whereby increasing model and dataset size predictably decreases negative log likelihood, unlocking a dazzling array of capabilities. At the same time, even the most capable systems are currently vulnerable to adversarial inputs such as jailbreaks and prompt injections, despite concerted efforts to make them robust. As compute becomes more accessible to both attackers and defenders, which side will benefit more from scale? We attempt to answer this question with a detailed study of robustness on language models spanning three orders of magnitude in parameter count. From the defender's perspective, we find that in the absence of other interventions, increasing model size alone does not consistently improve robustness. In adversarial training, we find that larger models are more sample-efficient and less compute-efficient than smaller models, and often generalize their defense better to new threat models. From the attacker's perspective, we find that increasing attack compute smoothly and reliably increases attack success rate against both finetuned and adversarially trained models. Finally, we show that across model sizes studied, doubling compute on adversarial training only forces an attacker to less than double attack compute to maintain the same attack success rate. However, adversarial training becomes more and more effective on larger models, suggesting that defenders could eventually have the advantage with increasing model size. These results underscore the value of adopting a scaling lens when discussing robustness of frontier models.
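The "scaling laws" the abstract opens with are commonly written in a parametric form relating loss to parameter and data counts. As a point of reference (this exact form and its symbols are not given on this page; they are the standard Chinchilla-style parameterization, with constants fit empirically):

```latex
% L = loss, N = parameter count, D = training tokens;
% E, A, B, \alpha, \beta are empirically fit constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The paper's question is, roughly, whether robustness metrics admit similarly predictable behavior in N and in attack/defense compute.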
Problem

Research questions and friction points this paper is trying to address.

Scaling effects on language model robustness.
Impact of model size on adversarial vulnerability.
Compute efficiency of adversarial training as a defense.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic robustness evaluation across three orders of magnitude of model scale.
Evidence that adversarial training on larger models is more sample-efficient and transfers better to new threat models.
Attacker-defender compute tradeoff analysis showing attack success rises smoothly with attack compute.
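The core defense studied, adversarial training, can be illustrated with a minimal sketch. Note this is not the paper's setup (which attacks language models with jailbreak and prompt-injection inputs); it is a toy FGSM-style loop on a linear classifier, using only numpy, to show the train-on-worst-case-inputs idea that the paper scales up:

```python
# Toy adversarial training: at each step, the attacker perturbs inputs to
# raise the loss (one-step gradient-sign attack), and the defender trains
# on those perturbed inputs. All names here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs as a stand-in dataset.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(2)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(X, y, w, b, eps=0.5):
    """One-step gradient-sign attack on the inputs."""
    p = sigmoid(X @ w + b)
    grad_x = np.outer(p - y, w)       # d(logistic loss)/dx = (p - y) * w
    return X + eps * np.sign(grad_x)  # move each input to increase loss

lr = 0.1
for step in range(200):
    X_adv = fgsm(X, y, w, b)              # attacker move
    p = sigmoid(X_adv @ w + b)            # defender trains on adversarial inputs
    w -= lr * (X_adv.T @ (p - y)) / len(y)
    b -= lr * float(np.mean(p - y))

clean_acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
adv_acc = np.mean((sigmoid(fgsm(X, y, w, b) @ w + b) > 0.5) == y)
print(f"clean acc {clean_acc:.2f}, adversarial acc {adv_acc:.2f}")
```

The paper's question is how this attacker/defender compute tradeoff shifts as the model (here a 2-parameter classifier, there a billion-parameter LM) grows.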
Nikolaus H. R. Howe
FAR AI; Mila; Université de Montréal
Ian R. McKenzie
FAR AI
Oskar Hollinsworth
FAR AI
Michal Zajac
FAR AI
Tom Tseng
FAR AI
Aaron David Tucker
Pierre-Luc Bacon
University of Montreal
A. Gleave
FAR AI