🤖 AI Summary
This study investigates how scaling language model size affects adversarial robustness: as computational resources become increasingly accessible, which side, attacker or defender, benefits more from scale? We systematically evaluate models spanning three orders of magnitude in parameter count under diverse adversarial threats, including jailbreaking and prompt injection, using adversarial training, cross-threat transfer analysis, and multi-scale robustness evaluation. Our key findings are: (1) model scale alone does not reliably improve robustness; (2) larger models are more sample-efficient during adversarial training and generalize their defenses better across threat types; and (3) defensive gains compound with scale: under matched adversarial-training compute budgets, larger models show markedly greater resilience to novel attacks. Collectively, these results suggest that defenders could eventually gain the advantage as models scale.
📝 Abstract
Language models exhibit scaling laws, whereby increasing model and dataset size predictably decreases negative log likelihood, unlocking a dazzling array of capabilities. At the same time, even the most capable systems remain vulnerable to adversarial inputs such as jailbreaks and prompt injections, despite concerted efforts to make them robust. As compute becomes more accessible to both attackers and defenders, which side will benefit more from scale? We attempt to answer this question with a detailed study of robustness on language models spanning three orders of magnitude in parameter count. From the defender's perspective, we find that in the absence of other interventions, increasing model size alone does not consistently improve robustness. In adversarial training, we find that larger models are more sample-efficient but less compute-efficient than smaller models, and often generalize their defenses better to new threat models. From the attacker's perspective, we find that increasing attack compute smoothly and reliably increases attack success rate against both finetuned and adversarially trained models. Finally, we show that across the model sizes studied, doubling adversarial-training compute forces an attacker to increase attack compute by less than a factor of two to maintain the same attack success rate. However, adversarial training becomes increasingly effective as model size grows, suggesting that defenders could eventually gain the advantage at larger scales. These results underscore the value of adopting a scaling lens when discussing the robustness of frontier models.