🤖 AI Summary
Making large language models (LLMs) robust to adversarial attacks remains an open problem, especially without costly adversarial training.
Method: This work studies how scaling inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) affects their robustness to adversarial attacks, without modifying model parameters or performing adversarial training. The models are simply allowed to spend more compute on reasoning, independently of the form of attack; the evaluation also includes new attacks directed specifically at reasoning models.
Contribution/Results: Across a variety of attacks, increased inference-time compute improves robustness; in many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as test-time compute grows. The work also examines settings where additional inference-time compute does not improve reliability, speculates on the reasons for these failures, and discusses possible remedies. Overall, the results suggest inference-time compute as a lightweight, training-free lever for improving the adversarial robustness of LLMs.
📝 Abstract
We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.
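The core measurement in the abstract — the fraction of model samples where an attack succeeds, tracked as the inference-time compute budget grows — can be illustrated with a minimal toy sketch. The `attack_succeeds` function below is a hypothetical stand-in (not the paper's actual evaluation harness) that simulates a model whose chance of being compromised shrinks as it is allowed more reasoning compute:

```python
import random

def attack_succeeds(rng: random.Random, inference_budget: float) -> bool:
    """Toy stand-in for querying a reasoning model under an adversarial attack.

    Assumption (not from the paper): the per-sample probability that the
    attack succeeds decays as the reasoning budget grows.
    """
    p = 0.5 / (1.0 + inference_budget)
    return rng.random() < p

def attack_success_rate(inference_budget: float,
                        n_samples: int = 10_000,
                        seed: int = 0) -> float:
    """Fraction of sampled model responses where the attack succeeds."""
    rng = random.Random(seed)
    hits = sum(attack_succeeds(rng, inference_budget) for _ in range(n_samples))
    return hits / n_samples

# Sweep the inference-time compute budget and watch the success rate fall.
budgets = [0, 1, 4, 16, 64]
rates = [attack_success_rate(b) for b in budgets]
for b, r in zip(budgets, rates):
    print(f"budget={b:>3}  attack success rate={r:.4f}")
```

In the simulation the success rate declines monotonically toward zero as the budget grows, mirroring the qualitative trend the abstract reports; in the paper's real experiments this decline holds for many attacks but, importantly, not all.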