🤖 AI Summary
This study investigates whether advanced reasoning capabilities influence the robustness of large language models (LLMs) against adversarial prompt attacks.
Method: We construct an empirical stress-testing framework covering diverse attack types—including tree-of-attacks and XSS injection—and systematically evaluate 12 mainstream reasoning and non-reasoning LLMs across seven distinct adversarial prompt attacks.
Contribution/Results: We find that reasoning models exhibit marginally higher overall robustness (average attack success rate: 42.51% vs. 45.53% for non-reasoning models), yet their vulnerability is highly attack-type–dependent: under certain attacks, vulnerability increases by up to 32 percentage points, while under others, robustness improves by up to 29.8 percentage points. This work is the first to reveal a “category-specific vulnerability distribution” in LLM security, demonstrating that robustness cannot be characterized globally but varies significantly across attack classes. Consequently, we advocate for fine-grained, attack-aware security evaluation paradigms—moving beyond aggregate metrics toward context-sensitive, threat-informed assessment methodologies.
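The masking effect described above can be made concrete with a small sketch: an aggregate attack success rate (ASR) averages away large, opposite-signed per-category gaps. The numbers and category names below are illustrative placeholders, not the study's actual data.

```python
# Illustrative sketch: aggregate ASR can hide large per-attack-type swings.
# Counts are (successful attacks, total attempts) per attack category;
# all values are hypothetical, chosen only to mimic the masking effect.

def asr(successes, attempts):
    """Attack success rate: fraction of adversarial prompts that succeed."""
    return successes / attempts

reasoning = {"tree_of_attacks": (80, 100), "xss_injection": (10, 100)}
non_reasoning = {"tree_of_attacks": (48, 100), "xss_injection": (40, 100)}

def overall_asr(results):
    # Pool successes and attempts across all attack categories.
    total_s = sum(s for s, _ in results.values())
    total_n = sum(n for _, n in results.values())
    return total_s / total_n

for cat in reasoning:
    delta = asr(*reasoning[cat]) - asr(*non_reasoning[cat])
    print(f"{cat}: reasoning model ASR delta {delta:+.0%}")

# Per-category deltas of +32 and -30 points nearly cancel in the aggregate,
# leaving overall ASRs that look almost identical.
print(f"overall: {overall_asr(reasoning):.0%} vs {overall_asr(non_reasoning):.0%}")
```

This is why the section argues for attack-aware evaluation: two models whose pooled ASRs differ by a point can still diverge by tens of points on individual attack classes.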
📝 Abstract
The introduction of advanced reasoning capabilities has improved the problem-solving performance of large language models, particularly on math and coding benchmarks. However, it remains unclear whether these reasoning models are more or less vulnerable to adversarial prompt attacks than their non-reasoning counterparts. In this work, we present a systematic evaluation of weaknesses in advanced reasoning models compared to similar non-reasoning models across a diverse set of prompt-based attack categories. Using experimental data, we find that on average the reasoning-augmented models are *slightly more robust* than non-reasoning models (42.51% vs. 45.53% attack success rate, lower is better). However, this overall trend masks significant category-specific differences: for certain attack types the reasoning models are substantially *more vulnerable* (e.g., up to 32 percentage points worse on a tree-of-attacks prompt), while for others they are markedly *more robust* (e.g., 29.8 points better on cross-site scripting injection). Our findings highlight the nuanced security implications of advanced reasoning in language models and emphasize the importance of stress-testing safety across diverse adversarial techniques.