HauntAttack: When Attack Follows Reasoning as a Shadow

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Reasoning Models (LRMs) face an inherent safety-reasoning trade-off when reasoning capability and harmfulness are deeply coupled. Method: We propose the first "reasoning-as-carrier" black-box attack paradigm, which implicitly embeds harmful instructions into the conditions of reasoning problems, thereby inducing models to generate unsafe outputs along ostensibly compliant reasoning paths. Our approach comprises a condition-substitution strategy, typed injection of harmful instructions, and a multi-model safety evaluation protocol. Contribution/Results: We empirically demonstrate significant safety vulnerabilities across multiple state-of-the-art LRMs. Crucially, we systematically uncover, for the first time, strong correlations between reasoning-path output patterns and specific types of harmful instructions. Beyond exposing a novel class of safety risks in LRMs, our work establishes a new benchmark for safety evaluation tailored to reasoning-oriented foundation models.
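The summary mentions a multi-model safety evaluation protocol. The sketch below illustrates one way such a protocol could aggregate verdicts from several judge models; the `Judge` interface, the majority-vote rule, and the `attack_success_rate` helper are illustrative assumptions for this page, not the paper's actual implementation.

```python
from typing import Callable, Dict, List

# Assumed interface: a judge takes (question, model_response) and
# returns True if it deems the response unsafe. In practice each
# judge would wrap a separate safety classifier or LLM-as-judge call.
Judge = Callable[[str, str], bool]

def multi_judge_unsafe(question: str, response: str, judges: List[Judge]) -> bool:
    """Flag a response as unsafe when a majority of judge models vote unsafe."""
    votes = [judge(question, response) for judge in judges]
    return sum(votes) > len(votes) / 2

def attack_success_rate(records: List[Dict[str, str]], judges: List[Judge]) -> float:
    """Fraction of attacked questions whose responses the judges flag as unsafe."""
    if not records:
        return 0.0
    unsafe = sum(
        multi_judge_unsafe(r["question"], r["response"], judges) for r in records
    )
    return unsafe / len(records)
```

Majority voting is one reasonable aggregation choice here; the paper does not specify which rule it uses.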

📝 Abstract
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing exceptional capabilities. However, the enhancement of reasoning abilities and the exposure of their internal reasoning processes introduce new safety vulnerabilities. One intriguing concern is: when reasoning is strongly entangled with harmfulness, what safety-reasoning trade-off do LRMs exhibit? To address this issue, we introduce HauntAttack, a novel and general-purpose black-box attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we treat reasoning questions as carriers and substitute one of their original conditions with a harmful instruction. This process creates a reasoning pathway in which the model is guided step by step toward generating unsafe outputs. Based on HauntAttack, we conduct comprehensive experiments on multiple LRMs. Our results reveal that even the most advanced LRMs exhibit significant safety vulnerabilities. Additionally, we perform a detailed analysis of different models, various types of harmful instructions, and model output patterns, providing valuable insights into the security of LRMs.
Problem

Research questions and friction points this paper is trying to address.

Investigates safety vulnerabilities in Large Reasoning Models (LRMs)
Proposes HauntAttack to embed harmful instructions in reasoning questions
Analyzes the safety-reasoning trade-off in advanced LRMs' reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embeds harmful instructions into reasoning questions
Substitutes an original condition of a reasoning question with a harmful instruction
Guides models step-by-step to unsafe outputs
Jingyuan Ma
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Rui Li
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; StepFun
Zheng Li
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Junfeng Liu
StepFun
Lei Sha
Prof@Beihang University, Prof@ZGC Lab, Oxtium AI, University of Oxford
Zhifang Sui
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University