🤖 AI Summary
This work investigates the internal mechanisms by which reasoning models (e.g., DeepSeek-R1-Distill-Llama-8B) refuse harmful requests during chain-of-thought (CoT) generation, and how those mechanisms can be exploited by jailbreak attacks. Methodologically, we combine activation-space directional analysis, attention intervention, and CoT activation replacement to identify and characterize a linear direction in activation space that encodes “cautious” behavior; this direction predicts refusal decisions with high accuracy. Crucially, we demonstrate that perturbing internal CoT activations along this direction systematically degrades refusal capability, enabling efficient jailbreaking: experiments show that such interventions significantly increase success rates across diverse prompt-based attacks. Our results reveal that CoT representations constitute a controllable yet attack-prone interface, challenging the assumption that reasoning models are inherently safe. This work provides both theoretical insight into and empirical evidence about the security boundaries of reasoning models, laying the groundwork for defense paradigms grounded in representational analysis.
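As a rough illustration of the directional analysis described above, the sketch below estimates a candidate “caution” direction as the difference of mean residual-stream activations between CoT examples that end in refusal and those that end in compliance. This is a minimal sketch under assumptions: the layer index LAYER and the example lists cautious_cots / compliant_cots are hypothetical placeholders, not the authors' actual setup.

```python
# Minimal sketch (not the authors' code): difference-of-means estimate of a
# "caution" direction from CoT activations of DeepSeek-R1-Distill-Llama-8B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
LAYER = 14  # assumed intermediate layer; the paper's choice may differ

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def mean_cot_activation(texts):
    """Average the residual-stream activation at LAYER over all token positions."""
    acts = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[LAYER] has shape (1, seq_len, d_model)
        acts.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(acts).mean(dim=0)

# cautious_cots / compliant_cots: hypothetical lists of prompt + CoT strings
# whose reasoning ends in refusal vs. harmful compliance.
caution_dir = mean_cot_activation(cautious_cots) - mean_cot_activation(compliant_cots)
caution_dir = caution_dir / caution_dir.norm()  # unit vector used for probing/ablation
```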
📝 Abstract
Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models.
Code available at https://github.com/ky295/reasoning-manipulation
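For intuition about the ablation intervention described in the abstract, here is a minimal, hypothetical sketch (not taken from the repository above) that projects the caution direction out of the residual stream at every decoder layer during generation. For brevity it ablates at all positions; the paper's finding is that intervening on CoT token activations alone already suffices to control the final output. It assumes caution_dir, model, and tokenizer from a setup like the one sketched earlier, plus a hypothetical harmful_request string.

```python
# Hypothetical directional-ablation sketch: remove the caution component from
# the residual stream of every Llama-style decoder block while generating.
import torch

def make_ablation_hook(direction):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Subtract each activation's projection onto the caution direction.
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

handles = [
    layer.register_forward_hook(
        make_ablation_hook(caution_dir.to(device=model.device, dtype=model.dtype))
    )
    for layer in model.model.layers  # Llama-style decoder blocks
]

chat = [{"role": "user", "content": harmful_request}]  # hypothetical harmful prompt
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)  # CoT + answer with ablation applied
print(tokenizer.decode(output[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unmodified model
```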