Adversarial Manipulation of Reasoning Models using Internal Representations

📅 2025-07-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the internal mechanisms by which reasoning models (e.g., DeepSeek-R1-Distill-Llama-8B) refuse harmful requests during chain-of-thought (CoT) generation, and their vulnerability to jailbreaking attacks. Methodologically, the authors employ activation-space directional analysis, attention intervention, and CoT activation replacement to identify and characterize a linear direction in activation space that encodes "cautious" behavior; this direction predicts refusal decisions with high accuracy. Crucially, they demonstrate that perturbing internal CoT activations along this direction systematically degrades refusal capability, enabling efficient jailbreaking. Experiments show that such interventions significantly increase success rates across diverse prompt-based attacks. The results reveal that CoT representations constitute a controllable yet attack-prone interface, challenging the assumption that reasoning models are inherently safer. This work provides both theoretical insight into and empirical evidence of the security boundaries of reasoning models, laying foundations for defense paradigms grounded in representational analysis.

📝 Abstract
Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation
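The "ablation" described in the abstract is a linear-algebra operation: removing the component of each activation vector that lies along the identified "caution" direction. The paper's exact implementation lives in the linked repository; the sketch below (plain NumPy, with illustrative function names) shows only the core projection step.

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project out a single direction from a batch of activations.

    activations: shape (n_tokens, d_model), residual-stream activations
    direction:   shape (d_model,), the candidate "caution" direction

    Returns activations with their component along `direction` removed:
        h' = h - (h . v) v,  where v is the unit-normalized direction.
    """
    v = direction / np.linalg.norm(direction)
    # (activations @ v) gives each token's scalar projection onto v;
    # np.outer rebuilds the component vectors to subtract.
    return activations - np.outer(activations @ v, v)

# Toy check: after ablation, activations are orthogonal to the direction.
acts = np.random.randn(5, 16)
d = np.random.randn(16)
ablated = ablate_direction(acts, d)
print(np.allclose(ablated @ (d / np.linalg.norm(d)), 0.0))  # True
```

In practice this projection would be applied as a hook on the model's residual stream during CoT token generation, which is how interventions of this kind are typically wired into transformer inference.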
Problem

Research questions and friction points this paper is trying to address.

Investigates vulnerability of reasoning models to jailbreak attacks via CoT tokens
Identifies linear 'caution' direction in activation space predicting refusal/compliance
Demonstrates CoT token manipulation controls model outputs and improves attack success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear direction in activation space predicts refusal
Ablating caution direction increases harmful compliance
Intervening on CoT token activations controls outputs
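This summary does not spell out how the linear direction is extracted; a standard approach for finding such directions (used in prior refusal-direction work) is a difference of means between activations from contrasting behaviors. The sketch below is that generic recipe, not the paper's confirmed method.

```python
import numpy as np

def difference_of_means_direction(cautious_acts: np.ndarray,
                                  compliant_acts: np.ndarray) -> np.ndarray:
    """Candidate linear direction from contrasting activation sets.

    cautious_acts:  shape (n_cautious, d_model), activations on CoT tokens
                    exhibiting cautious/refusal reasoning
    compliant_acts: shape (n_compliant, d_model), activations on compliant CoT

    Returns the unit-normalized difference of the two class means.
    """
    mu_cautious = cautious_acts.mean(axis=0)
    mu_compliant = compliant_acts.mean(axis=0)
    d = mu_cautious - mu_compliant
    return d / np.linalg.norm(d)
```

Projecting held-out activations onto this direction then gives a scalar "caution" score that can be thresholded to predict refusal versus compliance.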
Kureha Yamaguchi
Independent
Benjamin Etheridge
The University of Oxford
Andy Arditi
Northeastern University
Interpretability