🤖 AI Summary
This work investigates refusal behavior in large reasoning models (LRMs) and shows that refusal is governed not only by residual stream activations but also by the model's chain-of-thought (CoT), which limits the efficacy of conventional single-direction activation steering. To probe this, the authors propose a two-stage intervention: first, the model regenerates its CoT under activation steering; then the steering is removed to test whether the CoT alone can carry and reconstruct the compliance signal. Experiments on DeepSeek-R1-Distill-LLaMA-8B combine activation steering, CoT fixation, and ablation analyses. With the original CoT held fixed, activation steering reverses refusal in only 39% of cases; removing the CoT entirely raises this to 70%. When the CoT is regenerated under steering, the reversal rate reaches 94%, and the regenerated CoT alone retains 48% of this effect after steering is removed. The authors conclude that the CoT acts as a separate intervention surface alongside residual stream activations.
📝 Abstract
Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike in instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, and the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and the CoT. This joint encoding makes LRMs more robust against interventions that act on activations alone, but it also exposes the CoT as a possible alternative attack surface.
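For readers unfamiliar with activation steering, the following is a minimal sketch of the general idea on toy vectors. It is not the authors' implementation: the difference-of-means estimate of a "refusal direction", the function names (`refusal_direction`, `ablate`, `steer`), the scale `alpha`, and all numbers are illustrative assumptions. In a real model the vectors would be residual stream activations collected at some layer and token position.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(rows):
    # Column-wise mean of a list of equal-length vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def refusal_direction(refused_acts, complied_acts):
    # Assumption: estimate the direction as the normalized difference
    # between mean activations on refused and on complied prompts.
    mu_r = mean_vector(refused_acts)
    mu_c = mean_vector(complied_acts)
    return unit([a - b for a, b in zip(mu_r, mu_c)])

def ablate(h, r_hat):
    # Remove the component of h along r_hat (directional ablation).
    p = dot(h, r_hat)
    return [x - p * r for x, r in zip(h, r_hat)]

def steer(h, r_hat, alpha):
    # Shift h along r_hat; a negative alpha pushes away from refusal.
    return [x + alpha * r for x, r in zip(h, r_hat)]

# Toy 3-dimensional "activations" (made-up numbers, for illustration only).
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
r_hat = refusal_direction(refused, complied)

h = [3.0, 1.0, 0.5]
h_ablated = ablate(h, r_hat)
h_steered = steer(h, r_hat, alpha=-2.0)
```

Under these assumptions, `ablate` leaves no component along `r_hat`, while `steer` shifts the projection onto `r_hat` by exactly `alpha`. The abstract's point is that such an edit to activations may not suffice in an LRM when a fixed CoT still argues for refusal.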