Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language models’ extended chain-of-thought reasoning to adversarial jailbreak and prompt injection attacks. To mitigate this, the authors propose Activation Consistency Training (ACT), a novel, interpretable, and efficient safety training paradigm that enforces consistency between internal activations elicited by clean prompts and those induced by adversarially rewritten prompts. ACT demonstrates strong robustness against adaptive attacks across multiple reasoning models: even when attackers manipulate the chain of thought, the model reliably refuses jailbreak attempts while preserving near-original performance on benign inputs. The study further reveals that ACT’s defense mechanism stems from an approximately linear shift in activation space, enabling the extraction of a single, effective refusal control direction.
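As a rough illustration of what enforcing activation consistency could look like, the sketch below computes a loss between activations from a clean prompt and from its adversarial rewrite. The summary does not state the exact objective, so the squared L2 distance, the position alignment, and the names `activation_consistency_loss`, `clean_acts`, and `wrapped_acts` are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def activation_consistency_loss(clean_acts, wrapped_acts):
    """Mean squared L2 distance between two aligned activation arrays.

    Both inputs have shape (num_positions, hidden_dim) and are assumed to be
    activations taken at matching token positions. Assumption: the clean
    activations serve as a fixed target for the wrapped ones.
    """
    clean_acts = np.asarray(clean_acts, dtype=float)
    wrapped_acts = np.asarray(wrapped_acts, dtype=float)
    if clean_acts.shape != wrapped_acts.shape:
        raise ValueError("activations must be aligned to the same shape")
    diff = wrapped_acts - clean_acts
    # Squared L2 norm per position, then averaged over positions.
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

# Toy data standing in for real model activations (hypothetical values).
rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8))
shifted = clean + 0.5  # toy "adversarial" shift on every coordinate

loss_same = activation_consistency_loss(clean, clean)
loss_shifted = activation_consistency_loss(clean, shifted)
```

In a real training loop this quantity would be computed on model activations and minimized with respect to the model parameters; the NumPy version only shows the shape of the computation.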

📝 Abstract

As LLMs gain stronger reasoning capabilities, their extended chain-of-thought introduces new degrees of complexity for defending against adversarial jailbreaks and prompt injection. We study consistency training, a family of fine-tuning objectives that enforce identical behavior on clean prompts and adversarial rewrites, and evaluate its two main variants, output-level (BCT) and activation-level (ACT), across five reasoning models. We formulate both methods as a prompt injection defense and find ACT to be competitive with other training-based defenses while requiring only self-supervised pairs of clean and wrapped prompts. Our experiments also extend both techniques to the jailbreak setting, demonstrating that ACT remains more robust to adaptive attacks. In addition, we provide mechanistic evidence that ACT's defense against jailbreaks is encoded as a roughly linear shift in activation space at the assistant-turn boundary. After ACT training, we can recover a single steering direction that controls refusal on reasoning models with minimal effect on benign inputs. ACT also remains robust when the model's chain-of-thought is replaced with a compliant trace from the undefended base model: the model pivots to refuse these prefilled jailbreaks. Together, these results suggest that supervising internal representations is a surprisingly effective and interpretable approach to various forms of safety training in reasoning models.
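To make the idea of a roughly linear shift and a single steering direction concrete, here is a minimal sketch that estimates a direction as the difference of mean activations between two sets and then adds it back as a steering vector. The abstract does not say how the direction is recovered, so the difference-of-means estimator, the scaling factor `alpha`, and the function names are assumptions for illustration only.

```python
import numpy as np

def difference_of_means_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_a to the mean of acts_b.

    Both inputs have shape (num_examples, hidden_dim).
    """
    direction = np.mean(acts_b, axis=0) - np.mean(acts_a, axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return direction / norm

def steer(acts, direction, alpha):
    """Shift every activation by alpha along the given direction."""
    return acts + alpha * direction

# Toy setup (hypothetical data): the second set is the first set moved by
# 3 units along a known axis, mimicking a purely linear shift.
rng = np.random.default_rng(0)
hidden_dim = 16
true_dir = np.zeros(hidden_dim)
true_dir[0] = 1.0
base_acts = rng.normal(size=(200, hidden_dim))
shifted_acts = base_acts + 3.0 * true_dir

direction = difference_of_means_direction(base_acts, shifted_acts)
steered = steer(base_acts, direction, alpha=3.0)
```

In this toy case the recovered direction coincides with the known axis, and steering the first set by the same magnitude reproduces the second. Real activations would only approximately satisfy this, in line with the "roughly linear" wording above.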

Problem

Research questions and friction points this paper is trying to address.

adversarial attacks

prompt injection

jailbreak

reasoning models

safety training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation Consistency Training

adversarial jailbreaks

prompt injection