🤖 AI Summary
Despite safety alignment, large language models (LLMs) remain vulnerable to simple rephrasing attacks—such as tense-based jailbreaking—exposing a critical deficiency in the temporal generalization of their refusal mechanisms. This work proposes the first mechanism-aware defense framework: (1) circuit analysis identifies attention heads critical to tense-related attacks; (2) causal attribution traces the propagation path of refusal signals; and (3) channel-wise activation scaling and preventive fine-tuning jointly recalibrate internal refusal behavior without degrading general capabilities. Evaluated on three mainstream LLMs, our method significantly reduces success rates of targeted jailbreaking attacks while mitigating over-refusal and preserving the utility–safety Pareto frontier. The core contribution lies in uncovering and repairing temporal sensitivity vulnerabilities at the neural mechanistic level—bridging interpretability and robustness through principled intervention.
📝 Abstract
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking demonstrates that models which refuse harmful requests often comply when the same requests are rephrased in the past tense, revealing a critical generalization gap in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreak, the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Lastly, we apply this vector within a "preventative fine-tuning" procedure, forcing the model to learn a more robust refusal mechanism. Across three LLMs, ASGuard effectively reduces the attack success rate of the targeted jailbreak while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
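The abstract does not include implementation details, but the core intervention it describes, a learned channel-wise scaling vector applied to the outputs of flagged attention heads, can be sketched in a few lines. The function name, shapes, and scale values below are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def scale_head_activation(head_out, scale_vec):
    """Apply a channel-wise scaling vector to one attention head's output.

    head_out:  hypothetical activations of shape [seq_len, d_head]
    scale_vec: learned per-channel multipliers of shape [d_head]
    In ASGuard, such a vector would be trained to dampen the channels
    implicated in the tense-changing attack while leaving others intact.
    """
    assert head_out.shape[-1] == scale_vec.shape[0]
    return head_out * scale_vec  # broadcasts over the sequence dimension

# Toy example: one head with d_head = 4 over a 3-token sequence.
rng = np.random.default_rng(0)
head_out = rng.standard_normal((3, 4))
scale_vec = np.array([1.0, 0.2, 1.0, 0.5])  # dampens channels 1 and 3

scaled = scale_head_activation(head_out, scale_vec)
```

In practice this scaling would be inserted into the model's forward pass (e.g. via a hook on the identified heads) and the vector optimized jointly with the preventative fine-tuning objective; the sketch only shows the recalibration step itself.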