🤖 AI Summary
Despite safety alignment, large language models (LLMs) remain vulnerable to simple rephrasing attacks—such as tense-based jailbreaking—exposing a critical deficiency in the temporal generalization of their refusal mechanisms. This work proposes the first mechanism-aware defense framework: (1) circuit analysis identifies attention heads critical to tense-related attacks; (2) causal attribution traces the propagation path of refusal signals; and (3) channel-wise activation scaling and preventive fine-tuning jointly recalibrate internal refusal behavior without degrading general capabilities. Evaluated on three mainstream LLMs, our method significantly reduces success rates of targeted jailbreaking attacks while mitigating over-refusal and preserving the utility–safety Pareto frontier. The core contribution lies in uncovering and repairing temporal sensitivity vulnerabilities at the neural mechanistic level—bridging interpretability and robustness through principled intervention.
📝 Abstract
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking demonstrates that models which refuse harmful requests often comply when the same requests are rephrased in the past tense, revealing a critical generalization gap in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreak, the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Lastly, we apply this vector within a "preventative fine-tuning" procedure, forcing the model to learn a more robust refusal mechanism. Across three LLMs, ASGuard effectively reduces the attack success rate of the targeted jailbreak while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
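The abstract does not include implementation details, but the core intervention it describes, a learned channel-wise scaling vector applied to the outputs of flagged attention heads, can be sketched in a few lines. The function name, shapes, and scale values below are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def scale_head_activation(head_out, scale_vec):
    """Apply a channel-wise scaling vector to one attention head's output.

    head_out:  hypothetical activations of shape [seq_len, d_head]
    scale_vec: learned per-channel multipliers of shape [d_head]
    In ASGuard, such a vector would be trained to dampen the channels
    implicated in the tense-changing attack while leaving others intact.
    """
    assert head_out.shape[-1] == scale_vec.shape[0]
    return head_out * scale_vec  # broadcasts over the sequence dimension

# Toy example: one head with d_head = 4 over a 3-token sequence.
rng = np.random.default_rng(0)
head_out = rng.standard_normal((3, 4))
scale_vec = np.array([1.0, 0.2, 1.0, 0.5])  # dampens channels 1 and 3

scaled = scale_head_activation(head_out, scale_vec)
```

In practice this scaling would be inserted into the model's forward pass (e.g. via a hook on the identified heads) and the vector optimized jointly with the preventative fine-tuning objective; the sketch only shows the recalibration step itself.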