🤖 AI Summary
Despite alignment and instruction tuning, large language models remain vulnerable to jailbreak attacks. This work proposes Head-Masked Nullspace Steering (HMNS), a circuit-level method that combines interpretability with geometric constraints. HMNS uses causal analysis to identify the attention heads most responsible for a model's default behavior, masks their write paths, and injects a perturbation constrained to the orthogonal complement of the muted subspace. Residual norm scaling and iterative re-identification of causal heads turn this into a closed-loop detection-and-intervention cycle across decoding attempts. Evaluated on multiple widely used models, jailbreak benchmarks, and safety defenses, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods.
📝 Abstract
Large language models remain vulnerable to jailbreak attacks (inputs designed to bypass safety mechanisms and elicit harmful responses) despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.
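The phrase "constrained to the orthogonal complement of the muted subspace" is a statement of ordinary linear algebra, and a toy example may make the geometry concrete. The sketch below is not the authors' implementation and involves no model: it takes a few hypothetical column vectors (`muted_columns`) and an arbitrary vector (`delta`), both invented here for illustration, builds an orthonormal basis for the span of the columns with Gram-Schmidt, and removes every component of the vector that lies in that span.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(delta, muted_columns):
    """Remove from delta every component lying in span(muted_columns)."""
    out = list(delta)
    for q in orthonormal_basis(muted_columns):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# Toy data (assumed for illustration only): two directions in R^4.
muted_columns = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
delta = [3.0, -2.0, 5.0, 1.0]

projected = project_to_complement(delta, muted_columns)
print(projected)
```

After projection, the result has zero inner product with every muted column, which is what "constrained to the orthogonal complement" means geometrically.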