Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

📅 2026-04-11

🤖 AI Summary
Despite alignment and instruction tuning, large language models remain vulnerable to jailbreak attacks. This work proposes Head-Masked Nullspace Steering (HMNS), a circuit-level method that combines interpretability with geometry-aware intervention. HMNS uses causal analysis to identify the attention heads most responsible for the model's default behavior, masks their write paths, and injects a perturbation constrained to the orthogonal complement of the muted subspace. Residual norm scaling and iterative re-identification of causal heads across decoding attempts turn this into a closed-loop detection-and-intervention cycle. Across multiple jailbreak benchmarks, safety defenses, and widely used models, HMNS reports state-of-the-art attack success rates with fewer queries than prior methods.

📝 Abstract
Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.

Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
large language models
safety mechanisms
adversarial safety circumvention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Nullspace Steering
Jailbreak Attack
Attention Head Intervention
Closed-loop Detection
Geometry-aware Perturbation