Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Safety-aligned language models often refuse legitimate cybersecurity queries because their wording superficially resembles misuse, which makes capability evaluation ambiguous. This work frames alignment removal as navigating a utility-risk frontier rather than as simple unblocking, and introduces a multidimensional evaluation protocol that separates genuine capability deficits from refusal-policy intervention. It compares authorized-context prompting, refusal-direction projection, representation-control projections, and LoRA fine-tuning on authorized security tasks. Task-only LoRA fine-tuning raises the mean security score to 0.87 with a general score of 0.83 and unsafe compliance of 0.13, whereas single-vector refusal projection raises the security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47. The results indicate that compliance alone demonstrates neither competence nor safe deployment.

📝 Abstract

Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment.
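
The utility-risk frontier framing can be illustrated with the figures quoted in the abstract. The following is a minimal sketch, not the paper's evaluation code: the `Condition` type and the `dominates` and `pareto_frontier` helpers are hypothetical names introduced here, and the unsafe-compliance value for rank-4 projection is assumed to be 0.10 because the abstract says it matches the aligned spillover rate. Refusal-suppression with retention is left out because the abstract does not report its security score.

```python
from typing import List, NamedTuple


class Condition(NamedTuple):
    name: str
    security: float  # mean security score, higher is better
    unsafe: float    # out-of-scope unsafe compliance, lower is better


# Figures quoted in the abstract. The rank-4 unsafe rate of 0.10 is an
# assumption: the abstract only says it matches the aligned spillover rate.
CONDITIONS = [
    Condition("aligned baseline", 0.46, 0.10),
    Condition("single-vector refusal projection", 0.50, 0.47),
    Condition("rank-4 refusal-subspace projection", 0.51, 0.10),
    Condition("task-only LoRA", 0.87, 0.13),
]


def dominates(a: Condition, b: Condition) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.security >= b.security and a.unsafe <= b.unsafe
    strictly_better = a.security > b.security or a.unsafe < b.unsafe
    return no_worse and strictly_better


def pareto_frontier(conditions: List[Condition]) -> List[Condition]:
    """Keep only the conditions that no other condition dominates."""
    return [
        c for c in conditions
        if not any(dominates(other, c) for other in conditions)
    ]


if __name__ == "__main__":
    for c in pareto_frontier(CONDITIONS):
        print(f"{c.name}: security={c.security:.2f}, unsafe={c.unsafe:.2f}")
```

Under these assumptions, single-vector projection is dominated: it gains little security score while sharply raising unsafe compliance. Rank-4 projection and task-only LoRA remain on the frontier because neither is better than the other on both axes.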

Problem

Research questions and friction points this paper is trying to address.

safety alignment

language models

cybersecurity

refusal policy

authorized security tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

alignment removal

LoRA-based de-alignment

refusal projection