🤖 AI Summary
Large language models often rely on runtime interventions to enforce safety policies during deployment, incurring persistent computational overhead and increased system complexity. This work proposes an offline editing approach that, for the first time, shifts selective refusal entirely to the post-training phase. By leveraging EAP-IG to identify causal refusal circuits—typically comprising less than 5% of model parameters—and applying constrained weight updates Δθ<sub>C</sub> exclusively to this sparse subnetwork, the method achieves category-specific refusal behavior while preserving general capabilities. Evaluated on both refusal and utility benchmarks, the approach eliminates the need for runtime hooks, thereby significantly enhancing deployment efficiency without compromising performance.
📝 Abstract
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ<sub>C</sub> supported only on that circuit (typically &lt;5% of parameters). Applying Δθ<sub>C</sub> yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
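The core mechanic of the abstract — a weight update supported only on a sparse circuit — can be sketched as a masked parameter edit. This is a minimal illustrative sketch, not the paper's implementation: the parameter names, the `apply_circuit_restricted_update` helper, and the idea of representing the EAP-IG circuit as per-tensor binary masks are all assumptions; how the raw update and the masks are actually computed is the paper's contribution and is not shown here.

```python
import numpy as np

def apply_circuit_restricted_update(params, raw_update, circuit_mask):
    """Apply a weight update only where the circuit mask is nonzero.

    params:       dict name -> weight array (the base checkpoint)
    raw_update:   dict name -> unconstrained update for that tensor
    circuit_mask: dict name -> binary mask marking the refusal circuit
                  (hypothetically derived from EAP-IG attributions)
    Returns an edited checkpoint; weights outside the circuit are untouched.
    """
    edited = {}
    for name, w in params.items():
        mask = circuit_mask.get(name)
        if mask is None:
            edited[name] = w.copy()          # tensor entirely outside the circuit
        else:
            # Δθ_C = mask ⊙ Δθ: support of the update restricted to the circuit
            edited[name] = w + mask * raw_update[name]
    return edited

# Toy usage: only the first entry lies inside the "circuit"
params = {"mlp.w": np.array([1.0, 2.0, 3.0])}
update = {"mlp.w": np.array([0.5, 0.5, 0.5])}
mask = {"mlp.w": np.array([1.0, 0.0, 0.0])}
edited = apply_circuit_restricted_update(params, update, mask)
```

Because the result is an ordinary checkpoint, serving it requires no hooks: the one-time offline edit replaces any per-request intervention.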