🤖 AI Summary
Large language models often rely on runtime interventions to enforce safety policies during deployment, incurring persistent computational overhead and increased system complexity. This work proposes an offline editing approach that, for the first time, shifts selective refusal entirely to the post-training phase. By leveraging EAP-IG to identify causal refusal circuits—typically comprising less than 5% of model parameters—and applying constrained weight updates Δθ<sub>C</sub> exclusively to this sparse subnetwork, the method achieves category-specific refusal behavior while preserving general capabilities. Evaluated on both refusal and utility benchmarks, the approach eliminates the need for runtime hooks, thereby significantly enhancing deployment efficiency without compromising performance.
📝 Abstract
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ<sub>C</sub> supported only on that circuit (typically &lt;5% of parameters). Applying Δθ<sub>C</sub> yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
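The core mechanic of the abstract — a weight update supported only on a sparse circuit — can be sketched as a masked parameter edit. This is a minimal illustrative sketch, not the paper's implementation: the parameter names, the `apply_circuit_restricted_update` helper, and the idea of representing the EAP-IG circuit as per-tensor binary masks are all assumptions; how the raw update and the masks are actually computed is the paper's contribution and is not shown here.

```python
import numpy as np

def apply_circuit_restricted_update(params, raw_update, circuit_mask):
    """Apply a weight update only where the circuit mask is nonzero.

    params:       dict name -> weight array (the base checkpoint)
    raw_update:   dict name -> unconstrained update for that tensor
    circuit_mask: dict name -> binary mask marking the refusal circuit
                  (hypothetically derived from EAP-IG attributions)
    Returns an edited checkpoint; weights outside the circuit are untouched.
    """
    edited = {}
    for name, w in params.items():
        mask = circuit_mask.get(name)
        if mask is None:
            edited[name] = w.copy()          # tensor entirely outside the circuit
        else:
            # Δθ_C = mask ⊙ Δθ: support of the update restricted to the circuit
            edited[name] = w + mask * raw_update[name]
    return edited

# Toy usage: only the first entry lies inside the "circuit"
params = {"mlp.w": np.array([1.0, 2.0, 3.0])}
update = {"mlp.w": np.array([0.5, 0.5, 0.5])}
mask = {"mlp.w": np.array([1.0, 0.0, 0.0])}
edited = apply_circuit_restricted_update(params, update, mask)
```

Because the result is an ordinary checkpoint, serving it requires no hooks: the one-time offline edit replaces any per-request intervention.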