SafeSeek: Universal Attribution of Safety Circuits in Language Models

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SafeSeek, a framework addressing the limitations of existing heuristic-based safety attribution methods, which struggle to reliably and universally identify the functional components governing safe behavior in large language models. SafeSeek enables unified circuit attribution across diverse safety scenarios, such as backdoor attacks and alignment, by moving beyond the conventional focus on isolated neurons or attention heads. Using differentiable binary masks and gradient-based optimization, it automatically extracts sparse yet functionally complete safety circuits at multiple granularities, enabling precise intervention and efficient fine-tuning. In backdoor settings, SafeSeek identifies a critical circuit at only 0.42% sparsity whose ablation reduces the attack success rate from 100% to 0.4%. In alignment settings, removing a circuit spanning just 0.79% of neurons raises the attack success rate from 0.8% to 96.9%, while freezing the identified circuit during helpfulness fine-tuning maintains 96.5% safety.

📝 Abstract
Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak resistance, backdoors) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods that focus on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, and integrates Safety Circuit Tuning to exploit these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key LLM safety scenarios: (1) backdoor attacks, identifying a backdoor circuit at 0.42% sparsity whose ablation drives the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit spanning 3.03% of heads and 0.79% of neurons, whose removal spikes ASR from 0.8% to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.
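The abstract's core mechanism, a differentiable mask optimized by gradient descent with a sparsity penalty, can be illustrated with a toy sketch. The example below is not the authors' implementation: it uses a simulated eight-neuron layer (with synthetic data and made-up hyperparameters) in which only two neurons drive the probed behavior, learns a per-neuron sigmoid mask that preserves that behavior, and then binarizes the mask at 0.5 to read off the "circuit".

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_probe = 8, 256

# Toy layer: neurons 0 and 3 carry the behavior; the rest have noise-level weights.
V = rng.normal(scale=0.02, size=n_neurons)
V[0], V[3] = 2.0, -1.5
X = rng.normal(size=(n_probe, n_neurons))   # probe inputs
y = X @ V                                   # behavior to preserve

logits = np.zeros(n_neurons)                # one learnable mask logit per neuron
lam, lr = 0.02, 1.0                         # sparsity weight, step size (toy values)

for _ in range(500):
    m = 1.0 / (1.0 + np.exp(-logits))       # soft (differentiable) mask in [0, 1]
    r = X @ (m * V) - y                     # behavioral residual under the mask
    # d/dm of  mean(r^2) + lam * sum(m):
    grad_m = 2.0 * (r[:, None] * X).mean(axis=0) * V + lam
    logits -= lr * grad_m * m * (1.0 - m)   # chain rule through the sigmoid

# Binarize: neurons whose mask survives the sparsity penalty form the circuit.
circuit = np.flatnonzero(1.0 / (1.0 + np.exp(-logits)) > 0.5)
print(circuit)
```

The sparsity term `lam * sum(m)` pushes every mask toward zero, but neurons whose removal would damage the probed behavior receive a larger opposing gradient, so only they stay above threshold; the paper's method applies the same trade-off jointly over heads and neurons of a real LLM (and binarizes during the forward pass rather than after training).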
Problem

Research questions and friction points this paper is trying to address.

safety attribution
language models
mechanistic interpretability
safety circuits
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

mechanistic interpretability
safety circuits
differentiable binary masks
Safety Circuit Tuning
sparse circuit extraction