Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing circuit discovery methods face a trade-off between efficiency and fidelity: attribution patching is efficient but deviates from the full model, whereas edge pruning is faithful yet computationally expensive. This paper proposes the Hybrid Attribution-and-Pruning (HAP) framework, which first employs attribution patching to rapidly identify high-contribution subgraphs, then applies edge pruning to extract faithful, sparse Transformer circuits from them. By combining the two techniques, the first such integration in circuit discovery, HAP preserves cooperative components (e.g., S-inhibition attention heads) while balancing speed and fidelity. Experiments on the indirect object identification task show that HAP achieves a 46% speedup over baseline methods while retaining critical functional circuit structures. The implementation is publicly available.

📝 Abstract
Interpreting language models often involves circuit analysis, which aims to identify sparse subnetworks, or circuits, that accomplish specific tasks. Existing circuit discovery algorithms face a fundamental trade-off: attribution patching is fast but unfaithful to the full model, while edge pruning is faithful but computationally expensive. This research proposes a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph, then applies edge pruning to extract a faithful circuit from it. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness. Furthermore, we present a case study on the Indirect Object Identification task, showing that our method preserves cooperative circuit components (e.g. S-inhibition heads) that attribution patching methods prune at high sparsity. Our results show that HAP could be an effective approach for improving the scalability of mechanistic interpretability research to larger models. Our code is available at https://anonymous.4open.science/r/HAP-circuit-discovery.
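As a rough illustration of the attribution-patching step described in the abstract, the standard first-order approximation estimates the effect of patching an activation from a gradient and an activation difference. The sketch below is a toy scalar version; the function name and inputs are illustrative, not taken from the paper's code:

```python
def attribution_patch_estimate(a_clean, a_corrupt, grad_clean):
    """First-order (linear) estimate of how the task metric changes if a
    clean activation is replaced by its corrupted counterpart:
        delta_metric ≈ (a_corrupt - a_clean) * d(metric)/d(a_clean)
    Toy scalar sketch; real implementations apply this per edge with tensors,
    which is what makes attribution patching cheap but approximate.
    """
    return (a_corrupt - a_clean) * grad_clean

# e.g. an activation shift of 2.0 with gradient 0.5 predicts a metric change of 1.0
print(attribution_patch_estimate(1.0, 3.0, 0.5))  # -> 1.0
```

The cheapness comes from needing only one clean pass, one corrupted pass, and one backward pass to score every edge at once; the infidelity the abstract mentions comes from this being a linear approximation to the true patching effect.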
Problem

Research questions and friction points this paper is trying to address.

How to discover faithful transformer circuits without prohibitive compute
How to resolve the trade-off between speed and faithfulness in circuit discovery
How to scale mechanistic interpretability to larger language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid framework combines attribution patching with edge pruning
Attribution patching rapidly identifies a high-potential subgraph
Edge pruning then extracts a faithful, sparse circuit from it
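The two-stage idea above can be sketched in a few lines. In this toy version all edge names, scores, and thresholds are made up, and the paper's edge-pruning stage is stood in for by a more expensive exact re-scoring of the candidate edges:

```python
def hap_circuit(edges, attr_score, exact_score, keep_frac=0.5, threshold=0.1):
    """Two-stage hybrid selection over a toy edge-scored computation graph.

    Stage 1 (attribution patching): rank all edges by cheap first-order
    scores and keep only the top fraction as a candidate subgraph.
    Stage 2 (edge-pruning stand-in): keep candidate edges whose more
    expensive faithfulness score clears a threshold.
    """
    ranked = sorted(edges, key=lambda e: abs(attr_score[e]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_frac))
    candidates = ranked[:n_keep]
    return [e for e in candidates if abs(exact_score[e]) >= threshold]

# Illustrative toy graph: 4 edges with made-up scores.
edges = ["q->k", "k->v", "v->o", "o->mlp"]
attr = {"q->k": 0.9, "k->v": 0.05, "v->o": 0.4, "o->mlp": 0.02}
exact = {"q->k": 0.8, "k->v": 0.3, "v->o": 0.02, "o->mlp": 0.6}
print(hap_circuit(edges, attr, exact))  # -> ['q->k']
```

The speedup claimed in the abstract comes from the same structure: the expensive stage 2 only ever runs on the small candidate subgraph that stage 1 selects, never on the full graph.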
Hao Gu
Sun Yat-Sen University
Planetary aeronomy · Atmospheric escape · Space physics
Vibhas Nair
Algoverse AI Research
Amrithaa Ashok Kumar
Algoverse AI Research
Jayvart Sharma
Algoverse AI Research
Ryan Lagasse
Algoverse AI Research