DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses fine-grained behavioral attribution in Transformer mechanistic interpretability by proposing DePass, a unified feature attribution framework based on a decomposed forward pass. DePass decomposes hidden states into customizable additive components and propagates them in a single forward pass with attention scores and MLP activations frozen, which supports token-level, component-level, and subspace-level attribution within one framework. It requires no auxiliary training. Evaluation on token-level, model-component-level, and subspace-level attribution tasks indicates that the attributions are faithful and that DePass can trace information flow between arbitrary components of a Transformer.

📝 Abstract
Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP's activations fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
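
The reason fixing the attention scores makes component-wise propagation possible can be written out for one attention head. The notation below is assumed for illustration and is not taken from the paper: $x_j^{(k)}$ is the $k$-th additive component of the hidden state at position $j$, $a_{ij}$ are the attention weights recorded from the ordinary forward pass, and $W_V$, $W_O$ are the value and output projections (biases and normalization are ignored).

```latex
x_j = \sum_{k} x_j^{(k)}, \qquad
o_i = \sum_{j} a_{ij}\, x_j W_V W_O
    = \sum_{k} \underbrace{\sum_{j} a_{ij}\, x_j^{(k)} W_V W_O}_{o_i^{(k)}}
```

Once $a_{ij}$ is treated as a constant, the head output is linear in the hidden states, so each component can be pushed through separately and the per-component outputs $o_i^{(k)}$ add up to the original output $o_i$.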

Problem

Research questions and friction points this paper is trying to address.

Attributing Transformer behavior to internal computations
Providing faithful feature attribution without auxiliary training
Attributing information flow between arbitrary model components

Innovation

Methods, ideas, or system contributions that make the work stand out.

Single decomposed forward pass for feature attribution
Decomposes hidden states into additive components
Propagates components with fixed attention and activations
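
The three points above can be sketched on a toy block. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head, a ReLU MLP, random weights, and no biases or layer normalization, and the names `block` and `decomposed_block` are hypothetical. In particular, freezing the MLP by holding the ReLU on/off pattern fixed is one simple way to make the MLP linear in its input; the rule DePass actually uses to distribute MLP outputs among components may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 4, 8, 16  # tokens, model width, MLP hidden width (toy sizes)

# Random toy weights for one attention head and one MLP (hypothetical).
W_q, W_k, W_v, W_o = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4))
W_in = rng.normal(size=(D, H)) / np.sqrt(D)
W_out = rng.normal(size=(H, D)) / np.sqrt(H)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def block(x):
    """Ordinary forward pass; also returns the quantities to freeze."""
    logits = (x @ W_q) @ (x @ W_k).T / np.sqrt(D)
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    A = softmax(np.where(causal_mask, -np.inf, logits))  # attention scores
    h = x + A @ (x @ W_v) @ W_o                          # residual + attention
    pre = h @ W_in
    gate = (pre > 0).astype(x.dtype)                     # ReLU(pre) == gate * pre
    y = h + (gate * pre) @ W_out                         # residual + MLP
    return y, A, gate


def decomposed_block(parts, A, gate):
    """Propagate each additive component with A and gate held fixed."""
    out = []
    for p in parts:
        h = p + A @ (p @ W_v) @ W_o
        out.append(h + (gate * (h @ W_in)) @ W_out)
    return np.stack(out)


x = rng.normal(size=(T, D))
y, A, gate = block(x)

# Token-level decomposition: component k keeps only row k of x.
parts = np.stack(
    [np.where(np.arange(T)[:, None] == k, x, 0.0) for k in range(T)]
)
parts_out = decomposed_block(parts, A, gate)

# Contribution of each input token to the final position's output state.
last_token_attribution = parts_out[:, -1, :]
print(np.allclose(parts_out.sum(axis=0), y))
```

Because both frozen operations are linear in the hidden state, the propagated components sum back to the ordinary output up to floating-point error, which is what the final check verifies. Other decompositions (for example, by subspace) would only change how `parts` is built, as long as the components still sum to `x`.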
Xiangyu Hong
Department of Electronic Engineering, Tsinghua University
Che Jiang
Tsinghua University
Kai Tian
Department of Electronic Engineering, Tsinghua University
Biqing Qi
Shanghai AI Laboratory
Youbang Sun
Assistant Researcher, Tsinghua University; Northeastern University; Texas A&M University
Distributed Optimization, Multi-Agent RL, Riemannian Optimization, Federated Learning
Ning Ding
Department of Electronic Engineering, Tsinghua University
Bowen Zhou
Shanghai AI Laboratory