Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability

📅 2025-06-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing LRP-based interpretability methods for Transformers neglect positional encodings (PEs), leading to attribution distortion, violation of conservation constraints, and loss of joint position-token correlations. To address this, we propose the first position-aware LRP framework: modeling inputs as position-token pairs to jointly attribute lexical semantics and positional structure; deriving theoretically consistent backward propagation rules tailored to mainstream PEs, including Rotary, Learnable, and Absolute encodings; and incorporating conservation-aware optimization into the backward pass. Our method fills a critical gap in structured attribution for Transformers. Extensive experiments demonstrate significant improvements over state-of-the-art baselines across both vision and NLP tasks. Notably, it exhibits strong generalization on zero-shot large language models such as LLaMA-3. The complete implementation is publicly released.
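The paper's exact propagation rules are not reproduced here, but the core idea of attributing to position-token pairs can be illustrated with the standard LRP sum rule: when an embedding is an additive combination x = e_tok + e_pos (as with Absolute or Learnable PE), the relevance arriving at x can be split between the two addends in proportion to their contributions. A minimal sketch, with all names hypothetical:

```python
import numpy as np

def split_relevance(e_tok, e_pos, R, eps=1e-9):
    """Divide relevance R at x = e_tok + e_pos between the token and
    position components, proportionally to each addend's contribution
    (standard LRP sum rule; eps stabilizes near-zero denominators)."""
    x = e_tok + e_pos
    denom = x + eps * np.sign(x)
    R_tok = e_tok / denom * R
    R_pos = e_pos / denom * R
    return R_tok, R_pos

rng = np.random.default_rng(1)
e_tok = rng.standard_normal(8)   # token embedding (illustrative)
e_pos = rng.standard_normal(8)   # additive positional embedding
R = rng.random(8)                # relevance arriving at the sum

R_tok, R_pos = split_relevance(e_tok, e_pos, R)
# Conservation per dimension: R_tok + R_pos recovers R (up to eps)
print(np.allclose(R_tok + R_pos, R, atol=1e-6))
```

This split is what makes positional relevance visible as its own signal instead of being silently folded into the token attribution; rotary PE, which mixes dimensions multiplicatively, would need a different rule.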

๐Ÿ“ Abstract
The development of effective explainability tools for Transformers is a crucial pursuit in deep learning research. One of the most promising approaches in this domain is Layer-wise Relevance Propagation (LRP), which propagates relevance scores backward through the network to the input space by redistributing activation values according to predefined rules. However, existing LRP-based methods for Transformer explainability entirely overlook a critical component of the architecture: its positional encoding (PE). This omission violates the conservation property and discards an important and distinct type of relevance, one associated with structural and positional features. To address this limitation, we reformulate the input space for Transformer explainability as a set of position-token pairs. This allows us to propose specialized, theoretically grounded LRP rules that propagate attributions across various positional encoding methods, including Rotary, Learnable, and Absolute PE. Extensive experiments with both fine-tuned classifiers and zero-shot foundation models, such as LLaMA 3, demonstrate that our method significantly outperforms the state of the art in both vision and NLP explainability tasks. Our code is publicly available.
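To make the abstract's two key notions concrete, here is a minimal sketch of the classic LRP epsilon-rule for a single linear layer, showing how relevance is redistributed in proportion to each input's contribution and why the conservation property holds (total relevance is preserved across the layer). This is the generic rule, not the paper's position-aware variant; all names are illustrative:

```python
import numpy as np

def lrp_epsilon(x, W, b, R_out, eps=1e-6):
    """Propagate relevance R_out back through y = x @ W + b.
    Each input i receives relevance proportional to its
    contribution z_ij = x_i * W_ij to each output j."""
    z = x @ W + b                       # forward pre-activations
    s = R_out / (z + eps * np.sign(z))  # stabilized relevance ratio
    R_in = x * (W @ s)                  # redistribute along contributions
    return R_in

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 3))
b = np.zeros(3)          # with zero bias, conservation is exact up to eps
R_out = rng.random(3)

R_in = lrp_epsilon(x, W, b, R_out)
# Conservation property: sum of input relevance equals sum of output relevance
print(np.allclose(R_in.sum(), R_out.sum(), atol=1e-4))
```

The paper's observation is that when a PE term enters the network but receives no such rule, the relevance it should carry is either lost or misassigned to tokens, which is exactly the conservation violation described above.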
Problem

Research questions and friction points this paper is trying to address.

Existing LRP-based explainability methods entirely ignore the Transformer's positional encoding
As a result, current approaches violate the conservation property and lose relevance tied to positional structure
How can LRP rules be extended to position-token pairs so that positional relevance is propagated faithfully?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates the input space as a set of position-token pairs
Derives specialized, theoretically grounded LRP rules for Rotary, Learnable, and Absolute positional encodings
Significantly outperforms state-of-the-art baselines on both vision and NLP explainability tasks