Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition

📅 2025-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: How features at different token positions interact through Transformer attention remains poorly understood, in part because of attention superposition. Method: The paper proposes Low-Rank Sparse Attention (Lorsa), a sparse replacement model for attention layers that disentangles multi-head self-attention (MHSA) into individually comprehensible components; like sparse autoencoders (SAEs), it is a sparse dictionary learning method. Contribution/Results: Applied to Llama-3.1-8B, Lorsa finds a family of arithmetic-specific heads, each corresponding to an atomic operation, and recovers cleaner, finer-grained versions of known MHSA behaviors, including induction heads, successor heads, and attention sinks. Automated interpretability analysis indicates that Lorsa matches SAEs in interpretability while showing superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads.

📝 Abstract

We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers that disentangles the original Multi-Head Self-Attention (MHSA) into individually comprehensible components. Lorsa is designed to address the challenge of attention superposition in order to understand attention-mediated interactions between features at different token positions. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors such as induction heads, successor heads, and attention sink behavior (i.e., heavily attending to the first token). Lorsa and Sparse Autoencoders (SAEs) are both sparse dictionary learning methods applied to different Transformer components, and they lead to consistent findings in many ways. For instance, we discover a comprehensive family of arithmetic-specific Lorsa heads in Llama-3.1-8B, each corresponding to an atomic operation. Automated interpretability analysis indicates that Lorsa achieves parity with SAEs in interpretability while exhibiting superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads. We also conduct extensive experiments on architectural design ablations, the Lorsa scaling law, and error analysis.
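
As a rough illustration of the attention sink behavior mentioned in the abstract (heavily attending to the first token), the sketch below measures how much attention weight a single causal attention head places on position 0. It is not taken from the paper: the function names `causal_softmax_attention` and `first_token_mass` and the toy score matrices are hypothetical, and a real analysis would use attention patterns extracted from a model.

```python
import math

def causal_softmax_attention(scores):
    """Row-wise softmax over a square score matrix with a causal mask."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = scores[i][: i + 1]  # query i may only attend to keys 0..i
        m = max(row)              # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        # Masked (future) positions receive zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

def first_token_mass(weights):
    """Mean attention weight on key position 0.

    The first query is skipped because, under a causal mask, it can only
    attend to position 0 and would trivially contribute a weight of 1.
    """
    rows = weights[1:]
    return sum(r[0] for r in rows) / len(rows)

# Hypothetical toy scores (assumption, not data from the paper):
# a "sink-like" head gives key 0 a much larger score than any other key,
# while the baseline head scores all keys equally.
sink_scores = [[5.0 if j == 0 else 0.0 for j in range(4)] for _ in range(4)]
uniform_scores = [[0.0] * 4 for _ in range(4)]

sink_weights = causal_softmax_attention(sink_scores)
sink_mass = first_token_mass(sink_weights)
uniform_mass = first_token_mass(causal_softmax_attention(uniform_scores))

print(f"sink-like head: {sink_mass:.3f}, uniform head: {uniform_mass:.3f}")
```

A head whose first-token mass stays far above the uniform baseline across many prompts would be a candidate attention sink under this simple measure; the threshold for "far above" is a judgment call left to the analyst.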

Problem

Research questions and friction points this paper is trying to address.

Disentangling MHSA into comprehensible sparse components

Understanding attention-mediated interactions between features at different token positions

Improving interpretability and circuit discovery in Transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank Sparse Attention (Lorsa) serves as a sparse replacement model for Transformer attention layers

Lorsa disentangles MHSA into comprehensible components

Automated interpretability analysis indicates parity with SAEs in interpretability, alongside superior circuit discovery properties
