Sparse Attention Post-Training for Mechanistic Interpretability

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor interpretability and structural redundancy of Transformer attention mechanisms. We propose a post-training sparsification method that optimizes pretrained attention weights under a constrained-loss sparsity regularization, using structural priors to uncover intrinsic causal circuits without degrading model performance. The approach simplifies the computational circuit globally: attention connections are reduced to roughly 0.3% of their original count and task-relevant circuit edges shrink by up to 100×, while the original pretraining loss is preserved. Experiments on models up to one billion parameters confirm the method's effectiveness. The core contribution is reframing sparsity as an interpretability-driven structural inductive bias, not merely an efficiency heuristic, enabling lossless global circuit simplification and explicit exposure of mechanisms.

📝 Abstract
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3\%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
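The abstract does not spell out the paper's exact objective or optimizer, but the core idea of "sparsity regularisation under a constrained-loss objective" can be sketched as a toy Lagrangian problem: L1-penalize per-edge attention gates while a dual variable keeps the task loss within a tolerance of its original value. Everything below (gate parameterization, the dual-ascent step size, the toy task loss that depends on only a few edges) is an illustrative assumption, not the paper's actual method.

```python
# Toy sketch (assumptions throughout): minimize ||g||_1 subject to
# task_loss(g) <= tol, via dual ascent on a Lagrange multiplier `lam`.
# Gates g in [0, 1] stand in for per-edge attention connectivity.

def sparsify_gates(n_edges=100, important=(0, 7, 42), steps=2000,
                   lr=0.05, tol=1e-3):
    gates = [0.5] * n_edges  # one gate per attention edge
    lam = 0.0                # dual variable enforcing the loss constraint

    def task_loss(g):
        # Toy stand-in: pretend only a few edges carry the task signal,
        # so the loss is low exactly when those gates stay open.
        return sum((1.0 - g[i]) ** 2 for i in important)

    for _ in range(steps):
        excess = task_loss(gates) - tol
        lam = max(0.0, lam + 0.1 * excess)  # dual ascent on the constraint
        for i in range(n_edges):
            grad_sparsity = 1.0  # subgradient of ||g||_1 (clip handles g=0)
            grad_task = -2.0 * (1.0 - gates[i]) if i in important else 0.0
            g = gates[i] - lr * (grad_sparsity + lam * grad_task)
            gates[i] = min(1.0, max(0.0, g))  # keep gates in [0, 1]
    return gates

gates = sparsify_gates()
kept = sum(g > 0.5 for g in gates)  # edges surviving sparsification
```

Under these assumptions the L1 pressure drives the 97 task-irrelevant gates exactly to zero, while the growing multiplier keeps the 3 task-relevant gates open, mirroring the paper's claim that aggressive sparsity can coexist with an (approximately) preserved loss.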
Problem

Research questions and friction points this paper is trying to address.

Making transformer attention sparse without sacrificing performance
Using sparsity as a structural prior for interpretability, not just efficiency
Simplifying global circuits by reducing the components and edges involved
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training sparsification of attention via constrained regularization
Retains the original pretraining loss while cutting connectivity to ~0.3% of edges
Local sparsity cascades into simpler, more interpretable global circuits