Delta Attention Residuals

📅 2026-05-13

🤖 AI Summary
This work addresses routing collapse in deep networks: standard Attention Residuals attend over accumulated hidden states, which are highly redundant, so deeper layers struggle to select relevant information from preceding layers. The paper introduces Delta Attention Residuals, which perform attention-based routing over inter-layer deltas (the change each sublayer introduces) rather than over cumulative hidden states. Routing over deltas yields higher-contrast attention distributions and more selective cross-layer routing. The method uses the same learned softmax attention as Attention Residuals, and pretrained checkpoints can be converted to it via standard fine-tuning. Across model scales from 220M to 7.6B parameters, Delta Attention Residuals consistently outperform both standard residual connections and Attention Residuals, improving validation perplexity by 1.7% to 8.2%.
📝 Abstract
Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight $\approx 0.2$), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, i.e., the change introduced by each sublayer ($\mathbf{v}_i = \mathbf{h}_{i+1} - \mathbf{h}_i$), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight $\approx 0.6$), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M to 7.6B parameters), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with validation perplexity gains of 1.7% to 8.2%. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
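The difference between routing over cumulative states and routing over deltas can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single query vector `query`, the scaled dot-product scoring, and the random toy data are all assumptions made for illustration, and the helper names are hypothetical. The only parts taken from the abstract are the delta definition $\mathbf{v}_i = \mathbf{h}_{i+1} - \mathbf{h}_i$ and the use of softmax attention over per-layer candidates.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def deltas_from_states(states):
    # states: (L + 1, d) hidden states h_0 .. h_L.
    # Returns (L, d) deltas with v_i = h_{i+1} - h_i, as defined in the abstract.
    return states[1:] - states[:-1]

def route(candidates, query):
    # Softmax attention over depth. The scaled dot-product score and the
    # single query vector are assumptions, not the paper's parameterization.
    scores = candidates @ query / np.sqrt(candidates.shape[1])
    weights = softmax(scores)
    return weights @ candidates, weights

def mean_pairwise_cosine(x):
    # Average cosine similarity between distinct rows; a rough redundancy measure.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(x)
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

# Toy data (assumption): each sublayer adds an independent random update.
rng = np.random.default_rng(0)
d, num_sublayers = 256, 8
h0 = rng.normal(size=d)
updates = rng.normal(size=(num_sublayers, d))
states = np.vstack([h0, h0 + np.cumsum(updates, axis=0)])  # h_0 .. h_L
deltas = deltas_from_states(states)                        # v_0 .. v_{L-1}

query = rng.normal(size=d)  # stand-in for a learned routing query
mixed_states, w_states = route(states, query)
mixed_deltas, w_deltas = route(deltas, query)

print("mean cosine, cumulative states:", mean_pairwise_cosine(states))
print("mean cosine, deltas:           ", mean_pairwise_cosine(deltas))
```

In this toy setup the cumulative states share most of their content with one another, so their pairwise cosine similarity is much higher than that of the deltas. That is the redundancy the abstract refers to; the sketch does not attempt to reproduce the reported max-weight figures, which depend on trained models.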
Problem

Research questions and friction points this paper is trying to address.

Attention Residuals
routing collapse
redundancy
layer-wise representations
attention weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Delta Attention Residuals
attention routing
layer-wise deltas
residual connections
transformer architecture