DeepCrossAttention: Supercharging Transformer Residual Connections

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional Transformers suffer from residual dilution: simple summation of residual terms risks attenuating salient features, impairing feature reuse in deep layers. To address this, the paper proposes a dynamic residual fusion mechanism: (1) learnable, input-dependent weights that selectively weight residual terms across layers; and (2) depth-wise cross-attention that lets layers at different depths interact and focus on the most relevant earlier outputs. Theoretically, under a low-rank assumption, an analytical framework shows the method achieves a superior accuracy-model-size trade-off. Empirically, on language modeling, it significantly reduces perplexity, reaches equivalent model quality up to 3× faster, and adds a negligible number of parameters.
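The first mechanism, input-dependent weighting of residual terms, can be illustrated with a minimal NumPy sketch. The gating matrix `W_gate`, the softmax normalization, and conditioning the weights on the most recent hidden state are assumptions made for illustration, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_residual(layer_outputs, W_gate):
    """Combine previous layer outputs with input-dependent weights.

    layer_outputs: list of L arrays, each (seq, d): outputs h_0 .. h_{L-1}
    W_gate: (d, L) projection producing one logit per previous layer,
            conditioned on the current (last) hidden state. Hypothetical
            parameterization for illustration only.
    """
    H = np.stack(layer_outputs, axis=0)      # (L, seq, d)
    logits = layer_outputs[-1] @ W_gate      # (seq, L): input-dependent
    alpha = softmax(logits, axis=-1)         # per-token mixing weights
    # Weighted combination over layers instead of plain summation,
    # so each token can emphasize the most relevant earlier layer.
    return np.einsum('sl,lsd->sd', alpha, H)
```

With `W_gate = 0` the gate is uniform and the output reduces to the mean of the layer outputs; learned gates let each token deviate from that uniform mix.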

📝 Abstract
Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters. Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.
Problem

Research questions and friction points this paper is trying to address.

Simple summation in residual connections can dilute crucial information from earlier layers
Deep layers cannot selectively reuse the most relevant features from specific earlier layers
Improving the accuracy and model-size trade-off without significant parameter overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeepCrossAttention (DCA) replaces summed residuals with dynamic combinations of layer outputs
Learnable, input-dependent weights select the most relevant previous layers per input
Depth-wise cross-attention enables richer interactions between layers at different depths