Does Self-Attention Need Separate Weights in Transformers?

📅 2024-11-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost, weak directional modeling, and parameter redundancy of self-attention, this paper validates the feasibility of sharing a single weight matrix across the key (K), query (Q), and value (V) projections. The proposed BERT variant replaces the three conventional weight matrices in the self-attention module with one shared matrix. This design reduces self-attention parameters by 66.53% and training time by roughly one-tenth. On the GLUE benchmark, the model achieves a 0.38% absolute accuracy gain over standard BERT and shows greater robustness to noisy data and stronger generalization on cross-domain transfer tasks. The core contribution is empirical evidence that K/Q/V weight sharing is not only viable but can improve efficiency and accuracy at the same time, suggesting a path toward lighter-weight Transformer architectures.
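The idea above can be sketched in a few lines. This is a minimal single-head illustration, not the authors' released code: one matrix `w_shared` is applied once and its output serves as query, key, and value, where standard attention would learn three separate projections. Function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_qkv_attention(x, w_shared, w_out):
    """Single-head shared-weight self-attention sketch.

    x        : (seq_len, d) input token representations
    w_shared : (d, d) one matrix reused for Q, K, and V
    w_out    : (d, d) output projection
    """
    h = x @ w_shared          # one projection serves all three roles
    q = k = v = h
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)   # scaled dot-product scores
    attn = softmax(scores)          # each row sums to 1
    return attn @ v @ w_out

# Usage: attend over a toy sequence of 5 tokens with d = 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
w_shared = rng.standard_normal((8, 8)) * 0.1
w_out = rng.standard_normal((8, 8)) * 0.1
out = shared_qkv_attention(x, w_shared, w_out)
```

Because Q and K come from the same projection, the score matrix here is symmetric, which is one reason the paper compares against symmetric and pairwise attention baselines.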

📝 Abstract
The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent directionality. This work introduces a shared-weight self-attention-based BERT model that learns a single weight matrix for the Key, Value, and Query representations instead of three individual matrices. Our shared-weight attention reduces the training parameter size by more than half and training time by around one-tenth. Furthermore, we demonstrate higher prediction accuracy on small GLUE tasks over the BERT baseline, and in particular stronger generalization on noisy and out-of-domain data. Experimental results indicate that our shared self-attention method achieves a parameter size reduction of 66.53% in the attention block. On the GLUE benchmark, the shared-weight self-attention-based BERT model demonstrates accuracy improvements of 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models, respectively. The model and source code are available at Anonymous.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational complexity in self-attention mechanisms
Improving efficiency with shared weight matrices for Key, Value, Query
Enhancing accuracy and generalization on noisy or out-of-domain data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared weight matrix for Key, Value, Query
Reduces training parameters by over half
Improves accuracy on GLUE benchmark tasks
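The headline parameter reduction admits a quick back-of-envelope check. Assuming BERT-base's hidden size of 768 (an assumption for illustration; the paper's exact accounting may include biases or other attention-block terms), dropping two of the three projection matrices removes two-thirds of the projection parameters, close to the reported 66.53%:

```python
d = 768                  # BERT-base hidden size (assumed for illustration)
standard = 3 * d * d     # separate W_Q, W_K, W_V projection matrices
shared = 1 * d * d       # one shared matrix serving all three roles
reduction = 1 - shared / standard
print(f"projection-parameter reduction: {reduction:.2%}")  # 66.67%
```

The small gap between 66.67% and the paper's 66.53% presumably comes from parameters in the attention block that are not shared, such as bias terms.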