🤖 AI Summary
Existing multi-head attention variants (MHA, GQA, GTA) suffer from two key limitations: (1) increasing the number of heads dilutes individual head capacity, and (2) head outputs are merely concatenated without feature-level interaction. This paper proposes *knocking-heads attention* (KHA), an attention mechanism that introduces a learnable, cross-head *pre-attention interaction* applied before the scaled dot-product attention. Specifically, it employs a shared, diagonally initialized projection matrix to enable efficient inter-head feature fusion while preserving head-specific specialization. The design is fully compatible with mainstream multi-head architectures and incurs negligible additional parameters and computational overhead. Evaluated on a 6.1B-parameter Mixture-of-Experts model, KHA yields more stable training dynamics and consistent performance gains across multiple downstream tasks over strong baselines. Its core contribution is a novel *pre-attention inter-head interaction paradigm* that enhances the representational capacity and generalization of multi-head attention at minimal cost.
📄 Abstract
Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms, whether standard MHA or variants such as grouped-query attention (GQA) and grouped-tied attention (GTA), simply concatenate the outputs of isolated heads without any feature-level interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B-parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA yields superior and more stable training dynamics and achieves better performance across downstream tasks.
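A minimal sketch of the mechanism described above, assuming the shared cross-head projection is applied to the per-head queries, keys, and values before standard scaled dot-product attention (the function names, shapes, and exact placement of the projection are illustrative assumptions, not the paper's precise formulation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def knock(x, P):
    # Cross-head feature fusion: flatten heads, apply the shared
    # projection P over the concatenated head features, split back.
    # x: (seq, n_heads, d_head); P: (n_heads*d_head, n_heads*d_head)
    s, h, d = x.shape
    return (x.reshape(s, h * d) @ P).reshape(s, h, d)

def kha(q, k, v, P):
    # "Knock" the heads BEFORE attention, then run ordinary
    # per-head scaled dot-product attention.
    q, k, v = knock(q, P), knock(k, P), knock(v, P)
    d = q.shape[-1]
    scores = np.einsum('ihd,jhd->hij', q, k) / np.sqrt(d)  # (h, seq, seq)
    attn = softmax(scores, axis=-1)
    return np.einsum('hij,jhd->ihd', attn, v)  # (seq, n_heads, d_head)

rng = np.random.default_rng(0)
seq, n_heads, d_head = 5, 4, 8
D = n_heads * d_head
q = rng.standard_normal((seq, n_heads, d_head))
k = rng.standard_normal((seq, n_heads, d_head))
v = rng.standard_normal((seq, n_heads, d_head))
P = np.eye(D)  # diagonal (identity) init: at the start of training each
               # head passes through unchanged, preserving specialization
out = kha(q, k, v, P)
```

With `P` initialized to the identity, `knock` is a no-op and KHA reduces exactly to standard multi-head attention; training can then move `P` away from the diagonal to learn integrated cross-head representations, at a cost of one extra D×D matrix shared by all heads.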