Knocking-Heads Attention

πŸ“… 2025-10-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing multi-head attention variants (MHA, GQA, GTA) suffer from two key limitations: (1) increasing the number of heads dilutes individual head capacity, and (2) head outputs are merely concatenated without feature-level interaction. This paper proposes *Knocking-Heads Attention* (KHA), the first attention mechanism to introduce a learnable cross-head interaction applied before the attention computation. Specifically, it employs a shared, diagonally initialized projection matrix to enable efficient inter-head feature fusion while preserving head specialization. The design is fully compatible with mainstream multi-head architectures and incurs negligible additional parameters and computational overhead. Evaluated on a 6.1B-parameter Mixture-of-Experts model trained on 1T tokens, KHA yields more stable training dynamics and significant performance gains over strong baselines across multiple downstream tasks. Its core contribution is a novel *pre-attention inter-head interaction paradigm* that enhances the representational capacity and generalization of multi-head attention at minimal cost.

πŸ“ Abstract
Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms, whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA), simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.
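The mechanism described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the shared projection mixes features along the head axis of the per-head Q/K/V tensors, and the function name `knock_heads` and the exact tensor layout are illustrative choices.

```python
import numpy as np

def knock_heads(x, mix):
    """Cross-head feature mixing applied before scaled dot-product attention.

    x:   (seq_len, num_heads, head_dim) per-head projections (e.g. Q, K, or V)
    mix: (num_heads, num_heads) shared mixing matrix, diagonally initialized
    """
    # Mix features across the head axis; head_dim is left untouched.
    return np.einsum('shd,hg->sgd', x, mix)

seq_len, num_heads, head_dim = 4, 8, 16
rng = np.random.default_rng(0)
q = rng.standard_normal((seq_len, num_heads, head_dim))

# Diagonal (identity) initialization: at the start of training each head
# sees only its own features, so KHA reduces to standard MHA and head
# specialization is preserved; training then learns off-diagonal mixing.
mix = np.eye(num_heads)
assert np.allclose(knock_heads(q, mix), q)
```

Because the mixing matrix is shared across heads and is only `num_heads x num_heads`, the added parameters and FLOPs are negligible next to the attention projections themselves, which is consistent with the paper's claim of minimal overhead.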
Problem

Research questions and friction points this paper is trying to address.

Enhancing cross-head feature interactions in attention mechanisms
Overcoming capacity weakening from increasing attention heads
Improving training stability and downstream task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knocking-heads attention enables cross-head feature interactions
Shared diagonal projection matrix preserves head specialization
Minimal parameter increase integrates with existing attention variants
πŸ”Ž Similar Papers
No similar papers found.