Interleaved Head Attention

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited inter-head interaction in standard multi-head attention, which hinders the ability to model complex multi-step dependencies during reasoning. To overcome this, the authors propose a pseudo-head cross-interaction mechanism that generates multiple pseudo-heads for each attention head via learnable linear combinations, thereby enabling cross-head information mixing with negligible parameter overhead. Theoretical analysis demonstrates improved parameter efficiency, while empirical results validate its effectiveness: the method achieves gains of 10–20% on the RULER multi-key retrieval benchmark and improves accuracy by 5.8% on GSM8K and by 2.8% (with majority voting) on MATH-500, confirming its enhanced capacity for complex reasoning.

📝 Abstract
Multi-Head Attention (MHA) is the core computational primitive underlying modern Large Language Models (LLMs). However, MHA suffers from a fundamental linear scaling limitation: $H$ attention heads produce exactly $H$ independent attention matrices, with no communication between heads during attention computation. This becomes problematic for multi-step reasoning, where correct answers depend on aggregating evidence from multiple parts of the context and composing latent token-to-token relations over a chain of intermediate inferences. To address this, we propose Interleaved Head Attention (IHA), which enables cross-head mixing by constructing $P$ pseudo-heads per head (typically $P=H$), where each pseudo query/key/value is a learned linear combination of all $H$ original queries, keys, and values respectively. Interactions between pseudo-query and pseudo-key heads induce up to $P^2$ attention patterns per head with modest parameter overhead $\mathcal{O}(H^2P)$. We provide theory showing improved efficiency in terms of number of parameters on the synthetic Polynomial task (IHA uses $\Theta(\sqrt{k}n^2)$ parameters vs. $\Theta(kn^2)$ for MHA) and on the synthetic order-sensitive CPM-3 task (IHA uses $\lceil\sqrt{N_{\max}}\rceil$ heads vs. $N_{\max}$ for MHA). On real-world benchmarks, IHA improves Multi-Key retrieval on RULER by 10–20% (4k–16k) and, after fine-tuning for reasoning on OpenThoughts, improves GSM8K by 5.8% and MATH-500 by 2.8% (Majority Vote) over full attention.
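The mechanism in the abstract can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the shapes, the mixing-weight tensors `Wq`/`Wk`/`Wv`, and the final averaging over pseudo-heads are assumptions chosen to match the description (each pseudo query/key/value is a learned linear combination of all $H$ original heads, giving $P^2$ attention patterns per head at $\mathcal{O}(H^2P)$ mixing parameters).

```python
import numpy as np

def interleaved_head_attention(Q, K, V, Wq, Wk, Wv):
    """Hedged sketch of IHA-style cross-head mixing.

    Q, K, V : (H, T, d)  per-head queries/keys/values
    Wq, Wk, Wv : (H, P, H)  learned mixing weights: for each original
        head h, P pseudo-heads, each a linear combination of all H
        original heads -- O(H^2 P) extra parameters per projection.
    Returns : (H, T, d)  per-head outputs (pseudo-heads averaged out;
        the paper's actual aggregation may differ).
    """
    H, T, d = Q.shape
    # Pseudo queries/keys/values: mix information across all H heads.
    Qp = np.einsum('hpg,gtd->hptd', Wq, Q)   # (H, P, T, d)
    Kp = np.einsum('hpg,gtd->hptd', Wk, K)
    Vp = np.einsum('hpg,gtd->hptd', Wv, V)
    # Pseudo-query x pseudo-key interactions: P^2 attention maps per head.
    scores = np.einsum('hptd,hqsd->hpqts', Qp, Kp) / np.sqrt(d)  # (H,P,P,T,T)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)  # softmax over key positions s
    # Each (p, q) map attends over the pseudo-values of pseudo-head q;
    # normalize by P so summing over q keeps the output scale stable.
    P = Wq.shape[1]
    out = np.einsum('hpqts,hqsd->hptd', attn, Vp) / P  # (H, P, T, d)
    # Collapse pseudo-heads back to one output per original head.
    return out.mean(axis=1)
```

A usage note: with `P = H` the mixing weights add only `3 * H * P * H` scalars, negligible next to the `O(n d^2)` projection matrices, which is the "negligible parameter overhead" claim in the summary.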
Problem

Research questions and friction points this paper is trying to address.

Multi-Head Attention
cross-head communication
multi-step reasoning
attention heads
latent token-to-token relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved Head Attention
Multi-Head Attention
cross-head mixing
multi-step reasoning
pseudo-heads