🤖 AI Summary
This work addresses inefficiencies in multi-head attention, which is typically trained as a single monolithic optimizer, overlooking the inter-head competition and coordination that constitute an implicit multi-agent game. The study formally models multi-head attention as an implicit potential game and shows that gradient descent converges to Nash equilibria that are inefficient due to unpriced externalities. To mitigate this, the authors improve game efficiency by regulating the off-diagonal entries of the inter-head interaction matrix. Their approach combines interaction-matrix analysis, Barlow Twins decorrelation, a log-determinant coordination constraint, and a novel GAME-LoRA fine-tuning strategy. Experiments show that the proposed game-theoretic metric Γ(G) significantly predicts hallucination (p<0.05), and that GAME-LoRA reduces hallucination by 8% on average (up to 18%) without compromising knowledge retention, achieving a Pareto improvement.
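To make the central quantity concrete: Γ(G) is described as the off-diagonal mass of a head interaction matrix capturing coupling between heads. A minimal numpy sketch of one plausible instantiation is below; the cosine-similarity construction of G and the L1 normalization of the off-diagonal mass are assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def head_interaction_matrix(head_params: np.ndarray) -> np.ndarray:
    """Hypothetical interaction matrix: cosine coupling between heads.

    head_params: (n_heads, d) array of flattened per-head weights
    (or per-head gradients, for gradient coupling)."""
    X = head_params / np.linalg.norm(head_params, axis=1, keepdims=True)
    return X @ X.T  # G[i, j] = cosine similarity of head i and head j

def gamma(G: np.ndarray) -> float:
    """Off-diagonal mass of G (assumed normalization: off-diagonal L1
    mass divided by total L1 mass, so 0 = fully independent heads)."""
    off = np.abs(G).copy()
    np.fill_diagonal(off, 0.0)
    return off.sum() / max(np.abs(G).sum(), 1e-12)

# Orthogonal heads -> no off-diagonal coupling -> gamma = 0.
G_indep = head_interaction_matrix(np.eye(4))
assert abs(gamma(G_indep)) < 1e-9
```

Under this reading, the claim that regularizing off-diagonal entries tightens the Price-of-Anarchy bound corresponds to driving Γ(G) toward zero, i.e. pushing heads toward decorrelated weight/gradient directions.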
📝 Abstract
Modern transformer attention is internally multi-agent -- heads compete and coordinate -- yet we train it as if it were a monolithic optimizer. We formalize this gap: cross-entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy by $\Gamma(G)$, the off-diagonal mass of a head interaction matrix capturing weight and gradient coupling. Under mild smoothness assumptions, we prove that both \emph{excess hallucination probability} and \emph{excess head redundancy} scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces $\Gamma(G)$ provably tightens PoA. We instantiate this as GAME-LoRA, combining Barlow Twins decorrelation with log-determinant coordination pressure. Experiments validate the theory: $\Gamma(G)$ predicts hallucination ($p{<}0.05$), emergent coalitions exhibit selective coordination, and GAME-LoRA achieves up to 18\% hallucination reduction (8\% average) with no knowledge degradation -- a Pareto improvement inaccessible to methods ignoring the game structure.
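The abstract names the two ingredients of GAME-LoRA: Barlow Twins-style decorrelation plus log-determinant coordination pressure. The sketch below shows how such a combined penalty could look on per-head features; applying the Barlow Twins cross-correlation loss across heads (rather than across augmented views) and the specific λ weights are my assumptions, not the paper's implementation.

```python
import numpy as np

def game_lora_penalty(feats: np.ndarray, lam: float = 5e-3,
                      eps: float = 1e-6) -> float:
    """Hypothetical GAME-LoRA-style penalty on per-head features.

    feats: (batch, n_heads) scalar summary per head per example
    (assumed shape for illustration)."""
    # Standardize each head's feature over the batch.
    Z = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)
    C = (Z.T @ Z) / feats.shape[0]  # (n_heads, n_heads) cross-correlation

    # Barlow Twins-style terms: keep each head informative (diag -> 1),
    # penalize redundancy between heads (off-diag -> 0).
    on_diag = ((np.diag(C) - 1.0) ** 2).sum()
    off = C.copy()
    np.fill_diagonal(off, 0.0)
    off_diag = (off ** 2).sum()

    # Log-determinant coordination pressure: -log det(C) blows up as
    # heads become linearly dependent, rewarding a diverse head basis.
    logdet_term = -np.linalg.slogdet(C + eps * np.eye(C.shape[0]))[1]

    return on_diag + lam * off_diag + lam * logdet_term

rng = np.random.default_rng(0)
independent = rng.standard_normal((2000, 4))
redundant = np.repeat(independent[:, :1], 4, axis=1)  # 4 identical heads
assert game_lora_penalty(redundant) > game_lora_penalty(independent)
```

The final assertion illustrates the intended behavior: fully redundant heads (a rank-one correlation matrix) incur a much larger penalty than independent ones, which is the pressure that, per the paper's bound, should shrink Γ(G) and with it the Price of Anarchy.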