🤖 AI Summary
Whether multi-head attention possesses theoretical advantages beyond parallelism remains an open question. Method: We propose a computational-graph co-design framework that models multi-head attention as a system of feedforward directed acyclic graphs (DAGs) sharing a common sink node. By combining mixing-time analysis with minimax fidelity theory, we formally characterize inter-head diversity as a driver of cooperative acceleration in information propagation and of fidelity enhancement. Contribution/Results: Our analysis moves beyond parallelism as the sole explanation for multi-head superiority. Under strictly matched parameter budgets, empirical evaluation shows that multi-head architectures significantly outperform single-head counterparts in both information-propagation efficiency and downstream task performance, providing concrete evidence of cooperative effects. This work establishes a novel theoretical perspective and a verifiable analytical paradigm for understanding the core mechanism of Transformers.
📝 Abstract
Multi-head attention powers Transformer networks, the primary deep learning architecture behind the success of large language models (LLMs). Yet, the theoretical advantages of multi-head versus single-head attention, beyond mere parallel processing, remain underexplored. In this paper, we reframe multi-head attention as a system of potentially synergistic computational graphs, where each head functions as a feedforward directed acyclic graph (DAG) with a common sink state. We provide intuition and preliminary theoretical analysis of mixing time and minimax fidelity in this framework. Our results show that multi-head attention can synergistically enhance information propagation, yielding faster mixing times and minimax fidelity amplification under specific head-diversity conditions. Finally, we train single-head and multi-head Transformers, each with the same total number of parameters, on sequence manipulation tasks and empirically verify the predicted effects.
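The core intuition, that diverse heads cooperatively shorten information-propagation paths, can be sketched with a toy graph model. The example below is illustrative only (the head edge sets and the graph representation are hypothetical, not taken from the paper): each head is a small feedforward DAG over token nodes ending at a shared sink `"S"`, and taking the union of the heads' edges models the multi-head system. Hops to the sink stand in for propagation time.

```python
from collections import deque

def bfs_dist(adj, src, dst):
    """Shortest path length (in hops) from src to dst in a directed graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return dist[u]
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return float("inf")

# Hypothetical toy setup: 4 tokens (0..3) plus a shared sink node "S".
# Each head is a feedforward DAG whose edges all lead toward the sink.
head1 = {0: [1], 1: [2], 2: [3], 3: ["S"]}   # local, chain-like attention
head2 = {0: [2], 1: [3], 2: ["S"], 3: ["S"]} # strided attention

# The multi-head system: union of the individual heads' edge sets.
multi = {}
for head in (head1, head2):
    for u, vs in head.items():
        for v in vs:
            multi.setdefault(u, [])
            if v not in multi[u]:
                multi[u].append(v)

print(bfs_dist(head1, 0, "S"))  # 4 hops through the single chain-like head
print(bfs_dist(multi, 0, "S"))  # 2 hops when diverse heads cooperate (0 -> 2 -> S)
```

Neither head alone reaches the sink from token 0 in fewer than 4 hops here, but their union does in 2, mirroring the paper's claim that head diversity (not parallelism per se) drives faster mixing.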