🤖 AI Summary
This work investigates how Transformers achieve compositional generalization, i.e., correct reasoning over novel combinations of primitives whose compositions were unseen during training. The authors reformulate multi-head attention as a hypernetwork in which low-dimensional latent codes dynamically generate key-query-specific operations, including nonlinear value mappings, enabling the model to recompose learned primitives into new configurations. These latent codes are semantically interpretable and reusable across tasks, and the latent space exhibits functional structure. Evaluated on a symbolic version of the Raven's Progressive Matrices benchmark, the approach significantly improves abstract-reasoning accuracy and shows that model scale and data volume jointly drive compositional generalization. The core contribution is a hypernetwork-based reparameterization of attention that yields an interpretable, mechanistic account of compositional generalization in Transformers.
📝 Abstract
Transformers can, under some circumstances, generalize to novel problem instances whose constituent parts might have been encountered during training but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query-specific operations. We find empirically that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances. To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we test whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality. We find that this modification improves compositional generalization on abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven's Progressive Matrices human intelligence test, which gives us precise control over the problem compositions encountered during training and evaluation. We demonstrate on this task how scaling model size and data enables compositional generalization in Transformers and gives rise to a functionally structured latent space.
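The reformulation described above can be illustrated concretely: for each query-key pair, the vector of attention scores collected across heads acts as a latent code that linearly combines head-specific value-to-output maps into a single generated linear network applied to the key's token. Below is a minimal NumPy sketch of this equivalence for standard (linear-value) multi-head attention; all dimensions, weight initializations, and variable names are hypothetical illustrations, not the paper's actual architecture or its nonlinear variant.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 4, 8, 2          # tokens, model dim, heads (hypothetical sizes)
Dh = D // H                # per-head dimension

X = rng.normal(size=(T, D))
Wq = rng.normal(size=(H, D, Dh)) / np.sqrt(D)
Wk = rng.normal(size=(H, D, Dh)) / np.sqrt(D)
Wv = rng.normal(size=(H, D, Dh)) / np.sqrt(D)
Wo = rng.normal(size=(H, Dh, D)) / np.sqrt(Dh)  # per-head slice of output projection

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# --- Standard multi-head attention ---
Q = np.einsum('td,hde->hte', X, Wq)
K = np.einsum('td,hde->hte', X, Wk)
V = np.einsum('td,hde->hte', X, Wv)
A = softmax(np.einsum('hte,hse->hts', Q, K) / np.sqrt(Dh))  # (H, T, T)
out_mha = np.einsum('hts,hse,hed->td', A, V, Wo)

# --- Equivalent hypernetwork view ---
# For each query-key pair (i, j), the attention scores across heads,
# A[:, i, j], form a latent code that mixes the head-specific value maps
# Wv[h] @ Wo[h] into one linear network applied to token x_j.
M = np.einsum('hde,hef->hdf', Wv, Wo)  # per-head value->output map, (H, D, D)
out_hyper = np.zeros_like(X)
for i in range(T):
    for j in range(T):
        W_ij = np.einsum('h,hdf->df', A[:, i, j], M)  # generated linear net
        out_hyper[i] += X[j] @ W_ij

assert np.allclose(out_mha, out_hyper)
```

In this view, the attention scores play the role of a composable latent code, and the paper's modification amounts to inserting a nonlinearity into the generated value network rather than keeping it purely linear.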