🤖 AI Summary
This work investigates how Transformers achieve compositional generalization, i.e., correct reasoning over novel combinations of primitives whose compositions were unseen during training. The authors reformulate multi-head attention as a hypernetwork in which low-dimensional latent codes dynamically generate key-query-specific operations, including nonlinear value mappings, enabling the model to recompose learned primitives into new configurations. These latent codes are semantically interpretable and reusable across tasks, and the latent space exhibits functional structure. Evaluated on a symbolic version of the Raven's Progressive Matrices benchmark, the approach significantly improves abstract-reasoning accuracy and shows that model scale and data volume jointly drive compositional generalization. The core contribution is a hypernetwork-based reparameterization of attention that yields an interpretable, mechanistic account of compositional generalization in Transformers.
📝 Abstract
Transformers can, under some circumstances, generalize to novel problem instances whose constituent parts might have been encountered during training but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query-specific operations. We find empirically that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances. To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we test whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality. We find that this modification improves compositional generalization on abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven's Progressive Matrices human intelligence test, which gives us precise control over the problem compositions encountered during training and evaluation. We demonstrate on this task how scaling model size and data enables compositional generalization in Transformers and gives rise to a functionally structured latent space.
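The reformulation described above can be illustrated concretely: for each query-key pair, the vector of attention scores collected across heads acts as a latent code that linearly combines head-specific value-to-output maps into a single generated linear network applied to the key's token. Below is a minimal NumPy sketch of this equivalence for standard (linear-value) multi-head attention; all dimensions, weight initializations, and variable names are hypothetical illustrations, not the paper's actual architecture or its nonlinear variant.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 4, 8, 2          # tokens, model dim, heads (hypothetical sizes)
Dh = D // H                # per-head dimension

X = rng.normal(size=(T, D))
Wq = rng.normal(size=(H, D, Dh)) / np.sqrt(D)
Wk = rng.normal(size=(H, D, Dh)) / np.sqrt(D)
Wv = rng.normal(size=(H, D, Dh)) / np.sqrt(D)
Wo = rng.normal(size=(H, Dh, D)) / np.sqrt(Dh)  # per-head slice of output projection

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# --- Standard multi-head attention ---
Q = np.einsum('td,hde->hte', X, Wq)
K = np.einsum('td,hde->hte', X, Wk)
V = np.einsum('td,hde->hte', X, Wv)
A = softmax(np.einsum('hte,hse->hts', Q, K) / np.sqrt(Dh))  # (H, T, T)
out_mha = np.einsum('hts,hse,hed->td', A, V, Wo)

# --- Equivalent hypernetwork view ---
# For each query-key pair (i, j), the attention scores across heads,
# A[:, i, j], form a latent code that mixes the head-specific value maps
# Wv[h] @ Wo[h] into one linear network applied to token x_j.
M = np.einsum('hde,hef->hdf', Wv, Wo)  # per-head value->output map, (H, D, D)
out_hyper = np.zeros_like(X)
for i in range(T):
    for j in range(T):
        W_ij = np.einsum('h,hdf->df', A[:, i, j], M)  # generated linear net
        out_hyper[i] += X[j] @ W_ij

assert np.allclose(out_mha, out_hyper)
```

In this view, the attention scores play the role of a composable latent code, and the paper's modification amounts to inserting a nonlinearity into the generated value network rather than keeping it purely linear.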