Associative Transformer

πŸ“… 2023-09-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Sparse attention mechanisms suffer from limited parameter efficiency and inadequate capacity for complex relational reasoningβ€”e.g., visual relationship modeling. To address this, we propose a lightweight Vision Transformer (ViT) architecture augmented with associative memory. Our method introduces: (1) an explicit, learnable memory mechanism guided by the Hopfield energy function to strengthen long-range dependencies among local image patches; and (2) a bottleneck attention module guided by multiple priors, drastically reducing parameters while enhancing relational reasoning capability. Evaluated on four standard image classification benchmarks and the Sort-of-CLEVR relational reasoning task, our approach consistently outperforms ViT and state-of-the-art sparse Transformers, establishing new SOTA results. At comparable accuracy, it reduces model parameters by 38–52% and cuts network depth by 40%, achieving both computational efficiency and biologically plausible memory-driven inference.
πŸ“ Abstract
Emerging from the pairwise attention in conventional Transformers, there is a growing interest in sparse attention mechanisms that align more closely with localized, contextual learning in the biological brain. Existing studies such as the Coordination method employ iterative cross-attention mechanisms with a bottleneck to enable the sparse association of inputs. However, these methods are parameter-inefficient and fail on more complex relational reasoning tasks. To this end, we propose the Associative Transformer (AiT) to enhance the association among sparsely attended input patches, improving parameter efficiency and performance in relational reasoning tasks. AiT leverages a learnable explicit memory, comprised of various specialized priors, with a bottleneck attention to facilitate the extraction of diverse localized features. Moreover, we propose a novel associative-memory-enabled patch reconstruction with a Hopfield energy function. Extensive experiments on four image classification tasks with three AiT model sizes demonstrate that AiT requires significantly fewer parameters and attention layers while outperforming Vision Transformers and a broad range of sparse Transformers. Additionally, AiT establishes new SOTA performance on the Sort-of-CLEVR dataset, outperforming the previous Coordination method.
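The abstract's "associative memory-enabled patch reconstruction with a Hopfield energy function" builds on the modern (continuous) Hopfield network, whose retrieval update moves a query toward the stored pattern it most resembles. Below is a minimal NumPy sketch of that standard retrieval rule, not the paper's actual implementation; the patterns, query, and `beta` value are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hopfield_retrieve(query, memory, beta=8.0, steps=3):
    # Modern Hopfield update: xi <- M^T softmax(beta * M xi).
    # Each step pulls the query toward the best-matching stored row of M,
    # descending the associated Hopfield energy.
    xi = query.copy()
    for _ in range(steps):
        xi = memory.T @ softmax(beta * memory @ xi)
    return xi

# Two stored patterns; a noisy query is restored to the nearer one.
memory = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
noisy = np.array([0.9, 0.1, 0.05, 0.0])
restored = hopfield_retrieve(noisy, memory)
```

With a sufficiently large `beta`, the softmax is nearly one-hot, so a corrupted patch representation snaps back to a clean stored prior in a few steps, which is the sense in which the memory "reconstructs" patches.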
Problem

Research questions and friction points this paper is trying to address.

Sparse attention mechanisms (e.g., the Coordination method) are parameter-inefficient.
Existing sparse Transformers fail on more complex relational reasoning tasks.
Sparsely attended input patches need stronger association to compete with full pairwise attention on vision tasks such as classification.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable explicit memory enhances token association.
Hopfield energy function aids token reconstruction.
Fewer parameters and layers improve efficiency.
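The bottleneck attention the paper describes replaces quadratic patch-to-patch self-attention with cross-attention through a small set of memory slots (the "priors"). The sketch below illustrates that general pattern under assumed shapes; the slot count, dimensions, and random initialization are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bottleneck_attention(patches, priors):
    # K priors act as queries over N patch tokens, so the attention
    # cost is O(N*K) rather than the O(N^2) of full self-attention.
    d = patches.shape[-1]
    scores = priors @ patches.T / np.sqrt(d)   # (K, N)
    return softmax(scores) @ patches           # (K, d) updated slots

N, K, d = 196, 8, 64                  # e.g., 196 patch tokens, 8 slots
patches = rng.standard_normal((N, d))
priors = rng.standard_normal((K, d))  # stands in for learned priors
out = bottleneck_attention(patches, priors)
```

Because K is much smaller than N, each slot is forced to summarize a subset of patches, which is where the parameter and depth savings come from.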
πŸ‘₯ Authors
Yuwei Sun (The University of Tokyo)
H. Ochiai (The University of Tokyo)
Zhirong Wu (Microsoft Research)
Stephen Lin (Microsoft Research Asia)
Ryota Kanai (Araya, Inc.)
🏷️ Topics
computer vision Β· Consciousness Β· Neuroscience Β· Information Β· Artificial Intelligence