🤖 AI Summary
This paper addresses the lack of explicit modeling and dynamic regulation of attention in Transformers. Inspired by the Attention Schema Theory (AST) from cognitive science, the authors propose ASAC, the first attention management framework to integrate this theory into deep learning. ASAC employs a Vector Quantized Variational Autoencoder (VQ-VAE) to construct learnable Attention Abstractors and Controllers, enabling explicit representation, dynamic optimization, and hierarchical regulation of attention resources. The method significantly improves learning efficiency (faster convergence), generalization (higher classification accuracy), and robustness (greater resilience to noise, out-of-distribution inputs, and adversarial attacks), while also supporting efficient few-shot learning and multi-task transfer. Extensive experiments demonstrate consistent performance gains across both vision and natural language tasks, along with improved model interpretability.
📝 Abstract
Attention mechanisms have become integral to AI, significantly enhancing model performance and scalability by drawing inspiration from human cognition. Concurrently, the Attention Schema Theory (AST) in cognitive science posits that individuals manage their attention by building a model of attention itself, allowing them to allocate cognitive resources effectively. Inspired by AST, we introduce ASAC (Attention Schema-based Attention Control), which integrates the attention schema concept into artificial neural networks. Our initial experiments focus on embedding the ASAC module within transformer architectures. The module employs a Vector Quantized Variational Autoencoder (VQ-VAE) as both an attention abstractor and controller, enabling precise attention management. By explicitly modeling attention allocation, our approach aims to improve system efficiency. We demonstrate ASAC's effectiveness in both the vision and NLP domains: experiments with vision transformers across various datasets show that the attention controller not only boosts classification accuracy but also accelerates learning. We further demonstrate the model's robustness and generalization on noisy and out-of-distribution datasets, as well as improved performance in multi-task settings. Preliminary experiments also indicate that the attention schema-based module enhances resilience to adversarial attacks, optimizes attention to improve learning efficiency, and facilitates effective transfer learning and learning from fewer examples. These promising results establish a connection between cognitive science and machine learning, shedding light on the efficient use of attention mechanisms in AI systems.
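To make the described architecture concrete, the sketch below shows one plausible way such a module could be wired in PyTorch: the attention map of a transformer block is compressed by an encoder (the abstractor), quantized against a learned codebook (the schema), and decoded into a control signal that re-weights the original attention (the controller). All class and parameter names, tensor shapes, and the specific way the control signal is applied are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal, illustrative PyTorch sketch of an attention-schema-style controller.
# Module names, shapes, and the re-weighting scheme are assumptions for
# illustration only; the paper's actual ASAC implementation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionSchemaController(nn.Module):
    """Compresses attention maps into a discrete 'schema' via vector
    quantization, then decodes a control signal that re-weights attention."""

    def __init__(self, seq_len: int, schema_dim: int = 64, codebook_size: int = 128):
        super().__init__()
        self.encoder = nn.Linear(seq_len, schema_dim)        # attention abstractor
        self.codebook = nn.Embedding(codebook_size, schema_dim)
        self.decoder = nn.Linear(schema_dim, seq_len)         # attention controller

    def forward(self, attn: torch.Tensor):
        # attn: (batch, heads, seq_len, seq_len) softmax attention weights
        z = self.encoder(attn)                                # continuous schema
        # Nearest-codebook lookup (the VQ step)
        dist = torch.cdist(z.flatten(0, 2), self.codebook.weight)
        codes = dist.argmin(dim=-1)
        z_q = self.codebook(codes).view_as(z)
        # Straight-through estimator so gradients flow back to the encoder
        z_q = z + (z_q - z).detach()
        # Decode a control signal and re-normalize the attention map
        control = self.decoder(z_q)
        controlled_attn = F.softmax(torch.log(attn + 1e-9) + control, dim=-1)
        # Standard VQ-VAE codebook and commitment losses
        vq_loss = F.mse_loss(z_q, z.detach()) + F.mse_loss(z_q.detach(), z)
        return controlled_attn, vq_loss
```

In a vision transformer, the returned `controlled_attn` could replace the raw softmax attention inside each block, and `vq_loss` would be added to the task loss so that the schema codebook is trained jointly with the backbone.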