🤖 AI Summary
This work addresses catastrophic forgetting in vision transformers during continual learning, which it traces to attention drift across tasks. To mitigate this issue, the study introduces a selective attention mechanism inspired by the human visual system into the continual learning framework. Specifically, it proposes an attention-map-based gradient masking approach that uses layer-wise attention rollout to generate attention maps for previously learned tasks. These maps are used to construct instance-adaptive binary masks that selectively suppress gradient updates, during backpropagation, in regions corresponding to already-learned visual concepts. Combined with a parameter-update scaling strategy, the method stabilizes learned representations. Extensive experiments demonstrate that the proposed approach significantly alleviates forgetting across diverse continual learning scenarios, achieving state-of-the-art performance while preserving critical visual features.
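The mask-construction step described above can be sketched as follows. This is a minimal NumPy illustration of the standard attention-rollout computation followed by thresholding into a binary mask; the function names and the threshold `tau` are illustrative assumptions, not details from the paper:

```python
import numpy as np

def attention_rollout(attn_layers):
    """Propagate attention across layers, accounting for residual connections."""
    n = attn_layers[0].shape[0]
    rollout = np.eye(n)
    for A in attn_layers:
        A_res = 0.5 * A + 0.5 * np.eye(n)           # mix in the residual path
        A_res /= A_res.sum(axis=-1, keepdims=True)  # re-normalize each row
        rollout = A_res @ rollout                   # compose with earlier layers
    return rollout

def binary_mask(attn_layers, tau=0.5):
    """Threshold CLS-to-patch rollout into an instance-adaptive binary mask.

    tau is a hypothetical relative threshold; the paper's exact rule may differ.
    """
    r = attention_rollout(attn_layers)
    cls_to_patch = r[0, 1:]                         # row 0 = CLS token
    return (cls_to_patch > tau * cls_to_patch.max()).astype(np.float32)
```

Because the rollout is computed per input image, the resulting mask differs from instance to instance, which is what makes it "instance-adaptive".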
📝 Abstract
Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, whereby attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
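Step 2 of the abstract, masking gradients and then rescaling the surviving update, can be illustrated with a toy SGD step. This is a hypothetical sketch, not the authors' implementation: the rescaling rule below (dividing by the fraction of unmasked entries) is one plausible reading of "maintain their relative magnitudes":

```python
import numpy as np

def masked_sgd_step(param, grad, mask, lr=0.01):
    """One gradient step with attention-based gradient masking.

    mask: 1.0 where the previous task attended (gradients suppressed there),
          0.0 where the new task is free to update.
    """
    keep = 1.0 - mask
    g = grad * keep                      # zero out gradients on protected regions
    frac = max(keep.mean(), 1e-8)
    g = g / frac                         # hypothetical rescaling so the overall
                                         # update magnitude is preserved
    return param - lr * g
```

Parameters covered by the mask receive no update at all, while the remaining parameters take a proportionally larger step, keeping the aggregate update size comparable to an unmasked step.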