🤖 AI Summary
This work identifies a mechanism behind noisy attention maps in Vision Transformers (ViTs): in models such as CLIP and DINOv2, a sparse set of neurons concentrates high-norm activations on outlier tokens, which produces irregular attention patterns. To address this, the authors propose test-time registers, a training-free method that appends an additional untrained token and shifts the high-norm activations of the identified register neurons onto it, yielding cleaner attention maps. The results suggest that a model pre-trained without registers can exhibit register-like behavior at test time, with no retraining. The method produces cleaner attention and feature maps and improves on the base models across multiple downstream vision tasks, with results comparable to models explicitly trained with register tokens. It also extends to off-the-shelf vision-language models, where it improves interpretability without fine-tuning.
📝 Abstract
We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers: the emergence of high-norm tokens that lead to noisy attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models to improve their interpretability. Our results suggest that test-time registers effectively take on the role of register tokens at test time, offering a training-free solution for any pre-trained model released without them.
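To make the shifting step more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. The abstract does not say how the activations are moved, so several details are assumptions made for illustration: the function names `shift_to_register` and `token_norms`, the toy activation matrix, and the rule of copying each register neuron's maximum activation into the appended token while zeroing that neuron on every original token. The indices of the register neurons are taken as given; finding them is outside the scope of this sketch.

```python
from typing import List, Sequence


def shift_to_register(
    acts: List[List[float]], register_neurons: Sequence[int]
) -> List[List[float]]:
    """Return a copy of `acts` (tokens x neurons) with one extra token appended.

    Assumed rule, for illustration only: for every register neuron, the
    largest activation across the original tokens is copied onto the new
    token, and that neuron is zeroed on all original tokens.
    """
    n_neurons = len(acts[0])
    shifted = [row[:] for row in acts]      # leave the input untouched
    register_row = [0.0] * n_neurons        # the additional untrained token

    for j in register_neurons:
        peak = max(row[j] for row in acts)  # high-norm activation to relocate
        register_row[j] = peak
        for row in shifted:
            row[j] = 0.0

    shifted.append(register_row)
    return shifted


def token_norms(acts: List[List[float]]) -> List[float]:
    """L2 norm of each token's activation vector."""
    return [sum(v * v for v in row) ** 0.5 for row in acts]


# Toy example: 3 patch tokens, 3 neurons; neuron 1 spikes on patch token 1.
patch_acts = [
    [0.1, 0.2, 0.0],
    [0.3, 50.0, 0.1],
    [0.2, 0.1, 0.4],
]
shifted_acts = shift_to_register(patch_acts, register_neurons=[1])

print("norms before:", token_norms(patch_acts))
print("norms after: ", token_norms(shifted_acts))
```

In this toy setting the outlier norm moves from the second patch token to the appended token, which is the qualitative effect the abstract describes. In a real model the intervention would act on intermediate activations during the forward pass rather than on a static matrix.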