🤖 AI Summary
This work addresses the susceptibility of Vision Transformers (ViTs) to interference from semantically irrelevant background regions across various supervision paradigms, which leads to spurious semantic representations. The study identifies the root cause as a “lazy aggregation” behavior in ViTs—where global attention mechanisms, combined with coarse-grained supervision, exploit background patches as optimization shortcuts. To mitigate this issue, the authors propose a selective fusion mechanism that adaptively integrates image patch features into the CLS token, thereby suppressing background-induced artifacts. This approach consistently improves performance across twelve benchmark datasets spanning label-based supervision, text-based supervision, and self-supervised learning, offering a novel perspective for both understanding and enhancing ViT architectures.
📝 Abstract
Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks, and their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViTs use semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
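The abstract describes the fix only at a high level: patch features are integrated into the CLS token *selectively*, so that background-dominated patches contribute little. The paper does not spell out the exact parameterization here, so the following is a minimal NumPy sketch under stated assumptions: a learned gating vector (`w_gate`, hypothetical) scores each patch token with a sigmoid, patches below a threshold `tau` are dropped as background-like, and the remaining patches are aggregated into the CLS token with a residual update. The actual method may differ in how relevance is scored and fused.

```python
import numpy as np

def selective_fusion(cls_token, patch_tokens, w_gate, tau=0.5):
    """Sketch of selective patch-to-CLS fusion (hypothetical parameterization).

    cls_token:    (d,)   global [CLS] representation
    patch_tokens: (n, d) per-patch features
    w_gate:       (d,)   assumed learned gating vector
    tau:          float  relevance threshold for keeping a patch
    """
    # Sigmoid relevance score per patch; low scores model background-like patches.
    scores = 1.0 / (1.0 + np.exp(-(patch_tokens @ w_gate)))  # shape (n,)
    keep = scores > tau                                      # drop low-relevance patches
    if not keep.any():
        return cls_token                                     # nothing confident to fuse
    weights = scores[keep] / scores[keep].sum()              # normalize over kept patches
    fused = weights @ patch_tokens[keep]                     # weighted aggregation, shape (d,)
    return cls_token + fused                                 # residual update of CLS
```

With a zero gating vector every score is exactly 0.5, no patch clears the threshold, and the CLS token passes through unchanged, which illustrates the suppression path; a trained gate would instead keep foreground patches and exclude the shortcut-prone background ones.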