Vision Transformers with Self-Distilled Registers

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision Transformers (ViTs) often produce artifact tokens that are inconsistent with local semantics, degrading fine-grained localization and structural consistency. To address this, the paper proposes Post Hoc Registers (PH-Reg), a self-distillation method that adds register tokens to an already pre-trained ViT without full retraining or additional labeled data. Teacher and student are initialized from the same pre-trained ViT: the teacher stays frozen and unmodified, while the student receives randomly initialized register tokens. Test-time augmentation of the teacher's inputs yields denoised dense embeddings, which serve as label-free distillation targets for a small subset of unlocked student weights. Under zero-shot and linear-probe evaluation, PH-Reg reduces the number of artifact tokens and improves segmentation and depth prediction. Unlike training a ViT with registers from scratch, it works with existing pre-trained ViT backbones.

📝 Abstract

Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim to equip them with such register tokens without retraining them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.
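The test-time augmentation step described above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: `toy_teacher` is a hypothetical stand-in for the frozen ViT that returns one scalar per patch and adds an artifact at a fixed grid position, and the choice of cyclic shifts plus horizontal flips as augmentations is an assumption made for illustration. Each augmented view is encoded, the augmentation is undone on the embedding grid, and the aligned embeddings are averaged.

```python
GRID = 4  # toy 4x4 patch grid (assumption for illustration)


def toy_teacher(image):
    """Hypothetical stand-in for a frozen ViT teacher.

    Returns one scalar "embedding" per patch (here simply the input value)
    and adds a large artifact at a fixed grid position, independent of content.
    """
    emb = [row[:] for row in image]
    emb[1][2] += 10.0  # artifact tied to position, not to image content
    return emb


def shift(grid, dy, dx):
    """Cyclically shift a square grid down by dy and right by dx."""
    n = len(grid)
    return [[grid[(i - dy) % n][(j - dx) % n] for j in range(n)] for i in range(n)]


def hflip(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def denoised_embeddings(image, teacher):
    """Average teacher embeddings over augmented views, aligned back to the input."""
    n = len(image)
    total = [[0.0] * n for _ in range(n)]
    views = 0
    for dy in range(n):
        for dx in range(n):
            for flip in (False, True):
                # Augment the input: shift first, then optionally flip.
                view = shift(image, dy, dx)
                if flip:
                    view = hflip(view)
                emb = teacher(view)
                # Undo the augmentation in reverse order on the embedding grid.
                if flip:
                    emb = hflip(emb)
                emb = shift(emb, -dy, -dx)
                for i in range(n):
                    for j in range(n):
                        total[i][j] += emb[i][j]
                views += 1
    return [[v / views for v in row] for row in total]


def max_abs_error(a, b):
    """Largest per-patch deviation between two grids."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


image = [[float(i * GRID + j) for j in range(GRID)] for i in range(GRID)]
single_pass_error = max_abs_error(toy_teacher(image), image)
tta_error = max_abs_error(denoised_embeddings(image, toy_teacher), image)
print(single_pass_error, tta_error)
```

In this toy setting the artifact stays at the same grid position in every view while the content moves, so after alignment it is spread over all patches and the worst-case error drops from 10.0 to 0.625. Plain averaging attenuates the artifact rather than removing it exactly; a real teacher's artifacts are also not guaranteed to be this cleanly position-bound.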

Problem

Research questions and friction points this paper is trying to address.

Mitigate artifact tokens that degrade Vision Transformer performance

Add register tokens to pre-trained ViTs without full retraining

Improve segmentation and depth prediction via self-distilled registers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distilled registers for artifact reduction

Post Hoc Registers without full retraining

Test-time augmentation for denoised embeddings
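The bookkeeping behind the three items above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: the parameter-group names, the particular set of unlocked groups, and the use of cosine distance as the distillation loss are all hypothetical choices, and the transformer encoder itself is omitted, so the student output here is just the patch tokens with the register positions dropped.

```python
import math


def add_registers(patch_tokens, registers):
    """Append register tokens to the patch-token sequence fed to the student."""
    return patch_tokens + registers


def drop_registers(tokens, num_registers):
    """Discard register outputs so only patch tokens enter the loss."""
    return tokens[: len(tokens) - num_registers]


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distillation_loss(student_patches, teacher_targets):
    """Mean per-patch cosine distance to the teacher targets; no labels involved."""
    pairs = list(zip(student_patches, teacher_targets))
    return sum(cosine_distance(s, t) for s, t in pairs) / len(pairs)


# Hypothetical parameter groups of the student; only a small subset is unlocked.
student_params = ["patch_embed", "pos_embed", "blocks.0", "blocks.1", "registers"]
unlocked = {"registers", "pos_embed"}  # assumption: which groups to train
trainable = {name: name in unlocked for name in student_params}

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
registers = [[0.1, -0.2], [0.3, 0.05]]  # stands in for random initialization

tokens = add_registers(patches, registers)
student_out = drop_registers(tokens, len(registers))  # encoder omitted in this sketch

# Stand-in for denoised teacher embeddings (same directions, different scale).
teacher_targets = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
loss = distillation_loss(student_out, teacher_targets)
print(trainable, loss)
```

In an actual training setup the `trainable` flags would decide which weights receive gradient updates, and the loss would be computed between the student's patch embeddings and the denoised teacher embeddings for the same image.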
